This post was written by Adi Lerman, Data Audiences Team Leader at Start.io.
At Start.io, we work with massive amounts of audience data generated from real-time ad-serving signals. Every user can belong to multiple audiences, ranging from demographic groups like “Ages 40–45” to behavioral segments such as “Exotic Destination Travelers,” “Sports Enthusiasts,” or “Luxury Shoppers.” Traditionally, look-alike modeling focuses on finding users who resemble a seed audience based on predefined behavioral features. But we wanted to explore a different direction:
Treating Audiences Like Words
To tackle this challenge, we borrowed ideas from Natural Language Processing (NLP), specifically embedding models inspired by Word2Vec.
In a traditional Word2Vec problem:
- Words appear inside sentences
- Context and order matter
- The model learns semantic relationships between words
Our use case was very different. Instead of words, we used audience IDs. Instead of sentences, we used the collection of audiences assigned to each user. A single user effectively became a “document” composed of audience memberships. Unlike natural language, there is no true order to these audience IDs. A user belonging to “Frequent Travelers” and “Luxury Consumers” carries the same meaning regardless of sequence. Window size and syntax become largely irrelevant. Yet despite breaking many assumptions of traditional NLP, the embedding approach still worked remarkably well.
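To make this setup concrete, here is a small sketch with invented user and audience names. Our production model was Word2Vec-style, but to keep the sketch self-contained with only numpy, it uses a classic count-based stand-in for skip-gram training: positive PMI over audience co-occurrence followed by a truncated SVD. It learns from the same orderless co-occurrence signal, every pair of audiences within a user counts, regardless of sequence:

```python
import numpy as np
from itertools import combinations

# Hypothetical data: each user is an unordered "document" of audience IDs.
users = {
    "u1": ["frequent_travelers", "luxury_shoppers", "exotic_destinations"],
    "u2": ["frequent_travelers", "exotic_destinations"],
    "u3": ["sports_enthusiasts", "gaming"],
    "u4": ["sports_enthusiasts", "gaming", "entertainment"],
    "u5": ["luxury_shoppers", "frequent_travelers"],
}

audiences = sorted({a for doc in users.values() for a in doc})
idx = {a: i for i, a in enumerate(audiences)}
n = len(audiences)

# Co-occurrence counts: order is irrelevant, so every pair inside a user
# counts once in each direction (effectively an infinite window).
C = np.zeros((n, n))
for doc in users.values():
    for a, b in combinations(set(doc), 2):
        C[idx[a], idx[b]] += 1
        C[idx[b], idx[a]] += 1

# Positive PMI: a count-based approximation of skip-gram-style training.
total = C.sum()
row = C.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C * total) / (row @ row.T))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD turns the sparse matrix into dense audience embeddings.
U, S, _ = np.linalg.svd(ppmi)
dim = 2
emb = U[:, :dim] * S[:dim]  # one dense vector per audience
print({a: emb[idx[a]].round(2) for a in audiences})
```

Even on this toy data, travel-related audiences end up near each other in the embedding space while the gaming cluster occupies its own region, which is the behavior described below at scale.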
Learning Behavioral Relationships Between Audiences
By training a Word2Vec-like embedding model on audience co-occurrence patterns, we generated dense vector representations for audience segments. What emerged was surprisingly meaningful. Audiences that frequently appeared together across users became positioned close to each other in embedding space. Over time, clusters naturally formed around broader behavioral or contextual themes.
For example:
- Travel-related audiences grouped together
- Financial and luxury-oriented audiences formed neighboring clusters
- Gaming and entertainment audiences developed their own embedding regions
This gave us something powerful: A numerical representation of behavioral similarity between audiences, learned entirely from large-scale user behavior.
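This similarity can be queried directly from the vectors. In the sketch below the embedding values are invented placeholders rather than real model output; it simply shows how nearest-neighbor audience lookups work once embeddings exist:

```python
import numpy as np

# Hypothetical pre-trained audience embeddings (values invented for illustration).
emb = {
    "frequent_travelers":  np.array([0.9, 0.1, 0.0]),
    "exotic_destinations": np.array([0.8, 0.2, 0.1]),
    "luxury_shoppers":     np.array([0.6, 0.5, 0.1]),
    "gaming":              np.array([0.0, 0.1, 0.9]),
    "entertainment":       np.array([0.1, 0.2, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(name, k=2):
    """Rank all other audiences by cosine similarity to `name`."""
    scores = [(other, cosine(emb[name], v))
              for other, v in emb.items() if other != name]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

print(nearest("frequent_travelers"))
# "frequent_travelers" ranks "exotic_destinations" as its closest neighbor.
```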
Creating the “Average User” of a Seed Audience
Once we had audience embeddings, we could construct look-alike models in a very lightweight but scalable way.
For every user in a seed audience:
- We aggregated the embeddings of all audiences assigned to that user
- This produced a single vector representation of the user’s overall behavioral profile
We then aggregated all user vectors inside the seed audience to create what we internally viewed as the “center of gravity” of the audience: a single vector representing the behavioral fingerprint of the seed audience as a whole.
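The two aggregation steps can be sketched as follows. The embeddings and memberships are invented, and since the aggregation function is not specified above, a simple mean is assumed at both levels:

```python
import numpy as np

# Hypothetical audience embeddings (assumed 3-dimensional for brevity).
audience_emb = {
    "frequent_travelers":  np.array([0.9, 0.1, 0.0]),
    "exotic_destinations": np.array([0.8, 0.2, 0.1]),
    "luxury_shoppers":     np.array([0.6, 0.5, 0.1]),
}

# Invented seed-audience members and their audience memberships.
seed_users = {
    "u1": ["frequent_travelers", "luxury_shoppers"],
    "u2": ["frequent_travelers", "exotic_destinations"],
    "u3": ["exotic_destinations", "luxury_shoppers"],
}

def user_vector(memberships):
    """Aggregate a user's audience embeddings into one profile vector (mean assumed)."""
    return np.mean([audience_emb[a] for a in memberships], axis=0)

# "Center of gravity": aggregate all member vectors into a single centroid.
centroid = np.mean([user_vector(m) for m in seed_users.values()], axis=0)
print(centroid.round(3))  # prints [0.767 0.267 0.067]
```

A mean keeps the centroid in the same space as individual user vectors, which is what makes the cosine-similarity ranking in the next section possible.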
Ranking Look-Alike Candidates
With this audience centroid established, generating look-alike users became straightforward.
For each candidate user outside the seed audience:
- We computed their own aggregated embedding vector
- We measured cosine similarity against the seed audience centroid
- We ranked candidates by their cosine similarity to the audience center
The closer a candidate was to the centroid, the more behaviorally similar they were to the original audience.
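Given a centroid, the ranking step reduces to a few lines. In this sketch all vectors are invented placeholders, and the top-k cutoff stands in for the adjustable expansion threshold mentioned below:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical seed-audience centroid and candidate profile vectors (invented).
centroid = np.array([0.77, 0.27, 0.07])
candidates = {
    "c1": np.array([0.80, 0.25, 0.05]),   # travel-heavy profile
    "c2": np.array([0.40, 0.40, 0.40]),   # mixed profile
    "c3": np.array([0.05, 0.10, 0.90]),   # gaming-heavy profile
}

# Rank candidates: the closer to the centroid, the more similar to the seed.
ranking = sorted(candidates,
                 key=lambda u: cosine(candidates[u], centroid),
                 reverse=True)
top_k = ranking[:2]   # the cutoff controls the scale vs. quality tradeoff
print(ranking, top_k)
```

Raising or lowering the cutoff (or thresholding on the similarity score itself) is what makes the expansion controllable: a tight threshold yields a small, high-affinity look-alike set, while a loose one trades quality for reach.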
This approach gave us something extremely valuable operationally:
- A controllable expansion mechanism
- Adjustable scale vs. quality tradeoffs
- Efficient ranking at very large scale
- A reusable embedding framework across many audience types
Validating the Model
The most important question was whether embedding distance actually correlated with real-world audience affinity, and it did. As candidate users moved farther from the seed audience centroid, we observed measurable declines in behavioral similarity and downstream audience performance metrics. This confirmed that the embedding space was capturing meaningful behavioral structure rather than random mathematical proximity.
From Experiment to Production
What started as an unconventional application of NLP concepts evolved into a production-grade look-alike system generating real business value today. The project demonstrated something we find increasingly important in machine learning:
- Some of the most effective ideas emerge when techniques are applied outside their original domain.
- Language models helped us understand audiences not because audiences are language, but because large-scale behavioral relationships can often be learned the same way semantic relationships are learned in text.
- Sometimes patterns tell the whole story. We leave statistical signals everywhere and then act surprised when machines notice.