This post was written by Adi Lerman, Data Audiences Team Leader at Start.io.
At Start.io, we work with massive amounts of audience data generated from real-time ad-serving signals. Every user can belong to multiple audiences, ranging from demographic groups like “Ages 40–45” to behavioral segments such as “Exotic Destination Travelers,” “Sports Enthusiasts,” or “Luxury Shoppers.” Traditionally, look-alike modeling focuses on finding users who resemble a seed audience based on predefined behavioral features. But we wanted to explore a different direction:
Treating Audiences Like Words
To tackle this challenge, we borrowed ideas from Natural Language Processing (NLP), specifically embedding models inspired by Word2Vec.
In a traditional Word2Vec problem:
- Words appear inside sentences
- Context and order matter
- The model learns semantic relationships between words
Our use case was very different. Instead of words, we used audience IDs. Instead of sentences, we used the collection of audiences assigned to each user. A single user effectively became a “document” composed of audience memberships. Unlike natural language, there is no true order to these audience IDs. A user belonging to “Frequent Travelers” and “Luxury Consumers” carries the same meaning regardless of sequence. Window size and syntax become largely irrelevant. Yet despite breaking many assumptions of traditional NLP, the embedding approach still worked remarkably well.
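To make this setup concrete, here is a small sketch with invented user and audience names. Our production model was Word2Vec-style, but to keep the sketch self-contained with only numpy, it uses a classic count-based stand-in for skip-gram training: positive PMI over audience co-occurrence followed by a truncated SVD. It learns from the same orderless co-occurrence signal, every pair of audiences within a user counts, regardless of sequence:

```python
import numpy as np
from itertools import combinations

# Hypothetical data: each user is an unordered "document" of audience IDs.
users = {
    "u1": ["frequent_travelers", "luxury_shoppers", "exotic_destinations"],
    "u2": ["frequent_travelers", "exotic_destinations"],
    "u3": ["sports_enthusiasts", "gaming"],
    "u4": ["sports_enthusiasts", "gaming", "entertainment"],
    "u5": ["luxury_shoppers", "frequent_travelers"],
}

audiences = sorted({a for doc in users.values() for a in doc})
idx = {a: i for i, a in enumerate(audiences)}
n = len(audiences)

# Co-occurrence counts: order is irrelevant, so every pair inside a user
# counts once in each direction (effectively an infinite window).
C = np.zeros((n, n))
for doc in users.values():
    for a, b in combinations(set(doc), 2):
        C[idx[a], idx[b]] += 1
        C[idx[b], idx[a]] += 1

# Positive PMI: a count-based approximation of skip-gram-style training.
total = C.sum()
row = C.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C * total) / (row @ row.T))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD turns the sparse matrix into dense audience embeddings.
U, S, _ = np.linalg.svd(ppmi)
dim = 2
emb = U[:, :dim] * S[:dim]  # one dense vector per audience
print({a: emb[idx[a]].round(2) for a in audiences})
```

Even on this toy data, travel-related audiences end up near each other in the embedding space while the gaming cluster occupies its own region, which is the behavior described below at scale.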
Learning Behavioral Relationships Between Audiences
By training a Word2Vec-like embedding model on audience co-occurrence patterns, we generated dense vector representations for audience segments. What emerged was surprisingly meaningful. Audiences that frequently appeared together across users became positioned close to each other in embedding space. Over time, clusters naturally formed around broader behavioral or contextual themes.
For example:
- Travel-related audiences grouped together
- Financial and luxury-oriented audiences formed neighboring clusters
- Gaming and entertainment audiences developed their own embedding regions
This gave us something powerful: A numerical representation of behavioral similarity between audiences, learned entirely from large-scale user behavior.
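This similarity can be queried directly from the vectors. In the sketch below the embedding values are invented placeholders rather than real model output; it simply shows how nearest-neighbor audience lookups work once embeddings exist:

```python
import numpy as np

# Hypothetical pre-trained audience embeddings (values invented for illustration).
emb = {
    "frequent_travelers":  np.array([0.9, 0.1, 0.0]),
    "exotic_destinations": np.array([0.8, 0.2, 0.1]),
    "luxury_shoppers":     np.array([0.6, 0.5, 0.1]),
    "gaming":              np.array([0.0, 0.1, 0.9]),
    "entertainment":       np.array([0.1, 0.2, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(name, k=2):
    """Rank all other audiences by cosine similarity to `name`."""
    scores = [(other, cosine(emb[name], v))
              for other, v in emb.items() if other != name]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

print(nearest("frequent_travelers"))
# "frequent_travelers" ranks "exotic_destinations" as its closest neighbor.
```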
Creating the “Average User” of a Seed Audience
Once we had audience embeddings, we could construct look-alike models in a very lightweight but scalable way.
For every user in a seed audience:
- We aggregated the embeddings of all audiences assigned to that user
- This produced a single vector representation of the user’s overall behavioral profile
We then aggregated all user vectors inside the seed audience to create what we internally viewed as the “center of gravity” of the audience: a single vector representing the behavioral fingerprint of the seed audience as a whole.
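The two aggregation steps can be sketched as follows. The embeddings and memberships are invented, and since the aggregation function is not specified above, a simple mean is assumed at both levels:

```python
import numpy as np

# Hypothetical audience embeddings (assumed 3-dimensional for brevity).
audience_emb = {
    "frequent_travelers":  np.array([0.9, 0.1, 0.0]),
    "exotic_destinations": np.array([0.8, 0.2, 0.1]),
    "luxury_shoppers":     np.array([0.6, 0.5, 0.1]),
}

# Invented seed-audience members and their audience memberships.
seed_users = {
    "u1": ["frequent_travelers", "luxury_shoppers"],
    "u2": ["frequent_travelers", "exotic_destinations"],
    "u3": ["exotic_destinations", "luxury_shoppers"],
}

def user_vector(memberships):
    """Aggregate a user's audience embeddings into one profile vector (mean assumed)."""
    return np.mean([audience_emb[a] for a in memberships], axis=0)

# "Center of gravity": aggregate all member vectors into a single centroid.
centroid = np.mean([user_vector(m) for m in seed_users.values()], axis=0)
print(centroid.round(3))  # prints [0.767 0.267 0.067]
```

A mean keeps the centroid in the same space as individual user vectors, which is what makes the cosine-similarity ranking in the next section possible.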
Ranking Look-Alike Candidates
With this audience centroid established, generating look-alike users became straightforward.
For each candidate user outside the seed audience:
- We computed their own aggregated embedding vector
- We measured cosine similarity against the seed audience centroid
- We ranked candidates by their cosine similarity to the audience center
The closer a candidate was to the centroid, the more behaviorally similar they were to the original audience.
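Given a centroid, the ranking step reduces to a few lines. In this sketch all vectors are invented placeholders, and the top-k cutoff stands in for the adjustable expansion threshold mentioned below:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical seed-audience centroid and candidate profile vectors (invented).
centroid = np.array([0.77, 0.27, 0.07])
candidates = {
    "c1": np.array([0.80, 0.25, 0.05]),   # travel-heavy profile
    "c2": np.array([0.40, 0.40, 0.40]),   # mixed profile
    "c3": np.array([0.05, 0.10, 0.90]),   # gaming-heavy profile
}

# Rank candidates: the closer to the centroid, the more similar to the seed.
ranking = sorted(candidates,
                 key=lambda u: cosine(candidates[u], centroid),
                 reverse=True)
top_k = ranking[:2]   # the cutoff controls the scale vs. quality tradeoff
print(ranking, top_k)
```

Raising or lowering the cutoff (or thresholding on the similarity score itself) is what makes the expansion controllable: a tight threshold yields a small, high-affinity look-alike set, while a loose one trades quality for reach.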
This approach gave us something extremely valuable operationally:
- A controllable expansion mechanism
- Adjustable scale vs. quality tradeoffs
- Efficient ranking at very large scale
- A reusable embedding framework across many audience types
Validating the Model
The most important question was whether embedding distance actually correlated with real-world audience affinity, and it did. As candidate users moved farther from the seed audience centroid, we observed measurable declines in behavioral similarity and downstream audience performance metrics. This confirmed that the embedding space was capturing meaningful behavioral structure rather than random mathematical proximity.
From Experiment to Production
What started as an unconventional application of NLP concepts evolved into a production-grade look-alike system generating real business value today. The project demonstrated something we find increasingly important in machine learning:
- Some of the most effective ideas emerge when techniques are applied outside their original domain.
- Language models helped us understand audiences not because audiences are language, but because large-scale behavioral relationships can often be learned the same way semantic relationships are learned in text.
- Sometimes patterns tell the whole story. We leave statistical signals everywhere and then act surprised when machines notice.