This post was written by Adi Lerman, Data Scientist at Start.io.
What can the content you consume tell advertisers about your interests, behavior, and buying intent?
In a recent project, Start.io’s data science team built software that uses large language models (LLMs) to classify webpage content for audience segmentation, ad optimization, and ad-blocking. This technical article goes into detail about how we built the software.
Webpage content categorization serves three goals: improving the quality of our audience segmentation, optimizing our ad-serving algorithm, and serving advertisers who don’t want their ads running next to specific content categories, such as political content.
How we built it
At Start.io, our classification platform is built on a robust web-scraping and content-processing pipeline. Incoming webpages are parsed to extract meaningful textual content and then routed through a multi-stage classification system that combines LLMs, embedding-based similarity scoring, and a retrieval-augmented generation (RAG) architecture.
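As a rough illustration of the parsing step, here is a minimal sketch of how a page might be reduced to the text the classifier sees. It uses requests and BeautifulSoup as stand-ins for our scraping stack; the function name, tag filter, and character limit are illustrative assumptions rather than our production code.

```python
# Hypothetical sketch of the content-extraction step (not our production pipeline).
import requests
from bs4 import BeautifulSoup


def extract_page_text(url: str, max_chars: int = 4000) -> str:
    """Fetch a webpage and reduce it to the visible text passed to the classifier."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that rarely carry meaningful content.
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    # Collapse whitespace and truncate so the text fits comfortably in an LLM prompt.
    text = " ".join(soup.get_text(separator=" ").split())
    return text[:max_chars]
```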
Our testing framework
A core component of our infrastructure is a dynamic evaluation and benchmarking framework designed to optimize configuration performance. Given the complexity of audience classification and the nuances in web content, traditional evaluation metrics are insufficient. Our testing framework assesses each configuration using three key performance indicators:
- Direct Hit Score: Measures the proportion of predictions that exactly match ground-truth labels.
- RAG Inclusion Score: Evaluates the frequency with which the true label appears within the candidate set returned by the RAG module. This reflects the ability of the retrieval system to surface semantically relevant options.
- Embedding Similarity Sum: Computes the cumulative similarity between the predicted label embeddings and the ground-truth label embeddings. This accounts for semantic closeness, acknowledging that some misclassifications (e.g., predicting a semantically similar category) are more tolerable than others (e.g., predicting an entirely unrelated label).
This three-part evaluation enables us to distinguish between critical and non-critical errors, prioritize high-impact improvements, and systematically converge on optimal configurations.
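To make these scores concrete, here is a minimal sketch of how they might be computed over a labeled validation batch, assuming the predictions, ground-truth labels, and per-example RAG candidate sets have already been collected. The sentence-transformers checkpoint is a placeholder for illustration; it is not necessarily the evaluation embedder we use in production.

```python
# Illustrative evaluation sketch; the metric definitions mirror the description above,
# but the model choice and data structures are simplifying assumptions.
from sentence_transformers import SentenceTransformer, util

# Embedder reserved for evaluation, kept separate from the one used inside RAG.
eval_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def evaluate(predictions: list[str],
             truths: list[str],
             rag_candidates: list[list[str]]) -> dict:
    n = len(truths)

    # Direct Hit Score: proportion of exact matches with the ground truth.
    direct_hit = sum(p == t for p, t in zip(predictions, truths)) / n

    # RAG Inclusion Score: how often the true label appears among the retrieved candidates.
    rag_inclusion = sum(t in cands for t, cands in zip(truths, rag_candidates)) / n

    # Embedding Similarity Sum: cumulative semantic closeness of predicted vs. true labels.
    pred_emb = eval_embedder.encode(predictions, convert_to_tensor=True)
    true_emb = eval_embedder.encode(truths, convert_to_tensor=True)
    similarity_sum = util.cos_sim(pred_emb, true_emb).diagonal().sum().item()

    return {
        "direct_hit_score": direct_hit,
        "rag_inclusion_score": rag_inclusion,
        "embedding_similarity_sum": similarity_sum,
    }
```

Running every candidate configuration through the same function makes the trade-offs directly comparable: a configuration with a slightly lower Direct Hit Score but a much higher Embedding Similarity Sum is usually making near-miss errors rather than critical ones.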
Model and configuration comparisons
We rigorously tested a range of LLM and embedding model combinations to identify the most performant and cost-effective setup for our use case.
For LLM inference, we compared open-source models such as Mistral and LLaMA against commercial offerings, including OpenAI’s GPT-4o mini. While the open-source models demonstrated solid performance, GPT-4o mini consistently outperformed them in classification accuracy and generalization, making it our preferred choice for this high-precision use case.
In the RAG component, we evaluated various embedding models for document and query vectorization. Despite the availability of newer commercial options, open-source MPNet embeddings delivered the best results in terms of semantic retrieval precision, efficiency, and inference speed. These embeddings provided robust performance for our domain-specific classification tasks.
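As an example of the retrieval step, the sketch below uses an open-source MPNet sentence-transformer to surface the category labels most semantically similar to a page’s text. The specific checkpoint (all-mpnet-base-v2) and top-k value are assumptions for illustration; in production the label embeddings would be precomputed rather than encoded on every request.

```python
# Hypothetical RAG retrieval step built on open-source MPNet embeddings.
from sentence_transformers import SentenceTransformer, util

retrieval_embedder = SentenceTransformer("all-mpnet-base-v2")


def retrieve_candidates(page_text: str, taxonomy: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k category labels most semantically similar to the page text."""
    label_emb = retrieval_embedder.encode(taxonomy, convert_to_tensor=True)
    page_emb = retrieval_embedder.encode(page_text, convert_to_tensor=True)

    scores = util.cos_sim(page_emb, label_emb)[0]    # similarity of the page to every label
    top = scores.topk(k=min(top_k, len(taxonomy)))   # best-matching labels
    return [taxonomy[i] for i in top.indices.tolist()]
```

The candidate list returned here is what the LLM is asked to choose from, which keeps its answer anchored to the taxonomy rather than free-form text.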
To avoid evaluation bias, we deliberately decoupled the embedding model used for Embedding Similarity Sum from the one used in the RAG module. While MPNet was used in RAG, a different embedding model—chosen specifically for its alignment with the semantic space of our labels—was employed in the similarity evaluation. This separation ensures a more objective and reliable measurement of classification quality, independent of the retrieval model’s embedding characteristics.
We also conducted extensive prompt engineering experiments to maximize LLM performance. This included exploring few-shot learning techniques, varying both the system-level prompt and the user/assistant message structure to guide the model toward more accurate and contextually appropriate outputs. In addition, we introduced custom heuristics to mitigate hallucinations, such as fallback logic for predictions that do not appear in the RAG candidate set. These enhancements significantly improved the robustness and reliability of our pipeline.
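To illustrate the prompt structure and the fallback heuristic, the sketch below builds a few-shot chat prompt and constrains the final answer to the RAG candidate set. The OpenAI client usage is standard, but the prompt wording, the example exchange, and the fallback rule are illustrative assumptions, not our exact production logic.

```python
# Illustrative few-shot classification prompt with a guard against hallucinated labels.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def classify_with_llm(page_text: str, candidates: list[str]) -> str:
    messages = [
        # System prompt restricts the answer space to the RAG candidates.
        {"role": "system",
         "content": "You classify webpages. Answer with exactly one category "
                    f"from this list: {', '.join(candidates)}."},
        # A few-shot user/assistant pair steers the model toward the expected format.
        {"role": "user", "content": "Page: Live scores and transfer rumours from the Premier League."},
        {"role": "assistant", "content": "Sports"},
        {"role": "user", "content": f"Page: {page_text}"},
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    prediction = response.choices[0].message.content.strip()

    # Fallback heuristic: if the model answers outside the candidate set,
    # use the top-ranked retrieved candidate instead of a possibly hallucinated label.
    return prediction if prediction in candidates else candidates[0]
```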
Development and deployment
Model selection and system tuning are continuous processes. Our modular pipeline allows for experimentation with various LLMs, embedding spaces, and RAG strategies. This flexibility is essential for keeping pace with the dynamic nature of web content and audience behavior.
In production, the system operates in a semi-supervised mode, where configuration updates are driven by observed performance shifts, A/B testing outcomes, and periodic re-evaluation against curated validation sets.
Want to work on data science projects at Start.io? We’re hiring! Check out Start.io careers today.