Agentic Company Discovery

Head of Machine Learning · glass.ai · 2023–2024

An in-house pipeline that bootstraps a client-ready sample of companies for a given sector, using LLM labellers and SetFit classifiers on top of glass.ai's web-scraped data.

Background

Every client engagement started the same way: data scientists did subject-specific research and hand-labelling to stand up a preliminary set of companies the client could react to. That first sample was what unlocked useful feedback, because “companies doing X” only became concrete once the client could point at examples and say yes, no, or not quite. The manual effort to produce it was the main bottleneck on how quickly a project could iterate.

The goal was a system that could bootstrap that initial sample automatically. ChatGPT had just launched, and it was the piece that made the whole thing viable. The other ingredients were already in place: glass.ai had mature web-scraping infrastructure and a large base of structured company data, and methods like SetFit had made it practical to train targeted classifiers from very little labelled data. What had been missing was a cheap, general-purpose labeller. Combining LLMs in that role with small classifiers on top of the existing crawl infrastructure was suddenly a credible way to automate the step that had been eating the most analyst time.
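
The division of labour between labeller and classifier can be illustrated with a toy sketch. This is not the production pipeline: `llm_label` is a hypothetical keyword stub standing in for a real LLM call, a bag-of-words centroid scorer stands in for a SetFit classifier, and the company names and descriptions are invented for illustration. It only shows the shape of the idea, where a general-purpose labeller tags a handful of seeds and a small classifier trained on those tags ranks everything else.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def llm_label(description, sector_terms):
    """Hypothetical stand-in for an LLM labeller (assumption, not a real API).

    A real labeller would prompt a model with the sector definition;
    here a keyword match keeps the sketch self-contained.
    """
    return any(token in sector_terms for token in tokenize(description))


def profile(texts):
    """Sum token counts over a group of descriptions."""
    total = Counter()
    for text in texts:
        total.update(tokenize(text))
    return total


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled):
    """Toy classifier (stand-in for SetFit): one profile per class."""
    in_sector = profile(text for text, label in labelled if label)
    out_of_sector = profile(text for text, label in labelled if not label)
    return in_sector, out_of_sector


def score(model, text):
    """Positive when a description sits closer to the in-sector profile."""
    in_sector, out_of_sector = model
    vec = Counter(tokenize(text))
    return cosine(vec, in_sector) - cosine(vec, out_of_sector)


def bootstrap_sample(seeds, candidates, sector_terms):
    """Label the seeds with the stub labeller, then rank the candidates."""
    labelled = [(desc, llm_label(desc, sector_terms)) for desc in seeds.values()]
    model = train(labelled)
    ranked = sorted(
        ((score(model, desc), name) for name, desc in candidates.items()),
        reverse=True,
    )
    return [(name, s) for s, name in ranked]


# Invented example data: four seed companies and two unlabelled candidates.
SEEDS = {
    "Acme Solar": "installs rooftop solar panels and battery storage for homes",
    "Helio Grid": "develops solar farm monitoring software for energy utilities",
    "Bake House": "artisan bakery selling bread and pastries to local cafes",
    "Thread Co": "designs and sells cotton clothing through online stores",
}
CANDIDATES = {
    "SunVolt": "manufactures photovoltaic panels and energy storage systems",
    "Crumb & Co": "wholesale bakery supplying pastries and cakes to cafes",
}

ranked = bootstrap_sample(SEEDS, CANDIDATES, sector_terms={"solar"})
for name, s in ranked:
    print(f"{name}: {s:+.3f}")
```

In this toy data the "SunVolt" description never contains the word "solar", so the keyword stub on its own would reject it, while the classifier trained on the stub's seed labels still ranks it above the bakery. That generalisation from a few cheap labels to the wider crawl is the property the real pipeline relied on.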

Approach

I led the design of a pipeline that turned a rough sector label into a bootstrapped, client-ready sample in a single automated pass.

The system ran in-house. A first deliverable sample that had previously taken a data scientist a few days of focused labelling could be produced by the pipeline in a couple of hours. That was enough of a starting point to get the client feedback loop going, with the shape of the sector determining how much refinement was still needed on top.