Agentic Company Discovery
An in-house pipeline that bootstraps a client-ready sample of companies for a given sector, using LLM labellers and SetFit classifiers on top of glass.ai's web-scraped data.
Background
Every client engagement started the same way: data scientists spent time on subject-specific research and hand-labelling to stand up a preliminary set of companies the client could react to. That first sample was the thing that unlocked useful feedback - “companies doing X” only became concrete once the client could point at examples and say yes, no, or not quite - and the manual effort to produce it was the main bottleneck on how quickly a project could iterate.
The goal was a system that could bootstrap that initial sample automatically. ChatGPT had just launched, and it supplied the piece that made the whole thing viable. glass.ai already had mature web-scraping infrastructure and a large base of structured company data, and methods like SetFit had made it practical to train targeted classifiers from very little labelled data. What had been missing was a cheap, general-purpose labeller. Combining LLMs in that role with small classifiers on top of the existing crawl infrastructure was suddenly a credible way to automate the step that had been consuming the most analyst time.
Approach
I led the design of a pipeline that turned a rough sector label into a bootstrapped, client-ready sample in a single automated pass:
- Prompt generation agents, built with DSPy, expanded a bare sector label (e.g. “maritime”) into classification prompts broad enough to sweep up the obvious core plus the adjacent edges, so the client had room to push back on what was in, what was out, and what additional constraints mattered (location, company size, leadership, and so on).
- Subsector and keyword expansion fanned the sector label out: “maritime” became subsectors like maritime logistics, shipping, ports and terminal operations, and keywords like shipbuilding or offshore renewables. These were run as queries across the web-scraped corpus and internal company data to assemble a broad longlist of candidate companies.
- Dataset construction agents drew a diverse sample from the longlist and had an LLM label it against the generated prompt, iterating until the training set was large enough to fit a classifier.
- SetFit classifiers were trained over multiple folds of the labelled set and then run across the full longlist, with agreement across folds used as a per-candidate confidence score, so that low agreement flagged genuine classifier uncertainty.
- Sample clustering over the produced set stripped out obvious noise before it reached the client and surfaced a diverse subsample covering the different sub-areas, giving the client something representative to react to, and us a structured signal to refine from.
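The fold-agreement step can be illustrated with a minimal sketch. It is not the production code: the function names, the round-robin fold split, and the choice of "fraction of fold models voting for the majority label" as the agreement score are all assumptions made for illustration, and a toy keyword classifier stands in for SetFit so the example runs with the standard library alone.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]            # (company description, 0/1 label)
Classifier = Callable[[str], int]    # maps a description to a predicted label
TrainFn = Callable[[Sequence[Example]], Classifier]


def k_fold_train_indices(n_items: int, k: int) -> List[List[int]]:
    """Round-robin folds: training set `f` leaves out every item with i % k == f."""
    return [[i for i in range(n_items) if i % k != fold] for fold in range(k)]


def fold_agreement(
    labelled: Sequence[Example],
    longlist: Sequence[str],
    train_fn: TrainFn,
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Train one classifier per fold, then score every longlist candidate.

    Returns (majority_label, agreement) per candidate, where agreement is the
    fraction of fold classifiers that voted for the majority label.
    """
    classifiers = [
        train_fn([labelled[i] for i in indices])
        for indices in k_fold_train_indices(len(labelled), k)
    ]
    scored = []
    for text in longlist:
        votes = Counter(clf(text) for clf in classifiers)
        label, count = votes.most_common(1)[0]
        scored.append((label, count / len(classifiers)))
    return scored


def train_keyword_classifier(examples: Sequence[Example]) -> Classifier:
    """Toy stand-in for a SetFit model (assumption): predicts 1 if the text
    contains any word seen only in positive training examples."""
    positive, negative = set(), set()
    for text, label in examples:
        (positive if label == 1 else negative).update(text.lower().split())
    signal = positive - negative
    return lambda text: int(any(word in signal for word in text.lower().split()))


# Hypothetical LLM-labelled training set and longlist for "maritime".
labelled = [
    ("container shipping line", 1),
    ("port terminal operator", 1),
    ("shipbuilding yard", 1),
    ("offshore wind vessels", 1),
    ("high street bakery", 0),
    ("accounting software", 0),
]
longlist = [
    "shipbuilding yard services",
    "port logistics",
    "artisan bakery",
    "container shipping port",
]

results = fold_agreement(labelled, longlist, train_keyword_classifier, k=3)
for text, (label, agreement) in zip(longlist, results):
    print(f"{text!r}: label={label}, agreement={agreement:.2f}")
```

Candidates on which every fold model agrees come back with an agreement of 1.0, while candidates whose label depends on which training examples a fold happened to see score lower. Those lower-scoring candidates are the ones worth routing to review or to a further round of labelling.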
The system ran in-house. A first deliverable sample that had previously taken a data scientist a few days of focused labelling could be produced by the pipeline in a couple of hours. That was enough of a starting block to get the client feedback loop going, with the shape of the sector determining how much refinement was still needed on top.