Agentic Company Discovery

Background

Every client engagement started the same way: data scientists spent time on subject-specific research and hand-labelling to stand up a preliminary set of companies the client could react to. That first sample was the thing that unlocked useful feedback - “companies doing X” only became concrete once the client could point at examples and say yes, no, or not quite - and the manual effort to produce it was the main bottleneck on how quickly a project could iterate.

The goal was a system that could bootstrap that initial sample automatically. ChatGPT had just launched, and it supplied the piece that made the whole thing viable. By then glass.ai already had mature web-scraping infrastructure and a large base of structured company data, and methods like SetFit had made it practical to train targeted classifiers from very little labelled data. What had been missing was a cheap, general-purpose labeller. Using LLMs in that role, with small classifiers on top of the existing crawl infrastructure, was suddenly a credible way to automate the step that had been eating the most analyst time.

Approach

I led the design of a pipeline that turned a rough sector label into a bootstrapped, client-ready sample in a single automated pass:

Prompt generation agents, built with DSPy, expanded a bare sector label (e.g. “maritime”) into classification prompts broad enough to sweep up the obvious core plus adjacent edges - so the client had room to push back on what was in, what was out, and what additional constraints mattered (location, company size, leadership, and so on).
Subsector and keyword expansion fanned the sector label out: “maritime” became subsectors like maritime logistics, shipping, ports and terminal operations, and keywords like shipbuilding or offshore renewables. These were run as queries across the web-scraped corpus and internal company data to assemble a broad longlist of candidate companies.
Dataset construction agents drew a diverse sample from the longlist and had an LLM label it against the generated prompt, iterating until the training set was large enough to fit a classifier.
SetFit classifiers were trained on multiple folds of the labelled set and then run across the full longlist, with agreement across folds serving as a per-candidate confidence score that flagged candidates the classifiers were genuinely uncertain about.
Sample clustering over the produced set stripped out obvious noise before it reached the client and surfaced a diverse subsample covering the different sub-areas - giving the client something representative to react to, and us a structured signal to refine from.
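
The fold-agreement score from the SetFit step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production code: fold_agreement, toy_train, and the example texts are hypothetical names and data, and toy_train is a simple word-count stand-in for fitting a SetFit model, used so that the sketch runs with the Python standard library alone.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical types: a labelled example is (text, label); a trained model is
# any callable that maps a text to a 0/1 label.
Example = tuple[str, int]
Model = Callable[[str], int]


def fold_agreement(
    labelled: Sequence[Example],
    candidates: Sequence[str],
    train_fn: Callable[[Sequence[Example]], Model],
    n_folds: int = 5,
) -> list[tuple[str, int, float]]:
    """Train one model per fold and score each candidate by vote agreement.

    Returns (candidate, majority_label, agreement), where agreement is the
    fraction of fold models that voted for the majority label.
    """
    models = []
    for k in range(n_folds):
        # Hold out every n_folds-th example and train on the rest.
        train = [ex for i, ex in enumerate(labelled) if i % n_folds != k]
        models.append(train_fn(train))

    scored = []
    for text in candidates:
        votes = Counter(model(text) for model in models)
        label, count = votes.most_common(1)[0]
        scored.append((text, label, count / n_folds))
    return scored


def toy_train(train: Sequence[Example]) -> Model:
    """Assumed stand-in for fitting a SetFit model: a word-count classifier."""
    pos, neg = Counter(), Counter()
    for text, label in train:
        (pos if label == 1 else neg).update(text.lower().split())

    def predict(text: str) -> int:
        words = text.lower().split()
        return int(sum(pos[w] for w in words) > sum(neg[w] for w in words))

    return predict


# Illustrative data only; the labels would come from the LLM labelling step.
labelled = [
    ("container shipping line", 1),
    ("retail bank branch", 0),
    ("port terminal operator", 1),
    ("online fashion retailer", 0),
    ("shipbuilding yard", 1),
    ("dental clinic", 0),
    ("offshore vessel charter shipping", 1),
    ("mortgage bank lender", 0),
    ("maritime logistics shipping", 1),
    ("fashion boutique", 0),
]
scored = fold_agreement(
    labelled, ["shipping company", "bank", "shipbuilding"], toy_train
)
for text, label, agreement in scored:
    print(f"{text}: label={label} agreement={agreement:.1f}")
```

In this toy data “shipbuilding” appears in only one labelled example, so the fold that holds that example out votes differently from the rest and the candidate gets a lower agreement score than the other two. Where to draw the line between confident and uncertain is a threshold choice the sketch leaves open.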

The system ran in-house. A first deliverable sample that had previously taken a data scientist a few days of focused labelling could be produced by the pipeline in a couple of hours - enough of a starting point to get the client feedback loop going, with the shape of the sector determining how much refinement was still needed on top.