Custom Loss Functions for Low-Label Classification

Background

Two production problems at glass.ai kept running into the same obstacle: the training data was cheap, but no off-the-shelf loss function did justice to the supervision signal we actually had. Cross-entropy assumes clean negatives. Cosine distance over embeddings gives a similarity score, not a calibrated probability. Published PU losses like nnPU performed poorly on high-dimensional text embeddings. In both cases, the classifier architecture wasn’t the interesting thing - the loss was.

Adversarial PU classification

The first setup was classifying companies from positive and unlabelled data. Positives were cheap to assemble from directories, articles and past deliveries; the unlabelled pool was our internal company database plus anything we wanted to crawl. Standard PU methods underperformed on embedding-space inputs - the off-the-shelf estimators seemed built for lower-dimensional features and didn’t produce a well-calibrated boundary on ours.
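For context, the kind of off-the-shelf estimator being referred to can be sketched as follows. This is a minimal, framework-free version of the non-negative PU (nnPU) risk with a sigmoid surrogate loss. The function names and the `prior` argument (the assumed share of positives in the unlabelled pool) are illustrative choices, not taken from glass.ai's code.

```python
import math

def sigmoid_loss(score, label):
    """Sigmoid surrogate loss: near 0 when sign(score) agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))

def nnpu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk on raw classifier scores.

    prior: assumed fraction of positives hidden in the unlabelled pool.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])

    # The unlabelled pool is a mixture of positives and negatives, so the
    # positives' share is subtracted to estimate the risk on true negatives.
    # The clamp at zero is the "non-negative" correction.
    neg_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + neg_risk

# Illustrative scores only, not real data.
print(nnpu_risk(pos_scores=[2.0, 1.5], unl_scores=[-1.0, 0.3, -2.0], prior=0.3))
```

The estimator needs a class prior as an input and treats the whole unlabelled pool as a noisy negative set.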

The key inspiration was Predictive Adversarial Networks (PAN, Hu et al., 2021) - a paper that recasts PU learning as a GAN-like game. A selector picks candidates from the unlabelled pool that look positive-like enough to fool a discriminator, and the discriminator tries to tell the selector’s picks apart from real positives. The striking thing about PAN wasn’t the architecture - it was that the contribution was really a custom, KL-divergence-based objective function, carefully hand-designed to make the adversarial game work for PU. My takeaway from reading it: in this setting, the loss is doing most of the work.

That suggested a natural question. If the loss is the lever, could we search for it directly, rather than hand-design another variant? I set up a family of truncated Taylor series in the four quantities that obviously mattered - the selector’s and discriminator’s outputs on positive and on unlabelled data - and used a genetic algorithm to evolve the coefficients, scoring each candidate by held-out validation accuracy. Fitness was essentially “which loss actually lets the selector and discriminator converge to a usable boundary.” The GA-evolved loss was stable enough to train the paired models reliably and deploy them in glass.ai’s pipelines.
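The search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the truncation order (`DEGREE = 2`), the GA settings, and all names are hypothetical. `toy_fitness` is only a stand-in. In the real setup, fitness would mean training the selector and discriminator with the candidate loss and measuring held-out validation accuracy.

```python
import itertools
import random

VARS = 4    # selector/discriminator outputs on positive and on unlabelled data
DEGREE = 2  # truncation order: an assumption, not stated in the write-up

# One index tuple per monomial of total degree <= DEGREE,
# e.g. () is the constant term and (0, 2) is x[0] * x[2].
MONOMIALS = [
    combo
    for d in range(DEGREE + 1)
    for combo in itertools.combinations_with_replacement(range(VARS), d)
]

def poly_loss(coeffs, x):
    """Evaluate the polynomial loss at x = (s_pos, s_unl, d_pos, d_unl)."""
    total = 0.0
    for c, combo in zip(coeffs, MONOMIALS):
        term = c
        for i in combo:
            term *= x[i]
        total += term
    return total

def evolve(fitness, generations=30, pop_size=20, elite=5, sigma=0.1, seed=0):
    """Simple elitist GA: keep the best, breed the rest by crossover + mutation."""
    rng = random.Random(seed)
    pop = [[rng.gauss(0, 1) for _ in MONOMIALS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:elite]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover per coefficient, then Gaussian mutation.
            children.append([rng.choice(pair) + rng.gauss(0, sigma)
                             for pair in zip(a, b)])
        pop = parents + children
    return max(pop, key=fitness)

def toy_fitness(coeffs):
    # Stand-in only: rewards coefficients close to an arbitrary target.
    return -sum((c - 0.5) ** 2 for c in coeffs)

best = evolve(toy_fitness)
print(len(MONOMIALS), round(toy_fitness(best), 4))
```

With a real fitness function each evaluation is a full adversarial training run, so caching fitness values per individual would matter far more than it does in this toy version.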

DPO-style preference classifiers

The adversarial system didn’t always train stably, and while working around that I ended up on a related track: loss functions for pairwise signals. LLMs were good at pairwise judgements but too slow and expensive to sit on the hot path - what we wanted was a cheap small-model classifier that captured the LLM’s preferences.

I adapted the Direct Preference Optimization (DPO) loss to train small transformers on LLM-labelled preference pairs. Two deployments came out of it:

Seniority ordering of a company’s staff from their job titles. The LLM labelled pairs like (“VP Engineering”, “Senior Engineer”) once; the trained classifier served pairwise comparisons in the pipeline thereafter.
Topical relevance scoring between companies - “is X or Y more relevant to topic Z” - seeded with LLM-labelled pairs and used as a cheap reranker.

At the time this was an unusual but effective way to distil an LLM’s judgement into a small discriminator, and both models saw production use.
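A DPO-style pairwise objective for a small scoring model might look like the sketch below. The write-up does not spell out the exact adaptation, so this is one plausible reading, labelled as an assumption: full DPO compares policy log-probability ratios against a frozen reference model, whereas here the reference terms are dropped and the model is assumed to emit a single scalar score per item. That leaves a logistic loss on the score margin. The function name, `beta`, and the example scores are hypothetical.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected, beta=1.0):
    """-log sigmoid(beta * (score_preferred - score_rejected)), computed stably.

    Assumes the small model emits one scalar score per item and that the
    LLM has already labelled which item of the pair is preferred.
    """
    margin = beta * (score_preferred - score_rejected)
    # log(1 + exp(-margin)) rewritten so exp() never sees a large positive value.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical model scores for an LLM-labelled seniority pair.
scores = {"VP Engineering": 2.0, "Senior Engineer": 0.5}
loss = pairwise_preference_loss(scores["VP Engineering"], scores["Senior Engineer"])
print(round(loss, 4))
```

The loss is small when the model already ranks the LLM-preferred item higher and grows roughly linearly in the margin when the order is wrong, so confidently mis-ordered pairs contribute the most to the loss.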

Outcome

The result was two classifiers solving real glass.ai problems, with a common thread: a training signal that no standard loss expressed well. One needed to discriminate positives from a messy unlabelled pool without collapsing; the other needed to match an LLM’s pairwise judgement cheaply at runtime. The work made loss engineering - searching over losses, borrowing from adjacent literatures - a viable first-class move rather than a last resort.