Electoral Prediction from Social Media

Background

MSc Data Science thesis at the University of Exeter, supervised by Hywel Williams. The question: given that an enormous volume of political discourse now lives on Twitter - 15 million tweets were posted about the 2019 UK General Election, and 589 of 650 sitting MPs had active accounts - how much of the electorate’s behaviour can actually be recovered from it, and at what spatial resolution?

Existing work had mostly fallen into two camps: studies that declared Twitter could “predict” elections (typically post-hoc, rarely accounting for sample bias), and critics like Gayo-Avello who concluded, flatly, that it couldn’t. Using the 2019 UK election as a case study, I set out to build an honest pipeline - one that treated the demographic mismatch between Twitter and the electorate as a first-class problem, and that operated at the level of individual parliamentary constituencies rather than national vote share.

Pipeline

The dataset was ~7.6 million UK-located tweets collected via the Streaming API across the six-week run-up to polling day, plus retweet histories for ~340k users gathered in a second pass (retweets aren’t location-tagged, so they had to be pulled from user timelines separately).

Probabilistic constituency binning. Twitter location tags are usually bounding boxes, and they often span multiple constituencies (a single “Newcastle” box covers three). I treated each tweet’s constituency membership as a probability distribution proportional to area overlap, aggregated per user into an inferred home constituency, and then supported two downstream aggregation modes: fractional weighting of votes across constituencies, or stochastic sampling across thousands of simulations - the latter producing probability distributions over seat counts rather than point estimates.
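A minimal sketch of the binning idea, not the thesis code. It assumes axis-aligned rectangles for both tweet boxes and constituencies (real constituencies are polygons), and the names `overlap_area`, `membership` and `sample_constituency` are hypothetical. It shows both aggregation modes: the fractional weights themselves, and one random draw per simulation.

```python
import random
from collections import Counter
from typing import Dict, Tuple

# A box is (min_lon, min_lat, max_lon, max_lat). Rectangles stand in for
# constituency polygons here; that is a simplifying assumption.
Box = Tuple[float, float, float, float]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def membership(tweet_box: Box, constituencies: Dict[str, Box]) -> Dict[str, float]:
    """Probability of each constituency, proportional to overlap area."""
    areas = {name: overlap_area(tweet_box, box) for name, box in constituencies.items()}
    total = sum(areas.values())
    if total == 0:
        return {}
    return {name: area / total for name, area in areas.items() if area > 0}


def sample_constituency(probs: Dict[str, float], rng: random.Random) -> str:
    """Stochastic mode: draw a single constituency for one simulation run."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Hypothetical layout: three constituencies tiling a 3x1 strip.
constituencies = {
    "A": (0.0, 0.0, 1.0, 1.0),
    "B": (1.0, 0.0, 2.0, 1.0),
    "C": (2.0, 0.0, 3.0, 1.0),
}
tweet_box = (0.5, 0.0, 2.0, 1.0)  # covers half of A and all of B

probs = membership(tweet_box, constituencies)  # fractional mode
rng = random.Random(0)
draws = Counter(sample_constituency(probs, rng) for _ in range(1000))  # stochastic mode
```

In fractional mode the tweet counts one third towards A and two thirds towards B. In stochastic mode each simulation assigns it wholly to one of the two, and repeating the draw many times is what turns point estimates into distributions.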

Political relevance filter. To strip out the noise (political tweets are a small minority of any given corpus), I hand-labelled ~5,000 tweets via a Jupyter UI and trained an AdaBoost classifier on bag-of-words features. 90% accuracy on held-out data, beating a Naive Bayes baseline by 4 percentage points.

Three parallel methods for inferring a user’s partisan leaning:

Retweet-based. Assemble a comprehensive set of 2,546 political figures across seven parties (candidates from Democracy Club plus scraped MSPs, MSs, and former UK MEPs) and label each user with the party whose figures they most retweeted. The simplest of the three, and - as it turned out - the strongest.
Network-based. Build a retweet graph over politically-filtered tweets (300k nodes, 1.1M edges) and run the Leiden community-detection algorithm seeded with a partition that places each party’s politicians in their own cluster. 50 runs, majority-vote labelling. Inspired by Conover et al.’s finding that retweet topology is strongly partisan.
Text-based. Train a fastText classifier on tweets from users who only ever retweeted a single party (distant supervision), classify each user’s tweets, and aggregate into a single label using the geometric median of per-tweet probability vectors - more robust to outliers than a mean.
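Two of the aggregation steps above are small enough to sketch in plain Python. This is an illustration under assumptions rather than the thesis implementation: the handles and party names are made up, ties in the retweet count are broken arbitrarily, and the geometric median is computed with Weiszfeld's iteration, which the source does not specify.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence


def label_by_retweets(retweeted: Sequence[str], party_of: Dict[str, str]) -> Optional[str]:
    """Retweet-based label: the party whose figures the user retweeted most.

    Returns None if the user never retweeted a known political figure.
    """
    counts = Counter(party_of[h] for h in retweeted if h in party_of)
    return counts.most_common(1)[0][0] if counts else None


def geometric_median(points: List[List[float]], iters: int = 200, eps: float = 1e-9) -> List[float]:
    """Weiszfeld iteration for the point minimising summed Euclidean distance."""
    dim = len(points[0])
    # Start from the arithmetic mean.
    est = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the estimate.
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        new = [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est


# Hypothetical figures and one user's retweet history.
party_of = {"@fig_a": "Labour", "@fig_b": "Labour", "@fig_c": "Conservative"}
label = label_by_retweets(["@fig_a", "@fig_c", "@fig_b", "@unknown"], party_of)

# Hypothetical per-tweet probability vectors over three parties; the last is an outlier.
tweet_probs = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.90, 0.05, 0.05],
    [0.75, 0.15, 0.10],
    [0.00, 0.90, 0.10],
]
median = geometric_median(tweet_probs)
mean = [sum(p[d] for p in tweet_probs) / len(tweet_probs) for d in range(3)]
```

With one outlying tweet, the mean's first component is dragged down to 0.63, while the geometric median stays closer to the cluster of four agreeing tweets.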

Bias correction via simulated annealing. Even with perfect per-user inference, the Twitter population doesn’t match the electorate: it skews younger, more urban, more politically engaged, and measurably more left-wing. I fit a 7-dimensional multiplicative weighting vector (one weight per party) with dual annealing, a simulated-annealing variant, minimising squared error against real constituency results on a random 50% split. Ground-truth vote shares were fed in unnormalised - constituencies where an independent took 40% would otherwise blow out the optimisation.
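The objective being minimised can be sketched as follows. This is a guess at its shape rather than the thesis code: it uses three parties instead of seven, and it assumes the corrected shares are renormalised to sum to 1, which the write-up does not state. The function names `reweight` and `loss` are hypothetical. A function like `loss` could then be handed to a global optimiser such as SciPy's `dual_annealing` with per-weight bounds; the optimiser call is left out here to keep the example dependency-free.

```python
from typing import Dict, List, Sequence


def reweight(shares: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Scale each party's raw share by its weight, then renormalise (assumed)."""
    scaled = [s * w for s, w in zip(shares, weights)]
    total = sum(scaled)
    return [x / total for x in scaled] if total > 0 else list(scaled)


def loss(
    weights: Sequence[float],
    raw: Dict[str, List[float]],
    truth: Dict[str, List[float]],
) -> float:
    """Summed squared error between corrected and real shares over training seats."""
    err = 0.0
    for seat, shares in raw.items():
        corrected = reweight(shares, weights)
        err += sum((c - t) ** 2 for c, t in zip(corrected, truth[seat]))
    return err


# Hypothetical Twitter-inferred shares for two training constituencies.
raw = {"seat1": [0.6, 0.3, 0.1], "seat2": [0.5, 0.4, 0.1]}

# Build toy "ground truth" from a known weight vector so the optimum is known.
true_weights = [0.5, 1.5, 1.0]
truth = {seat: reweight(shares, true_weights) for seat, shares in raw.items()}

uncorrected_loss = loss([1.0, 1.0, 1.0], raw, truth)
optimal_loss = loss(true_weights, raw, truth)
```

One consequence of renormalising in this sketch is that the weights are only identified up to a common scale factor: doubling every weight leaves the corrected shares unchanged.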

Results

The bias-corrected retweet method was the strongest predictor. Across all 631 British constituencies:

Per-party vote share MAE: 0.061
Seat accuracy: 80.2% (506 of 631 seats correctly called)
Top-2 seat accuracy: 98.4% - only 10 seats had the real winner outside the model’s top two predictions; 7 of those 10 were Liberal Democrat
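For concreteness, the three metrics can be computed along these lines. The sketch assumes MAE is averaged over every (constituency, party) pair, which is one reasonable reading of "per-party vote share MAE"; the data and the `evaluate` helper are hypothetical.

```python
from typing import Dict

Shares = Dict[str, float]  # party -> vote share


def evaluate(pred: Dict[str, Shares], truth: Dict[str, Shares]) -> Dict[str, float]:
    """Vote-share MAE, seat accuracy and top-2 seat accuracy over all seats."""
    abs_errors = []
    correct = 0
    in_top2 = 0
    for seat, real in truth.items():
        guess = pred[seat]
        abs_errors.extend(abs(guess.get(party, 0.0) - share) for party, share in real.items())
        winner = max(real, key=real.get)
        ranked = sorted(guess, key=guess.get, reverse=True)
        correct += ranked[0] == winner       # model's top pick is the real winner
        in_top2 += winner in ranked[:2]      # real winner is in the model's top two
    n = len(truth)
    return {
        "mae": sum(abs_errors) / len(abs_errors),
        "seat_accuracy": correct / n,
        "top2_accuracy": in_top2 / n,
    }


# Hypothetical two-seat example.
truth = {
    "seat1": {"A": 0.50, "B": 0.30, "C": 0.20},
    "seat2": {"A": 0.20, "B": 0.35, "C": 0.45},
}
pred = {
    "seat1": {"A": 0.45, "B": 0.35, "C": 0.20},
    "seat2": {"A": 0.25, "B": 0.40, "C": 0.35},
}
metrics = evaluate(pred, truth)
```

Here seat1 is called correctly and seat2 is not, but the real winner of seat2 is the model's second choice, so top-2 accuracy is 100% while seat accuracy is 50%.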

Under stochastic aggregation across 1,000 simulations - with the bias-correcting vector re-learned every 10 runs - every single run produced a Conservative outright majority, with the seat-count distributions for both major parties peaking close to the real outcome.

Null-model validation. To check that bias correction wasn’t silently doing all the work, I ran two null models before applying it: one with 50% of user locations randomly swapped, another with 50% of ground-truth seat results swapped. Both degraded every metric. The bias-correction step is amplifying real localised signal in Twitter data, not inventing it.
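The location-swap null model could look something like the sketch below. It assumes "swapped" means permuting home constituencies among a randomly chosen half of users, which keeps the overall number of users per constituency fixed; the thesis may have done this differently, and `swap_locations` is a hypothetical name.

```python
import random
from typing import Dict


def swap_locations(home: Dict[str, str], fraction: float, rng: random.Random) -> Dict[str, str]:
    """Shuffle home constituencies among a random fraction of users."""
    users = sorted(home)  # sorted for reproducibility under a fixed seed
    chosen = rng.sample(users, int(len(users) * fraction))
    shuffled = [home[u] for u in chosen]
    rng.shuffle(shuffled)
    out = dict(home)
    out.update(zip(chosen, shuffled))
    return out


# Hypothetical users and inferred home constituencies.
home = {f"user{i}": f"constituency{i % 4}" for i in range(10)}
null_home = swap_locations(home, 0.5, random.Random(42))
changed = [u for u in home if home[u] != null_home[u]]
```

The pipeline is then re-run on `null_home`. If the metrics do not degrade, the model is not really using where users are.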

Red Wall analysis. Of the 44 northern and midlands constituencies Labour lost in 2019 - the “Red Wall” - the model called 73% correctly, and median computed Labour vote share in these seats sat between the Labour-win and Conservative-win medians, matching what actually happened politically.

Honest limits

Bias correction fits against ground-truth vote shares, so this is analysis, not live prediction - a caveat Coletto et al. acknowledge in their own work. The weighting vector should be reasonably stable absent major demographic shifts, and could plausibly be seeded from traditional polling or a prior election, but that’s future work rather than a finished claim.

The network-based method struggled in a multi-party context. The Brexit Party and Conservatives collapsed into a single cluster (their retweet patterns were too similar); the cluster labelled Plaid Cymru turned out to be mostly Welsh Labour figures. Methods designed around binary US-style partisan systems don’t transplant cleanly to multi-party regional politics - itself an interesting finding.

Outcome

Graded 77% with a publication recommendation. More importantly, it kicked off an interest in computational social science and applied NLP that has threaded through my work since.

Full thesis (50 pages): “A Bird’s Eye View: To what extent can we socially sense local attitudes to political parties using Twitter?” - C. Tyson, University of Exeter, 2020.