Five AIs predict the World Cup
The World Cup rolls around every four years, and with it the office sweepstake, the ritual where everyone gets handed a team at random and pretends to care how it does. I have no interest in football (I had to read a beginner’s guide to how it all works just to follow what’s going on), and I definitely don’t know enough to make an informed prediction of my own. So I thought I’d hand the job to five AI models instead, get them to commit their predictions in public, and keep score.
The result is AIWC26, a page that shows every group-stage prediction from five models side by side, scores them against the real results as they come in, and ranks the models in a league table. It looks like Ceefax, because of course it does.

The setup
Each model got an identical prompt. Here are the 72 group-stage fixtures, predict every scoreline, consider historical World Cup performance, recent form, squad make-up and venue factors, and return strict JSON. The AI entrants were Claude Fable 5 and Claude Sonnet 4.6 (Anthropic), GPT-5.5 (OpenAI), and Gemini 3.1 Pro and Gemini 3.5 Flash (Google).
There’s also a sixth entry that isn’t an AI at all, the Ranking Bot. It knows two things, the FIFA world rankings from April and one rule. Teams within ten places of each other draw 1-1, otherwise the higher-ranked team wins 2-1. Twenty lines of Python, no model, no judgement, fully deterministic, anyone can regenerate its picks and get an identical file. It exists to answer the question I actually cared about. Can five models beat a lookup table?
Scoring is the classic sweepstake format, 3 points for the exact score, 1 point for the right result, nothing otherwise.
All six entries were committed to the repo before today’s opening match, and the git history is the audit trail. Nobody gets to quietly revise their picks after Mexico go 2-0 up.
The build
The site is Jekyll, deployed with GitHub Actions. A custom plugin does the scoring at build time, so the served HTML contains the full tables and works with JavaScript disabled (JS only handles the clock and the group filter).
Entering a result is a one-liner:
$ tools/results.py set A1 2 0
Then commit, push, and the Actions workflow rebuilds and deploys. The raw JSON stays published under /data/ for anyone who fancies their own analysis.
What the picks already tell us
The predictions are locked, so before a ball is kicked I can say what the models think is going to happen. The short version is nothing surprising, ever.
Across all 360 predictions there is not a single major upset. No seed loses to a debutant. Nothing on the scale of Saudi Arabia beating Argentina in 2022, which actually happened, in the opening round, to the eventual champions. The boldest call anywhere on the page is both Gemini models taking Ghana, ranked 74th, to beat 33rd-ranked Panama, which is roughly as spicy as preferring one flavour of rice cake to another.
The strangest consensus on the page is Belgium. All five models hand them a perfect nine points, and so does the Ranking Bot, working purely off the April list that still has them ninth in the world. The press is less sure: previews tend to file Belgium under faded golden generation, big names past their best, a dark horse at most. The gap is the interesting part. The models are not arguing with the rankings, they are inheriting them, and the rankings have been slow to mark down a side that sat at the top of them for years. Five models sounding certain about Belgium is less five opinions than one lagging dataset in five voices, which is the consensus problem the whole page is about.
The consensus runs deeper than individual fixtures. Because the knockout bracket is predetermined, each model’s 72 picks imply complete group tables, and from there an entire last 32 (an extra knockout round added by this year’s 48-team expansion, feeding into the round of 16 as before). Across the five models, 30 of the 32 implied qualifiers are identical. Three teams, Bosnia, Paraguay and Türkiye, are left arguing over the last two seats. Whatever these models are doing, “five independent opinions” isn’t it.
Then there’s the draws problem. A football fan friend looked at the picks and said there’s no way there are enough draws in them. His reasoning is that with the new 48-team format, the top two from each group plus the eight best third-placed teams advance. Two thirds of the field survives the groups. A draw is almost never a bad result, and in the final round it can suit both teams at once. He expects a cagey, draw-heavy group stage, like Euro 2016 produced under the same kind of safety net.
The models, fixture by fixture, mostly disagree.
| Model | Predicted draws (of 72) | Rate |
|---|---|---|
| GPT-5.5 | 21 | 29% |
| Claude Sonnet 4.6 | 13 | 18% |
| Claude Fable 5 | 12 | 17% |
| Gemini 3.5 Flash | 12 | 17% |
| Gemini 3.1 Pro | 10 | 14% |
World Cup group stages have historically run at roughly a quarter draws, before you factor in the new format’s incentives. So most of the field is underweight. The odd part is that the models clearly know draws are common (1-1 is joint-top of all 360 predicted scorelines, level with 2-0), they just don’t apply that base rate when you ask about matches one at a time.
The friend who spotted the draws problem declined to put his own picks on the record, which is fair enough, it’s my website and his reputation.
The bot disagrees with all five models on exactly five fixtures, including calling Brazil v Morocco a draw (they are ranked 6th and 8th, two places apart) and backing Senegal to beat Norway outright. And in a twist I enjoyed, no AI model predicts fewer draws than the bot does: its 10 of 72 (14%) is only matched by Gemini 3.1 Pro at the bottom of the table. So if the draw-heavy theory is right, everyone on this page is wrong together, and the theory’s author is watching from the safety of not having entered.
What happens next
Results go in as the group stage plays out, the table updates, and the colours fill in. Predictions for fixtures that haven’t kicked off yet stay white, exact scores go green, right results go cyan, and wrong ones go red, which I expect to be the dominant colour by the end of June.
I’ll write up the verdict after the group stage finishes on the 27th. Until then the table is at blog.omgmog.net/AIWC26, updating slightly slower than the vidiprinter.