Forecasts who wins the World Cup by simulating the rest of the tournament thousands of times from current team ratings. Updates automatically as real results come in. Per-match predictions and a self-grading ledger live in the tabs below.
Each run plays the rest of the tournament forward from current ratings.
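For the curious, here is a minimal sketch of what "playing the tournament forward" can look like. It is not the app's actual code: it assumes a plain single-elimination bracket with a power-of-two field, ignores groups, draws, and form, and the names `win_prob`, `simulate_bracket`, and `title_odds` are made up for illustration.

```python
import random
from collections import Counter

def win_prob(ra: float, rb: float) -> float:
    """Elo-style chance that the team rated ra beats the team rated rb."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def simulate_bracket(teams, ratings, rng):
    """Play one knockout bracket to the end and return the champion.

    Assumption: len(teams) is a power of two and adjacent entries meet.
    """
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[::2], alive[1::2]):
            a_wins = rng.random() < win_prob(ratings[a], ratings[b])
            next_round.append(a if a_wins else b)
        alive = next_round
    return alive[0]

def title_odds(teams, ratings, runs=10_000, seed=0):
    """Share of simulated tournaments each team wins."""
    rng = random.Random(seed)  # seeded so repeated calls agree
    wins = Counter(simulate_bracket(teams, ratings, rng) for _ in range(runs))
    return {t: wins[t] / runs for t in teams}
```

Calling `title_odds(["A", "B", "C", "D"], {"A": 2000, "B": 1500, "C": 1500, "D": 1500})` returns one title share per team; more runs give a smoother estimate at the cost of time.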
Enter a finished score. If a locked prediction exists for this match, it gets graded first, then the model learns. No prediction on file? It still learns — it just won't count toward pre-match accuracy.
At the end of each day's matches, the model locks in the team it favors to win the World Cup, building a timeline of how that prediction shifted across the tournament. Each entry is frozen in the shared file, so it's the same for everyone.
Every prediction here was frozen by the model before kickoff and saved to the shared results file — so this record is identical for everyone and can't be edited after the fact. It's the model's public scorecard.
Standings from logged results. Top 2 of each group (green) advance automatically; the 8 best third-place teams also go through, so the green rows aren't the final list of qualifiers.
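A rough sketch of the advancement rule described above. The ordering by points, then goal difference, then goals scored is an assumption made for illustration (the real tie-breakers may differ), and `qualifiers` and the row fields `pts`, `gd`, `gf` are hypothetical names.

```python
def qualifiers(groups, best_thirds=8):
    """Return the teams that advance: top 2 per group plus the best thirds.

    groups maps a group name to a list of rows such as
    {"team": "X", "pts": 6, "gd": 3, "gf": 5}.
    """
    def rank_key(row):
        # Assumed ordering: points, then goal difference, then goals scored.
        return (row["pts"], row["gd"], row["gf"])

    through, thirds = [], []
    for table in groups.values():
        ranked = sorted(table, key=rank_key, reverse=True)
        through += [row["team"] for row in ranked[:2]]
        if len(ranked) > 2:
            thirds.append(ranked[2])

    # Third-place teams are compared across groups.
    thirds.sort(key=rank_key, reverse=True)
    through += [row["team"] for row in thirds[:best_thirds]]
    return through
```

With 12 groups this yields 24 automatic qualifiers plus 8 third-place teams, which is the 32-team knockout field.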
The field is the real 48 teams. Add, remove, or rename teams and tune starting ratings if needed.
The rating model (the “ML” core). Every team carries a strength number. To predict, the model feeds the rating gap through a logistic curve, P(A wins) = 1 / (1 + 10^((Rb − Ra)/400)), the same math as chess Elo. A draw probability is then carved out of that, larger when the teams are evenly matched. At a neutral venue there is no home bump; otherwise team A gets +60 rating points.
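To make the curve concrete, here is a small sketch. The logistic formula and the +60 home bump come from the description above. The draw carve-out is only an assumption, since the exact rule isn't stated here: this version peaks at a made-up `max_draw` of 0.28 when teams are even and shrinks as the gap grows. `expected_score` and `outcome_probs` are hypothetical names.

```python
def expected_score(ra: float, rb: float, neutral: bool = True,
                   home_bump: float = 60.0) -> float:
    """P(A wins) = 1 / (1 + 10^((Rb - Ra) / 400)), with an optional home bump for A."""
    ra_adj = ra if neutral else ra + home_bump
    return 1.0 / (1.0 + 10 ** ((rb - ra_adj) / 400.0))

def outcome_probs(ra: float, rb: float, neutral: bool = True,
                  max_draw: float = 0.28) -> dict:
    """Split the expected score into win / draw / loss probabilities.

    Assumed carve-out (not the app's confirmed rule): the draw share is
    largest at an even matchup and is taken equally from both sides.
    """
    e = expected_score(ra, rb, neutral)
    draw = max_draw * (1.0 - abs(2.0 * e - 1.0))
    return {"a": e - draw / 2.0, "draw": draw, "b": (1.0 - e) - draw / 2.0}
```

Two equal teams at a neutral venue come out at 0.5 on the curve; a 400-point favorite comes out at about 0.91 before the draw share is removed.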
The pre-match ledger. When you lock a prediction, the app stamps it with the probabilities and the time, and parks it as pending. Nothing about the result is known yet. When the score lands, that locked call is graded — was the favorite right, and what was its Brier score — and only those graded, ahead-of-time calls feed the pre-match hit rate and pre-match Brier at the top. That's the number that actually tells you if the model predicts well.
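Here is one way the grading step could be written. It assumes the multi-outcome form of the Brier score (squared error summed over win, draw, and loss, so 0 is perfect and 2 is the worst case); the app may use a different variant. `brier`, `grade`, and `summarize` are illustrative names only.

```python
def brier(probs: dict, outcome: str) -> float:
    """Sum of squared gaps between the locked probabilities and what happened."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in probs.items())

def grade(locked_probs: dict, outcome: str):
    """Grade one locked call: was the favorite right, and its Brier score."""
    favorite = max(locked_probs, key=locked_probs.get)
    return favorite == outcome, brier(locked_probs, outcome)

def summarize(graded):
    """Pre-match hit rate and mean Brier over a non-empty list of graded calls."""
    n = len(graded)
    hit_rate = sum(1 for hit, _ in graded if hit) / n
    mean_brier = sum(score for _, score in graded) / n
    return hit_rate, mean_brier
```

Lower Brier is better: a confident call that comes true scores near 0, while a confident call that misses is punished much harder than a cautious one.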
The learning loop (the “RL” part). After grading, the model shifts both ratings toward the truth: Δ = K · margin · (actual − expected). Upsets move ratings hard; expected results barely nudge them. A 4–0 teaches more than a 1–0; knockouts can weigh heavier.
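A small sketch of that update, Δ = K · margin · (actual − expected). Several pieces are assumptions for illustration, not the app's real settings: K = 20, scoring a win as 1, a draw as 0.5, and a loss as 0, and the made-up `margin_multiplier` that grows by 0.25 per goal beyond the first. A knockout match could simply pass a larger `k`.

```python
def expected_score(ra: float, rb: float) -> float:
    """Elo-style expected score for team A."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def margin_multiplier(goal_diff: int) -> float:
    """Assumed scaling: 1.0 for a draw or one-goal game, +0.25 per extra goal."""
    return 1.0 + 0.25 * max(0, abs(goal_diff) - 1)

def update(ra: float, rb: float, goals_a: int, goals_b: int, k: float = 20.0):
    """Return both teams' new ratings after a result."""
    if goals_a > goals_b:
        actual = 1.0
    elif goals_a == goals_b:
        actual = 0.5
    else:
        actual = 0.0
    delta = k * margin_multiplier(goals_a - goals_b) * (actual - expected_score(ra, rb))
    # Whatever A gains, B loses, so total rating is conserved.
    return ra + delta, rb - delta
```

Under these assumptions, two 1500-rated teams move by 10 points after a 1–0 and by 17.5 after a 4–0, and an underdog's win moves the ratings several times further than the favorite's win would.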
Recent form (momentum). On top of the base rating, the model tracks each team's last few games and adds a bounded form nudge. Crucially, it's opponent-adjusted: drawing Spain and Uruguay says far more than beating a minnow, because only surprising results build form, and an expected win adds almost nothing. The newest three games weigh heaviest, the window caps at five games, and goal difference scales the effect. A single game can't max it out; sustained form can. That's why a team on a real run against strong opposition can be favored over a higher-rated side that's been flat.
Pre-loaded. Ships trained on the real round-1 group results, so ratings already reflect what's happened. Those seed results are learned from but not counted as pre-match calls, because no prediction was locked before they were played.
Honest limits. It only sees ratings — not injuries, rotation, rest, or tactics — so it won't beat a Vegas book. It's a transparent learner you can watch reason and hold accountable.