Markets vs the experts

The claim that a crowd with money beats credentialed forecasters isn't a vibe — it's one of the better-documented findings in social science. The evidence, and its limits.

Earlier in this series I made a strong claim and waved at the evidence: a liquid prediction market tends to forecast better than polls, expert panels, and statistical models.¹ That's the kind of sentence that sounds like marketing. A crowd of anonymous traders out-predicting credentialed analysts with decades of training? It has the shape of a too-good story.

So this piece pays the debt. I want to put the actual research on the table — what's been measured, by whom, against what — and then, just as carefully, draw the lines around it. Because the honest version of the claim is more interesting than the slogan. Markets do tend to win. The margin is real, repeatable, and modest — and it comes apart under specific, knowable conditions. Knowing where the edge lives is worth more than cheering for it.

The interesting question was never "are experts useless." It's "what reliably beats them, and by how much."

First, the case against the expert

Start with the credentialed forecaster, because that's the benchmark everyone assumes is strong. It isn't — and we know that from one of the most patient experiments in the history of social science.

Over roughly two decades, the psychologist Philip Tetlock collected something like 82,000 probabilistic forecasts from 284 experts — political scientists, economists, intelligence professionals, the people TV bookers call when something happens. He then waited for the questions to resolve and graded them. The finding, published in Expert Political Judgment in 2005, is now famous: the average expert was barely better than chance at assigning probabilities to political and economic outcomes, and on many questions was beaten by simple statistical rules — even by a naïve baseline that just extrapolated the recent past.² Tetlock's most-quoted line is that the median expert was "roughly as accurate as a dart-throwing chimpanzee." The barb is unfair to the best of them, but the distribution is the point.

Two patterns inside the data matter more than the punchline. First, confidence and accuracy were essentially unrelated — the experts who were most certain, and most in demand on television, were not the most accurate; if anything, fame correlated with overconfidence. Second, cognitive style mattered more than credentials: Tetlock's "foxes," who held many small models and updated readily, beat his "hedgehogs," who explained everything through one big idea. The lesson isn't that expertise is worthless. It's that a single expert, however decorated, is a high-variance instrument that nobody is grading — and an ungraded forecaster drifts toward confident and wrong.

Then it got weirder

The natural objection: fine, pundits are overconfident, but that's a soft domain. Put real professionals on a hard question with real stakes and the gap closes. It was tested. It didn't.

Starting in 2011, IARPA, the research agency of the U.S. intelligence community, funded a multi-year tournament to find out what actually improves forecasting. Several university teams competed to predict geopolitical events: coups, elections, currency moves, conflicts. Tetlock and Barbara Mellers ran one of them, the Good Judgment Project, and it won so decisively that the others were dropped.³ The detail that should stop you: a few hundred of GJP's best volunteers (a retired pharmacist, a homemaker, a software engineer doing this as a hobby) were measurably more accurate than professional intelligence analysts working the same questions with access to classified information. Working from open-source data alone, the civilians reportedly beat the analysts by a margin of around 30%.³

How? Two ingredients, and you'll recognize both because they're exactly what a market supplies for free. The superforecasters were relentless updaters — they moved their estimates in small increments as news arrived instead of planting a flag and defending it. And the project aggregated across many forecasters, then sharpened the result, so no single person's blind spot survived contact with the group. Skill plus aggregation plus updating. Hold that thought.

The throughline

Tetlock's two decades of evidence converge on the same mechanism, whether the forecaster is a pundit, an analyst, or a hobbyist: aggregate many views, weight the better ones, and update continuously. A prediction market is a machine that does all three by construction — which is why the next section is short. The hard part was believing the mechanism beats the expert. The data already said it does.
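To make that mechanism concrete, here is a minimal Python sketch. It is not the Good Judgment Project's actual algorithm. It assumes one common approach: pool the forecasts as a weighted average in log-odds space, then "sharpen" the pool by extremizing, which means multiplying the pooled log-odds by a factor above 1. The function names, the sample forecasts, the weights, and the factor of 2.0 are all illustrative assumptions.

```python
import math

def logit(p):
    """Probability -> log-odds. Assumes 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(probs, weights=None, extremize=1.0):
    """Weighted mean of forecasts in log-odds space, then pushed away
    from 50% by the (assumed) extremizing factor."""
    if weights is None:
        weights = [1.0] * len(probs)
    total = sum(weights)
    mean_log_odds = sum(w * logit(p) for p, w in zip(probs, weights)) / total
    return inv_logit(extremize * mean_log_odds)

# Hypothetical forecasts for one yes/no question.
forecasts = [0.60, 0.70, 0.65, 0.75]

plain = aggregate(forecasts)                          # aggregate many views
weighted = aggregate(forecasts, weights=[1, 1, 2, 2]) # weight the better ones
sharpened = aggregate(forecasts, extremize=2.0)       # sharpen the pool

# "Update continuously" is just re-running aggregate() whenever a forecast moves.
print(round(plain, 3), round(weighted, 3), round(sharpened, 3))
```

The design point is that the pool lands near 68% while the sharpened version lands above 80%: if four people with partly different information each lean the same way, the combined evidence is stronger than any one of them states. How hard to push is an empirical choice, which is why the factor is a parameter rather than a constant.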

Now point it at markets

The Good Judgment work is about people forecasting. A market is the same idea with the aggregation and the updating welded into the price — and with money deciding whose view counts. So does the welded-together version actually beat the alternatives? The cleanest long-run evidence comes from elections, because elections resolve unambiguously and we have decades of them.

The Iowa Electronic Markets, a real-money exchange the University of Iowa has run for research since 1988, is the canonical dataset. Across a large set of U.S. elections, on most days of most campaigns, the IEM's closing prices were closer to the eventual result than the contemporaneous opinion polls.⁴ In the studied presidential races, the market beat the poll the clear majority of the time, and (this is the part pollsters dislike) its advantage was largest far from election day, when polls are noisiest and least predictive. The market wasn't running a secret model. It was letting people with skin in the game price the polls, discount the noise, and re-quote continuously.

And the prices weren't just directionally right; they were well-calibrated as probabilities. When this kind of market says 70%, the thing happens about 70% of the time, which is the property an earlier part of this series was devoted to. If you want the machinery for grading that claim (calibration curves, the Brier score, why a market's number is honest rather than just confident), it's in that part: A price is a probability.⁵ For this piece, take the result: liquid markets clear the bar that experts mostly fail. Their probabilities mean what they say.
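For readers who want the grading in miniature, here is a small Python sketch of the two checks named above: the Brier score (mean squared gap between forecast and outcome, lower is better) and a calibration table (for each price bucket, how often the event actually happened). The contracts and outcomes are invented for illustration, and the function names are mine, not from any library.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width buckets and return, per bucket,
    (mean forecast, observed frequency, number of forecasts)."""
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        idx = min(int(f * n_bins), n_bins - 1)  # keep f == 1.0 in the top bucket
        bins[idx].append((f, o))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(f for f, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        table.append((mean_forecast, hit_rate, len(pairs)))
    return table

# Hypothetical data: 20 contracts priced at 0.75 (15 resolved yes)
# and 20 contracts priced at 0.25 (5 resolved yes).
prices = [0.75] * 20 + [0.25] * 20
resolved = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15

print(brier_score(prices, resolved))
for mean_forecast, hit_rate, count in calibration_table(prices, resolved):
    print(mean_forecast, hit_rate, count)
```

In this toy dataset the observed frequency in each bucket equals the price, which is what "well-calibrated" means. Note that a quarter of the 0.75 contracts still resolved no; calibration is a statement about the bucket, not about any single contract.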

Fig 1 — relative forecasting accuracy · illustrative · ranking after Tetlock (2005), Mellers et al. (2014), Berg et al. (2008)

Why the market wins

None of this is luck, and none of it is magic. Four structural properties do the work — the same four that made the superforecasters good, except a market enforces them automatically instead of relying on unusually disciplined humans.

It aggregates dispersed knowledge. This is the deepest reason, and it's old. In 1945 Friedrich Hayek argued that the knowledge a society needs is never concentrated in one mind — it's scattered as "dispersed bits of incomplete and frequently contradictory knowledge" across thousands of people, and a price is the device that pulls those fragments into one public number.⁶ A prediction market is that machine aimed at the future: every trader stakes their private sliver, and the price collapses all of it into a single estimate that no individual participant could have produced alone. I unpacked this fully in the flagship — Why prediction markets matter — but it's worth restating here because it's the whole reason the crowd can beat the credential. The expert knows one big thing. The market knows a little of everyone's thing.

Skin in the game filters cheap talk. A poll answer is free, and a wrong one costs nothing. A trade is a position, and being wrong costs money. That single difference is a filter: it discounts the confident-but-uninformed and rewards the people who've actually done the work, because they're the ones willing to bet size. A pundit can be wrong on air for a decade and keep the gig. A trader who's wrong goes broke and stops moving the price. The market doesn't ask who has credentials; it asks who's willing to pay to be heard, and then makes them pay if they're wrong.

It updates continuously. A poll is a snapshot taken every few days; a committee memo is a snapshot taken once. A market re-prices the instant the world changes, because the first trader to understand a piece of news has a profit waiting if they act before everyone else. Forecasting accuracy is mostly about not being stale, and a liquid market is the only forecaster that, by construction, doesn't go stale.

No single point of failure. Every individual forecaster has blind spots that recur across their calls: a worldview, a bad week, an ego invested in last month's call. Averaging across many independent-ish errors cancels a lot of them; the aggregate is more robust than any term in it. This is just the mathematics of aggregation, and it's why "the crowd" so often beats "the star." The market has no ego to defend and nothing to be embarrassed about. It only has a price.
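The "mathematics of aggregation" can be written down in one line, under a textbook simplification: every forecaster is unbiased, every error has the same standard deviation, and every pair of errors shares the same correlation rho. Then the variance of the average is sigma² × (rho + (1 − rho) / n). The Python sketch below just evaluates that formula; the numbers fed into it are illustrative assumptions, not measurements.

```python
import math

def crowd_error_sd(individual_sd, n, rho):
    """Standard deviation of the error of an equal-weight average of n
    forecasts, each with error sd `individual_sd` and pairwise error
    correlation `rho` (assumed identical for every pair)."""
    variance = individual_sd ** 2 * (rho + (1.0 - rho) / n)
    return math.sqrt(variance)

# Hypothetical: each forecaster is off by about 0.20 on the probability scale.
for n in (1, 10, 100, 1000):
    independent = crowd_error_sd(0.20, n, rho=0.0)
    correlated = crowd_error_sd(0.20, n, rho=0.3)
    print(n, round(independent, 4), round(correlated, 4))
```

With fully independent errors, a hundred forecasters cut the typical error from 0.20 to about 0.02. With a pairwise correlation of 0.3 the error stalls near 0.11 no matter how many people you add. That is why the "independent-ish" qualifier carries so much weight: a crowd that shares one blind spot averages out very little.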

Fig 2 — why it works: dispersed signals → skin-in-the-game weighting → one number sharper than its inputs

Where markets don't win

Here's the part the slogan leaves out, and it's the part I'd actually argue about. A market is not an oracle. It's a specific mechanism with specific failure modes, and the same evidence that vindicates it also fences it in. If you only remember the cheerleading, you'll trust a price exactly when you shouldn't.

Thin liquidity breaks everything. Every result above carries the same qualifier — when the market is liquid enough to function. The edge is a property of depth, not of the format. On a market with three traders and ten dollars of volume, the "price" is one person's hunch wearing a probability costume; a single careless order swings it. The mechanism that aggregates dispersed knowledge needs dispersed knowledge actually showing up. A thin market is not a wise crowd. It's a loud individual.
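A toy order book shows how little it takes. The sketch below assumes a simple list of resting sell orders and a market buy that walks through them from cheapest to most expensive; the "price" afterwards is the last fill. Both books and the order size are invented, and real exchanges differ in the details.

```python
def price_after_market_buy(asks, shares):
    """Fill a market buy against (price, size) asks, cheapest first,
    and return the price of the last share filled."""
    remaining = shares
    last_price = None
    for price, size in sorted(asks):
        if remaining <= 0:
            break
        fill = min(size, remaining)
        remaining -= fill
        last_price = price
    if remaining > 0:
        raise ValueError("order is larger than the whole book")
    return last_price

# Hypothetical books for the same contract, both starting at a best ask of 0.42.
thin_book = [(0.42, 5), (0.55, 5), (0.70, 10)]
deep_book = [(0.42, 5000), (0.43, 5000), (0.44, 10000)]

order = 15  # one small, careless buy
thin = price_after_market_buy(thin_book, order)
deep = price_after_market_buy(deep_book, order)
print(thin, deep)
```

The same 15-share order leaves the deep book at 0.42 and drags the thin one to 0.70. Nothing about the world changed between the two cases; only the depth did.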

Markets can be confidently, expensively wrong. Calibration is a property of the long run, not a guarantee on any one event — exactly the point from the calibration piece. Speculative bubbles are markets being wrong in public for extended stretches; prediction markets are not immune to the same herding, thin-float distortions, and the occasional whale with an agenda. "Well-calibrated over thousands of contracts" and "right about the one you care about tonight" are different promises, and only the first is supported.

The longshot bias is real. Markets carry a persistent, well-documented wrinkle: low-probability events tend to be priced a little too high, heavy favorites a little too low.⁷ A basket of 4¢ longshots will, on average, resolve worse than 4¢ implies. It's small in deep markets, and it doesn't sink the broader claim, but it's a standing reminder that "the price is the true probability" is an excellent approximation, not a law of physics.
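The arithmetic behind that basket is short. In the Python sketch below, the 4¢ price comes from the text, but the "true" frequencies are assumptions chosen only to show the direction of the effect; the size of the bias in any real market is an empirical question.

```python
def expected_return_per_dollar(price, true_prob):
    """Expected profit per dollar staked on a contract that pays 1
    if the event happens and 0 otherwise."""
    return true_prob / price - 1.0

# A 4-cent longshot that (by assumption) really happens 3% of the time.
longshot = expected_return_per_dollar(price=0.04, true_prob=0.03)

# A 96-cent favorite that (by assumption) really happens 97% of the time.
favorite = expected_return_per_dollar(price=0.96, true_prob=0.97)

print(round(longshot, 4), round(favorite, 4))
```

One percentage point of mispricing costs the longshot buyer about a quarter of the stake in expectation, while the mirror-image error on the favorite is worth about one percent. The same absolute gap is a much larger relative error at the cheap end, which is part of why the bias is most visible there.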

It needs a clean question. A market can only price what it can unambiguously resolve. Vague, long-horizon, or definitionally slippery questions ("will this be a good decade for democracy") don't get sharp prices, because there's no clean payout to anchor the trade. The forecasting edge lives where the question is crisp and the resolution is honest — which is its own deep problem, and the reason resolution gets its own essay in this series.

And about 2016

You're thinking it, so let's deal with it. "Didn't the markets blow 2016?" It's the standard rebuttal, and it's mostly a misremembering. On the night, the leading election markets had the eventual winner at something like one chance in six to one in five. That's not a market saying it wouldn't happen — that's a market saying it was unlikely-but-live, roughly the odds of rolling a particular number on a die. Plenty of polls and pundits were far more lopsided and far more certain. When the underdog won, the people who'd said "95% the other way" took the loss; the market, which had said "more like 80/20," was closer to honestly uncertain the whole time.

So 2016 is a genuine caution, just not the one people quote. It's not evidence that markets are dumb. It's evidence that a probability is not a promise — an 83% favorite loses 17% of the time, and that's the system working, not failing. If you grade markets the way you should grade any forecaster — over many calls, by calibration, not by cherry-picking the upset — they hold up. The cherry-pick is the error, and a market is the forecaster least likely to commit it, because it never claimed certainty in the first place.

An 83% favorite that loses isn't a broken market. It's the 17% showing up — which it's supposed to do.
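You can put numbers on "closer to honestly uncertain" with a Brier-style penalty, the squared gap between the stated probability and what happened. The Python sketch below uses the 83% and 95% figures from the text and assumes, for the long-run comparison, that favorites of this kind really do win 83% of the time. That last figure is an assumption for illustration, not a measured rate.

```python
def brier(forecast, outcome):
    """Squared error for one forecast; outcome is 1 if the favorite won."""
    return (forecast - outcome) ** 2

def expected_brier(forecast, true_prob):
    """Average penalty over many races where the favorite wins with `true_prob`."""
    return true_prob * brier(forecast, 1) + (1 - true_prob) * brier(forecast, 0)

market, pundit = 0.83, 0.95  # probability each gave the favorite

# The night of the upset: the favorite loses (outcome = 0).
upset_market = brier(market, 0)
upset_pundit = brier(pundit, 0)

# Graded over many such races (assumed true win rate: 83%).
long_run_market = expected_brier(market, 0.83)
long_run_pundit = expected_brier(pundit, 0.83)

print(round(upset_market, 4), round(upset_pundit, 4))
print(round(long_run_market, 4), round(long_run_pundit, 4))
```

Both forecasters take a hit on the upset, but the 95% call is penalized about 0.90 against the market's 0.69. Over the long run the gap persists (roughly 0.156 against 0.141), because overconfidence costs more on the misses than it gains on the hits.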

Markets, polls, models, experts — the honest ranking

Put it together without the cheerleading. Across the better studies, the ranking is fairly stable: a liquid market ≥ a well-built aggregate (poll average, superforecaster pool, ensemble model) > a single good model > a lone expert > a confident pundit ≈ chance. But the gaps between the top three are small and conditional, and the headline result of the last twenty years isn't "markets beat experts." It's subtler and more useful: aggregation beats individuals, and incentives beat opinions. The market wins because it is the purest available form of both at once — not because there's something mystical in a price.

Liquid market (aggregates + incentivizes + updates): Usually the most accurate when there's real depth and a clean question. Well-calibrated as probabilities. Fails on thin liquidity.

Aggregated humans (poll average · superforecaster pool): Close behind, sometimes level. The Good Judgment Project shows a well-run, updated, aggregated pool is genuinely hard to beat.

A single good model (one ego-free statistical rule): Beats most humans on stable questions (Tetlock's simple extrapolations embarrassed the experts) but blind to anything outside its inputs.

A lone expert (high variance · ungraded · often loud): Barely beats chance on average; confidence doesn't track accuracy. The benchmark everyone overrates, and the one the others beat.

Fig 3 — the honest ranking: aggregation beats individuals, incentives beat opinions · the top gaps are modest

The synthesis: a benchmark, not an oracle

If markets aren't magic, what's the right way to use them? Not as a crystal ball — as a benchmark. The best forecast of a hard question usually combines the three: a market to aggregate and stay live, models to enforce discipline and catch what the crowd is fading, expert judgment to frame the question and read the regime the data can't see. They have uncorrelated weaknesses, which is exactly why blending them works; this is the lesson of every serious forecasting tournament, where the winning entry is almost always an ensemble, not a hero.

What the market gives you is the number to beat. If your model disagrees with a deep, liquid market, that's not automatically your edge — it's a flag that says show your work. Maybe you know something the crowd doesn't and there's money on the floor; that's the whole game, and the people who do the work get paid for exactly this. Or maybe the market is pricing something you missed, and the disagreement is your error wearing the costume of an insight. The market doesn't tell you which. It just sets an honest, money-backed bar and dares you to clear it.

That's the steel-manned case, and it's stronger than the slogan it replaces. Not "the crowd is always right": it isn't, and the longshots and the thin markets and the occasional public bubble prove it. The real claim is narrower and far better supported: a liquid market, on a clean question, is the most honest and least improvable forecast we currently know how to produce, and when it's beatable, beating it is real work, not an opinion. Decades of data, from dart-throwing experts to hobbyist superforecasters to the Iowa markets, all point at the same unglamorous mechanism: aggregate widely, weight by skin in the game, update forever. A prediction market is just the cleanest machine we've built for doing all three at once. That's not magic. It's better than magic, because you can check it.

Notes

1. The claim that liquid prediction markets outperform polls, expert panels, and statistical models is made and sourced in Part 3 of this series, Why prediction markets matter; the foundational survey is Justin Wolfers & Eric Zitzewitz, "Prediction Markets," Journal of Economic Perspectives 18(2), 2004.
2. Philip E. Tetlock, Expert Political Judgment: How Good Is It? How Can We Know? (Princeton University Press, 2005). ~82,000 forecasts from 284 experts over roughly two decades; experts barely beat chance on calibration and were often outperformed by simple extrapolation rules. The "dart-throwing chimpanzee" line and the fox/hedgehog distinction are from this work.
3. The IARPA-funded forecasting tournament (the ACE program) and the Good Judgment Project: Barbara Mellers, Lyle Ungar, Philip Tetlock et al., "Psychological Strategies for Winning a Geopolitical Forecasting Tournament," Psychological Science 25(5), 2014; popularized in Philip Tetlock & Dan Gardner, Superforecasting: The Art and Science of Prediction (2015). GJP's top "superforecasters" outperformed control groups and reportedly bested intelligence-community analysts with access to classified data, by a margin reported around 30%.
4. Joyce Berg, Robert Forsythe, Forrest Nelson & Thomas Rietz, "Results from a Dozen Years of Election Futures Markets Research," in Handbook of Experimental Economics Results (2008); and Berg, Nelson & Rietz, "Prediction market accuracy in the long run," International Journal of Forecasting 24(2), 2008. Across the studied U.S. presidential elections, Iowa Electronic Markets prices were closer to the outcome than contemporaneous polls a majority of the time, with the advantage largest well before election day.
5. On the calibration of market prices and how to grade a probabilistic forecast (calibration curves and the Brier score), see Part 4 of this series, A price is a probability. The Brier score is from Glenn W. Brier, "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review (1950).
6. F. A. Hayek, "The Use of Knowledge in Society," American Economic Review 35(4), 1945, for the dispersed-knowledge argument and the price-as-information mechanism. Developed for prediction markets in the flagship, Why prediction markets matter.
7. Favorite–longshot bias is a long-documented regularity in betting and prediction markets: longshots are over-priced and favorites under-priced relative to true frequencies. See Erik Snowberg & Justin Wolfers, "Explaining the Favorite–Longshot Bias," Journal of Political Economy 118(4), 2010, and the broader market-microstructure literature.