Astral Codex Ten - Who Predicted 2023?
I. The Annual Forecasting Contest

…is one of my favorite parts of this blog. I get a spreadsheet with what are basically takes - “Russia is totally going to win the war this year”, “There’s no way Bitcoin can possibly go down”. Then I do some basic math to it, and I get better takes. There are ways to look at a list of 3300 people’s takes, do math, and get a take reliably better than all but a handful of them.

Why is this interesting, when a handful of people still beat the math? Because we want something that can be applied prospectively and reliably. If John Smith from Townsville was the highest-scoring participant, it matters a lot whether he’s a genius who can see the future, or whether he just got lucky. Part of the goal of this contest was to figure that out - to figure out whether the most reliable way to predict the future is to trust one identifiable guy, to trust some mathematical aggregation across guys, or something else.

Here’s how it goes: in January 2023, I asked people to predict fifty questions about the upcoming year, like “Will Joe Biden be the leading candidate in the Democratic primary?”, in the form of a probability (eg “90% chance”). About 3300 of you kindly took me up on that (“Blind Mode”). Then I released the list of 3300 x 50 guesses and asked people to analyze them with the aggregation algorithm of their choice to produce what they thought was the best possible list. 460 of you took me up on that (“Full Mode”).

Then I waited until 2024 and sent everything to Eric Neyman, who’s better at math than I am. He used the Metaculus scoring function to assess everyone’s accuracy. Thanks to Eric (and to Sam Marks, who helped last time around) for taking care of this.

II. And The Winners Are . . .

For Blind Mode - where you had to rely on your wits alone and couldn’t spend more than five minutes per question - the winners are:
And there was also Full Mode, where you could read everyone else’s predictions first, check prediction markets, apply whatever algorithms you wanted, and take as long as you needed. While the Blind Mode winners were amateurs or completely unidentifiable, the Full Mode winners were mostly long-time forecasting veterans.
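For the curious: I won’t reproduce the Metaculus scoring function exactly, but scoring rules of this type are usually built on the logarithmic score, which rewards confident correct predictions and heavily punishes confident wrong ones. Here’s a minimal, purely illustrative sketch in Python - the function name and the clipping constant are my own choices, and the real contest score was further normalized (as noted below, the median participant scores exactly 0):

```python
import numpy as np

def log_score(probabilities, outcomes):
    """Illustrative log score for a set of probability forecasts.

    probabilities: forecast probability that each event happens (0-1).
    outcomes: 1 if the event happened, 0 if it didn't.
    Higher (less negative) is better; guessing 50% on every question
    gives log(0.5) per question.
    """
    p = np.clip(np.asarray(probabilities, dtype=float), 1e-6, 1 - 1e-6)
    o = np.asarray(outcomes)
    return float(np.sum(np.where(o == 1, np.log(p), np.log(1 - p))))

# Example: three questions; the forecaster said 90%, 20%, 50%,
# and the first two resolved yes while the third resolved no.
print(log_score([0.9, 0.2, 0.5], [1, 1, 0]))
```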
Here are some other scores I found interesting:
III. What Did We Learn?

Okay, fine, but you don’t know most of these people. The really interesting question is how individuals like these compare to prediction markets, experts, and the wisdom of crowds. How much of their success is luck vs. skill? And if we have data like this next time, how do we best predict the future? Here’s what I’ve got:

Going over this bit by bit:

Median participant: Score of 0 and 50th percentile, by definition. Is this the median participant in Blind Mode (due date in January, couldn’t check others’ guesses, under five minutes of research) or Full Mode (due date in February, could check others’ guesses, unlimited research)? It doesn’t matter! For some reason, these two contests had almost exactly the same median score. I’m unprincipledly lumping them together for the rest of the discussion - when I cite prediction market numbers, they will be from somewhere in the middle of their January and February scores.

50% on everything: If you literally guessed 50% for every question, you would have done very slightly better than our average participant. More than half of you subtracted value from total uncertainty!

Median superforecaster: 56 people who had previously been declared “superforecasters” (usually by doing very well in a previous tournament) were kind enough to participate. They did better than average, but not by much - the median superforecaster scored in the 70th percentile of all participants.

Median 2022 winner: Did our winners win by luck or skill? One way of assessing this is to see how the 2022 winners did this year. Of the 15 top-scoring 2022 participants, 5 foolishly decided that instead of resting on their laurels they would try again this year. On average, they scored in the 88th percentile - ie 395th place. I conclude that, overall, most winners are around the 90th percentile in skill - but it’s luck that brings them the rest of the way to the leaderboard.

Manifold Markets: Manifold, a popular play-money prediction market site, kindly agreed to open markets on our fifty questions so we could compare them to participants. The markets got between 80 and 1500 participants each, averaging around 150. Their forecast, had it been a contestant, would have placed in the 89th percentile. That would be good for an individual, but it’s surprisingly bad for an aggregation method - in fact, it’s worse than taking the median of a randomly selected group of 150 participants! The market mechanism seems to be subtracting value! Someone might want to double-check this.

Participant aggregate: This is the “wisdom of crowds” one. If you average the guesses of every participant (eg if someone says 80% chance Biden leads, and someone else says 90%, then you go with 85%), you usually do better than the vast majority of individuals. In this case, the aggregate was at the 95th percentile, beating out the superforecasters and Manifold (there’s a minimal sketch of this averaging below).

Superforecaster aggregate: If you average only the superforecasters’ guesses, you do even better. This isn’t trivial - superforecasters are a smaller crowd than the set of all participants - but in this case the higher-quality data trumped the larger crowd size.

Samotsvety: Samotsvety is a well-known forecasting team that usually wins these kinds of things. You can read more about them here. They scored at the 98th percentile, better than the aggregate of all other superforecasters.
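Since the participant aggregate is just arithmetic, here’s a minimal sketch of that averaging, along with the median-of-a-random-subset comparison used for Manifold above. The array values and sizes are made up for illustration; this isn’t the contest’s actual code:

```python
import numpy as np

# forecasts[i, j]: participant i's probability on question j.
forecasts = np.array([
    [0.80, 0.30, 0.55],
    [0.90, 0.25, 0.60],
    [0.70, 0.40, 0.50],
])

# "Wisdom of crowds" aggregate: per-question mean of everyone's guesses
# (eg an 80% guess and a 90% guess average to 85%).
crowd_mean = forecasts.mean(axis=0)

# The Manifold comparison instead takes the median of a random subset
# of participants (150 in the post; 2 here, given the toy data).
rng = np.random.default_rng(0)
subset = forecasts[rng.choice(len(forecasts), size=2, replace=False)]
subset_median = np.median(subset, axis=0)

print(crowd_mean)
print(subset_median)
```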
There are a few asterisks on Samotsvety’s result. First, it wasn’t exactly a team effort - one of their forecasters did the work and “ran it by” everyone else without getting any objections. Second, for complicated legal reasons that they explained and which satisfied me, they couldn’t enter the contest proper and had to send me their guesses later, so I had to take it on trust that the guesses were made in January along with everyone else’s.

Metaculus: A “forecasting engine” that serves the same role as a prediction market but operates slightly differently. They ask everyone to make a guess on each question, then aggregate the answers, weighted by past performance and a proprietary algorithm. Metaculus scored in the 99.5th percentile of our contest and was the top performer other than random individuals who might have just gotten lucky.

Ezra Karger: …is a possible exception to the above claim. He’s a non-random individual - director of the Forecasting Research Institute - and has previously placed very highly in contests like these (he was 7th in last year’s ACX contest). Based on this, I suspect his performance was mostly repeatable skill and not just luck. He outscored all but four of our 4,215 Blind Mode and Full Mode participants, which puts him above the 99.9th percentile. Since he entered Full Mode, he was allowed to do complicated technical things, and he described his method as:
Small Singapore won Blind Mode. As I said before, they’re a total mystery to me and I don’t know if they won by luck or not. Douglas Campbell runs a prediction market, which I guess also makes him non-random, but I hadn’t previously heard of him being an exceptional forecaster himself, so I don’t know how much to weight this. He describes his method as:
(I kind of want to make a Virgin vs. Chad meme comparing his answer with Ezra’s, but I’ll restrain myself out of respect for the dignity of our participants.)

IV. Out Of Distribution Events

Another fun thing we can do with these data is see which 2023 events were most vs. least surprising:

The first colored column represents the average score on each question. A more negative number means that more people got the question wrong (gave a low probability for something that happened, or a high probability for something that didn’t). The second colored column represents the correlation between each question and overall score. A question where good forecasters beat bad forecasters is positive; a question where bad forecasters beat good forecasters is negative.

Why would bad forecasters ever beat good forecasters? It means an event was unlikely, but happened anyway. For example, if people were asked to predict whether some random person would win the lottery, smarter people would be more likely to predict no. If by coincidence he did win the lottery, then the smarter people would have lower scores on that question than the dumber people.

I’m torn on which of these matches our intuitive conception of a “surprising event”, but both methods suggest forecasters were very surprised that Bitcoin ended the year over $30,000 (it started the year around $16,500 and ended at $43,000). Bitcoin is now up to $68,000, which I imagine would have been even more surprising to these people! (Weirdly, good forecasters were more likely than bad forecasters to believe Bitcoin would go up at all, but less likely to believe it would go up as much as it did.)

Other resolutions that took people by surprise: that Starship didn’t reach orbit, that inflation dropped so fast, and that Joe Biden’s approval rating stayed as low as it did. The least surprising thing about 2023 was that nobody used a nuclear weapon.

V. Takeaways And Thanks

My main takeaway is that Metaculus beats prediction markets, superforecasters, wisdom of crowds, and (probably, most of the time) Samotsvety. Based on the performance of last year’s winners, most people who outperform Metaculus do so by luck and will regress to the mean next year. This contest leaves open the possibility that a small number of people (maybe including Ezra Karger) might be able to consistently get super-Metaculus performance - it just takes more than one contest to identify them.

This doesn’t mean that most prediction markets and superforecasters are useless. It just means that their benefit comes from being faster and easier to invoke than Metaculus, not from being more accurate.

Metaculus is hosting a 2024 version of this contest, which, due to my delay in getting this up, is already closed. I’ll let you know how it goes. And hopefully I’ll have enough time next year to be more involved in the 2025 version.

Thanks to everyone who participated in this contest. Extra thanks to Christian Williams from Metaculus and the Manifold team for getting their respective sites involved, to Jonathan Mann and Samotsvety for willingly submitting to testing, and to Eric Neyman for calculating the scores. If you included an ID key in your entry, you can find your score here:
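For anyone who wants to reproduce the two colored columns in section IV, here’s a minimal sketch of how they could be computed from a matrix of per-question scores. The placeholder data and the choice of Pearson correlation are my own assumptions - the post doesn’t specify the exact details:

```python
import numpy as np

# scores[i, j]: participant i's score on question j (eg a log score term).
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 50))  # placeholder data for illustration

# First colored column: average score on each question.
# More negative = more people got that question wrong.
avg_per_question = scores.mean(axis=0)

# Second colored column: correlation between each question's score and
# overall score. Positive = good forecasters beat bad forecasters here;
# negative = an unlikely event happened and burned the better forecasters.
overall = scores.sum(axis=1)
corr_per_question = np.array([
    np.corrcoef(scores[:, j], overall)[0, 1] for j in range(scores.shape[1])
])

print(avg_per_question[:5])
print(corr_per_question[:5])
```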