o3—the new state-of-the-art reasoning model - Sync #498
I hope you enjoy this free post. If you do, please like ❤️ or share it, for example by forwarding this email to a friend or colleague. This post took around eight hours to write. Liking or sharing it takes less than eight seconds and makes a huge difference. Thank you!

Plus: Nvidia's new tiny AI supercomputer; Veo 2 and Imagen 3; Google and Microsoft release reasoning models; Waymo to begin testing in Tokyo; Apptronik partners with DeepMind; and more!

Hello and welcome to Sync #498!

This week, OpenAI concluded its 12 Days of OpenAI with the announcement of o3, its new reasoning model that shattered benchmark records and redefined what it means to be a state-of-the-art AI model.

Elsewhere in AI, Microsoft and Google have introduced their own reasoning models: Phi-4 and Gemini 2.0 Flash Thinking Experimental, respectively. Google also unveiled the new version of its video generator, Veo 2, and its image generator, Imagen 3. Nvidia joined the release frenzy with the Jetson Orin Nano Super, a compact AI computer designed for edge and robotics applications.

Speaking of robotics, Apptronik announced a partnership with Google DeepMind, while Waymo is heading to Tokyo. We also have a robot dragonfly built by the CIA in the 1970s and a robot that fooled rats into thinking it is one of them.

In other news, a third person has received a gene-edited pig kidney, and we'll learn how Strandbeest are built and how they evolve into something more than a kinematic sculpture.

Enjoy!

o3: the new state-of-the-art reasoning model

12 Days of OpenAI started strong with the release of the full o1 model, OpenAI's reasoning model. And OpenAI's Shipmas concluded even stronger with o3, the new state-of-the-art reasoning model.

When the full o1 model was released on the first day of 12 Days of OpenAI, I think some people were expecting more than what had been demonstrated by o1-preview.
While the full o1 might have been disappointing, o3, the newest model in OpenAI's line of reasoning models, not only delivered but also went the extra mile.

The benchmark results for o3, provided by OpenAI, are truly impressive, as they match and sometimes exceed those of top humans. Consider Codeforces, where o3 achieved a rating of 2727. That rating qualifies o3 for the rank of International Grandmaster, a title reserved for the best competitive programmers, who are in the top 0.05%.

The results for Epoch AI's FrontierMath benchmark are equally impressive. Each problem in this benchmark is designed to require hours of work, even from expert mathematicians. Models like GPT-4 and Gemini could not score more than 2%. o3 scored 25.2%. That's a massive leap in performance and reasoning capabilities.

However, there was another benchmark that o3 crushed. ARC-AGI is regarded as one of the toughest benchmarks for assessing the intelligence of AI models. Each ARC-AGI test presents examples of a pattern and then requires solving a challenge based on that pattern.

Most humans can easily solve the ARC-AGI tests, but AI models have struggled with them so far. GPT-4o scored only 5%. o1, when run on a high-compute setting, scored 32%. The best open-source project achieved a score of 54.5%.

So, how did o3 score on the ARC-AGI benchmark? o3 scored 76%, matching the average human score and significantly outperforming both o1 and leading open-source projects. However, that score was achieved on a low-compute setting. On a high-compute setting (approximately 172 times more compute than the low configuration), o3 scored 88%. In the ARC-AGI benchmark, o3 stands in a class of its own, with no competitors coming even remotely close.

OpenAI hasn't revealed when o3 will be publicly available, but it seems we can expect the release in Q1 of 2025.
However, if you are an AI safety researcher, you have an opportunity to help evaluate and identify potential safety and security implications of o3 ahead of its public release.

o3 marks a new chapter in AI research. We now have an AI model capable of matching and sometimes exceeding human experts in tasks requiring high levels of reasoning, such as solving challenging algorithmic or mathematical problems.

Moreover, o3 and similar models will continue to improve, bounded only by how much computing power they can use. The difference between o3 in the low-compute and high-compute configurations on ARC-AGI highlights just how much better these models perform when given more time to "think." Sure, that high-compute run used 172 times more computing power, which translates to about $350,000 for that single run. However, it's worth noting that we will soon see more and larger AI-focused data centres coming online. Additionally, more efficient chips offering greater computing power will become available. What OpenAI paid for that high-compute run might cost just a fraction of that $350,000 in a year or two.

With o3, we might be at the same point as when GPT-4 was released two years ago: a new model redefining the state of the art and unlocking new possibilities. Now we have to wait until o3 is out to see what we can do with it and how OpenAI's competitors will respond.
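To put the compute and cost figures from the o3 section into numbers, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in this post (76% and 88% on ARC-AGI, a 172x compute ratio, and about $350,000 for the high-compute run). Everything else is an assumption made for illustration: that cost scales linearly with compute, and that the price of compute halves every year (a hypothetical rate, not a forecast). The function names are made up for this example.

```python
import math

# Figures quoted in this post.
LOW_SCORE = 76.0            # ARC-AGI score on the low-compute setting (%)
HIGH_SCORE = 88.0           # ARC-AGI score on the high-compute setting (%)
COMPUTE_RATIO = 172.0       # high-compute used ~172x the low-compute budget
HIGH_RUN_COST = 350_000.0   # approximate cost of the high-compute run (USD)


def points_per_doubling(low: float, high: float, ratio: float) -> float:
    """Average score gain per doubling of compute between the two settings.

    With only two data points this is a crude average, not a scaling law.
    """
    return (high - low) / math.log2(ratio)


def implied_low_cost(high_cost: float, ratio: float) -> float:
    """ASSUMPTION: cost scales linearly with compute."""
    return high_cost / ratio


def projected_cost(cost_today: float, yearly_factor: float, years: int) -> float:
    """ASSUMPTION: the same run gets cheaper by `yearly_factor` every year.

    `yearly_factor` is a hypothetical input (0.5 means "halves every year").
    """
    return cost_today * yearly_factor ** years


if __name__ == "__main__":
    gain = points_per_doubling(LOW_SCORE, HIGH_SCORE, COMPUTE_RATIO)
    print(f"Score gain per doubling of compute: {gain:.2f} points")

    low_cost = implied_low_cost(HIGH_RUN_COST, COMPUTE_RATIO)
    print(f"Implied low-compute run cost: ${low_cost:,.0f}")

    for years in (1, 2):
        cost = projected_cost(HIGH_RUN_COST, 0.5, years)
        print(f"High-compute run after {years} year(s): ${cost:,.0f}")
```

The takeaway is the shape of the trade-off rather than the exact numbers: each extra point on ARC-AGI was expensive, and the "fraction of $350,000" estimate depends entirely on how quickly compute gets cheaper.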
🦾 More than a human

A woman in the US is the third person to receive a gene-edited pig kidney

First baby conceived via breakthrough fertility tech born

🧠 Artificial Intelligence

State-of-the-art video and image generation with Veo 2 and Imagen 3

NVIDIA Unveils Its Most Affordable Generative AI Supercomputer
Nvidia has launched the Jetson Orin Nano Super, which it calls the most affordable generative AI supercomputer, priced at only $249. This compact computer, small enough to fit in the palm of your hand, delivers 67 INT8 TOPS (trillion operations per second) of AI performance and is capable of running many advanced AI models, including large language models. While the Jetson Orin Nano Super is primarily designed for robotics, it is likely to also find applications in edge computing and among makers and hobbyists. I want one too!

Google releases its own 'reasoning' AI model

Introducing Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning

GitHub launches a free version of its Copilot

The era of open voice assistants has arrived

▶️ Safety Alignment Should be Made More Than Just a Few Tokens Deep (48:52)
In this video, Yannic Kilcher analyses a paper titled Safety Alignment Should be Made More Than Just a Few Tokens Deep which, as the title suggests, argues that safety alignment mostly changes how a model behaves over the first few tokens of a response, leaving the rest of the model's behaviour largely unchanged and vulnerable to attacks. The paper suggests some solutions to improve models' safety, while Kilcher suggests that baking safety into next-token prediction might not be a robust long-term solution.
🤖 Robotics

Waymo to begin testing in Tokyo, its first international destination

Boston Dynamics lays off 45 employees, 5 percent of its workforce

Apptronik partners with Google DeepMind to advance humanoid robots with AI

Why It's Time to Get Optimistic About Self-Driving Cars

In the 1970s, the CIA Created a Robot Dragonfly Spy. Now We Know How It Works.

AI infiltrates the rat world: New robot can interact socially with real lab rats

🧬 Biotechnology

The '4th Wave' of AI Drug Discovery is Here, According to This Report

Gold-based drug slows cancer tumor growth by 82%, outperforms chemotherapy

💡 Tangents

▶️ 12,419 Days Of Strandbeest Evolution (21:38)
I'm fascinated by Strandbeest, the walking kinematic sculptures created by Theo Jansen. There is something beautiful about those skeleton-like machines that walk using only wind as their source of power. In this video, Veritasium shares the story of Strandbeest, why Jansen built them, the engineering challenges he faced, and how Strandbeest evolve to become more like living organisms than sculptures.

Thanks for reading.

Humanity Redefined sheds light on the bleeding edge of technology and how advancements in AI, robotics, and biotech can usher in abundance, expand humanity's horizons, and redefine what it means to be human.

A big thank you to my paid subscribers, to my Patrons: whmr, Florian, dux, Eric, Preppikoma and Andrew, and to everyone who supports my work on Ko-Fi. Thank you for the support!

My DMs are open to all subscribers. Feel free to drop me a message, share feedback, or just say "hi!"