The Sequence Chat: Lewis Tunstall, Hugging Face, On Building the Model that Won the AI Math Olympiad
Details about NuminaMath, its architecture, training process, and even the things that didn't work.

Bio: Lewis Tunstall is a Machine Learning Engineer on the research team at Hugging Face and co-author of the bestselling book “NLP with Transformers”. He has previously built machine learning-powered applications for start-ups and enterprises in the domains of natural language processing, topological data analysis, and time series. He holds a PhD in Theoretical Physics, was a 2010 Fulbright Scholar, and has held research positions in Australia, the USA, and Switzerland. His current work focuses on building tools and recipes to align language models with human and AI preferences through techniques like reinforcement learning.
I currently lead the post-training team at Hugging Face, where we focus on providing the open-source community with robust recipes to fine-tune LLMs through libraries like TRL. In a previous life, I was a theoretical physicist researching the strong nuclear force and its connection to dark matter. I accidentally stumbled into AI via my colleagues in experimental physics, who were very excited about applying a (new at the time) technique called “deep learning” to particle collisions at the Large Hadron Collider. I was surprised to learn that around 100 lines of TensorFlow could train a neural net to extract new signals from collision data, often much better than physics-derived features. This prompted me to take part in a Kaggle competition with a few physics friends, and I’ve been hooked on AI ever since!

🛠 ML Work
This project was a collaboration between Hugging Face and Numina, a French non-profit that was inspired by the AI Math Olympiad (AIMO) to create high-quality datasets and tools for the open-source community. Although there are several open-weight models for mathematics, the training datasets are rarely, if ever, made public. We teamed up with Numina to bridge this gap by tackling the first AIMO progress prize with a large-scale dataset of around 850,000 math problem-solution pairs that Numina had been developing prior to the competition. We saw that winning the AIMO competition would be a great way to show the community the power of high-quality datasets, and I’m very happy to see that it worked out well!
That’s a great question and I suspect the answer will depend on how straightforward it is to integrate multiple external verifiers like Lean4 and Wolfram Alpha for models to check their proofs. The current systems are trained to interface with a single external solver, but I expect the next iteration of DeepMind’s approach will involve a generalist model that can use multiple solvers, like how LLMs currently use tools for function calling.
This model was largely chosen due to the constraints of the competition: on the one hand, only pretrained models that were released before February 2024 could be used, and on the other hand, each submission had to run on 2 T4 GPUs in under 9 hours (not easy for LLMs!). Both these constraints made DeepSeekMath 7B the best choice at the time, although there are now much better math models, like those from Qwen, which would likely score even better in the competition.
We experimented with various fine-tuning recipes, but the one that gave the best results was based on an interesting paper called MuMath-Code. This paper combines two insights that have been shown to improve the reasoning capabilities of LLMs: Chain of Thought (CoT) templates that provide the intermediate steps needed to obtain the correct answer, and tool-integrated reasoning (TIR), where the LLM is given access to a tool like Python to run computations. Prior research had explored each one independently, but MuMath-Code showed that you get the best results by performing two-stage training: first on the CoT data (to learn how to solve problems step-by-step), followed by training on TIR data (to learn how to outsource part of the problem to Python). We applied this approach with DeepSeekMath 7B to the datasets Numina created and found it worked really well on our internal evals and the Kaggle leaderboard!
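To make the TIR idea concrete, here is a minimal, hypothetical sketch of the generate-execute-feed-back loop, not NuminaMath's actual inference code: the model emits a fenced Python block, the harness runs it, and the printed output is appended so the model can continue reasoning. The `model_generate` callable is a stand-in for any prompt-to-completion function.

```python
import re
import subprocess

def tir_loop(model_generate, prompt: str, max_rounds: int = 4) -> str:
    """Illustrative tool-integrated reasoning loop.

    `model_generate` is a hypothetical callable mapping a prompt
    string to a completion string.
    """
    text = prompt
    for _ in range(max_rounds):
        completion = model_generate(text)
        text += completion
        # Look for a fenced Python block emitted by the model.
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if match is None:
            break  # no more code to run: the model has produced its answer
        # Execute the code and feed the result back, mimicking the TIR
        # format where an ```output``` block follows each code block.
        result = subprocess.run(
            ["python", "-c", match.group(1)],
            capture_output=True, text=True, timeout=30,
        )
        text += f"\n```output\n{result.stdout or result.stderr}\n```\n"
    return text
```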
Our training recipe was tailored to competitive mathematics, where one has a known answer for each problem. For competitive physics or chemistry problems, I suspect the method would generalise provided one integrates the relevant tools for the domain. For example, I suspect that in chemistry one requires access to a broad range of tools in order to solve challenging problems.
Our winning recipe was quite simple in the end and involved applying two rounds of supervised fine-tuning (SFT) to the DeepSeekMath 7B model: a first stage on the CoT data, to teach step-by-step problem solving, followed by a second stage on the TIR data, to teach the model to offload computations to Python. A sketch of how these two stages fit together follows below.
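As a rough illustration, the two stages might be wired up with TRL's SFTTrainer as below; the dataset column handling and hyperparameters are assumptions for the sketch, not the exact competition configuration.

```python
# A minimal sketch of the two-stage SFT recipe with TRL's SFTTrainer.
# Hyperparameters and data formatting are illustrative simplifications.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "deepseek-ai/deepseek-math-7b-base"

def run_stage(model_path: str, dataset_name: str, output_dir: str) -> str:
    dataset = load_dataset(dataset_name, split="train")
    # Concatenate problem and solution into one training text
    # (a simplification; the real recipe formats samples with a template).
    dataset = dataset.map(
        lambda ex: {"text": ex["problem"] + "\n\n" + ex["solution"]}
    )
    trainer = SFTTrainer(
        model=model_path,
        train_dataset=dataset,
        args=SFTConfig(
            output_dir=output_dir,
            dataset_text_field="text",
            num_train_epochs=3,   # illustrative
            learning_rate=2e-5,   # illustrative
            packing=True,         # pack short samples for throughput
        ),
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir

# Stage 1: learn step-by-step reasoning from the CoT data.
cot_ckpt = run_stage(BASE_MODEL, "AI-MO/NuminaMath-CoT", "ckpt-cot")
# Stage 2: continue from the CoT checkpoint on tool-integrated data.
tir_ckpt = run_stage(cot_ckpt, "AI-MO/NuminaMath-TIR", "ckpt-tir")
```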
For tools, we used the TRL library for training our models on one node of 8 x H100 GPUs. For evaluations and the Kaggle submissions, we used vLLM, which provides fast inference for LLMs.
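For illustration, batched inference with vLLM looks roughly like the sketch below; the sampling settings here are assumptions rather than the submission's actual configuration.

```python
# A minimal sketch of batched inference with vLLM; sampling settings
# are illustrative, not the competition submission's exact config.
from vllm import LLM, SamplingParams

# The publicly released NuminaMath model on the Hugging Face Hub.
llm = LLM(model="AI-MO/NuminaMath-7B-TIR")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)

prompts = ["Find the remainder when 2^100 is divided by 7."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```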
The main ideas we tried that didn’t make it into the final submission were the following:
💥 Miscellaneous – a set of rapid-fire questions
As a former physicist, I am excited about the advances being made in AI4Science and specifically in simulating physical processes. I quite like the way Chris Bishop frames this as the “fifth paradigm” of scientific discovery and am hopeful this will help us find answers to some rather thorny problems.
It’s always risky to predict the future, but I think it’s clear that the current systems are lacking several capabilities like episodic memory and the ability to plan over long horizons. Whether these capabilities will emerge at larger scales remains to be seen, but history generally shows that one shouldn’t bet against deep learning!
My current view is that the main barrier for models to generate novel ideas in mathematics and the natural sciences is having some means to verify correctness and compatibility with experiment. For mathematics, I suspect that projects like Lean4 will play a key role in bridging the gap from proving what is known towards generating novel theorems which are then verified automatically by the model for correctness. For the natural sciences, the challenge appears to be much harder since one needs to both synthesize vast amounts of experimental data and possibly run new experiments to validate new ideas. Nevertheless, there are already hints from works like Sakana’s AI Scientist that formulating novel ideas in ML research is possible, albeit under rather stringent constraints. It will be exciting to see how far this can be pushed into other domains!
My favorite mathematician is John von Neumann, mostly because I didn't really understand quantum mechanics until I read his excellent textbook on the subject.