Moving Past RLHF: In 2025 We Will Transition from Preference Tuning to Reward Optimization in Foundation Models
Models like GPT-o3 and Tülu 3 are showing the way.

A brief note: Given the limited market activity during the holiday season, we will replace our traditional Sunday edition for this week and next week with our popular 'The Sequence Chat,' in which we discuss some original ideas about the AI space.

Now onto today’s subject:

In a recent essay in this newsletter, we explored the transition from an emphasis on pretraining to post-training in foundation models. The release of models like GPT-o1 and the initial details about GPT-o3, as well as frameworks such as Tülu 3, provide a glimpse of that trajectory. However, even within the post-training space, techniques are changing in intriguing ways. One of those changes is the transition from preference tuning, with methods such as the famous RLHF, to reward modeling. Today, I would like to explore how preference tuning paved the way for reward optimization, examine the impact and limitations of RLHF, and discuss the emergence of new reward models that aim to capture complex human values more effectively.

Modern artificial intelligence has reached a pivotal stage with the advent of foundation models: massive neural networks that can be adapted to an array of tasks through minimal fine-tuning. These models, which learn statistical patterns from sprawling corpora of text, possess an extraordinary ability to generate and interpret natural language. However, as they grow more powerful, the need to align their outputs with human goals, values, and preferences becomes both more urgent and more challenging. Initially, preference tuning served as the de facto approach to alignment, relying on human-annotated datasets to guide model behavior.
Although preference tuning yields significant benefits in terms of helpfulness and safety, it struggles to incorporate the full range of human intentions, values, and context-specific nuances. In response, researchers have turned to reward optimization, particularly approaches like Reinforcement Learning from Human Feedback (RLHF), to further refine model behavior based on explicit reward signals. Within this rapidly evolving field, recent projects such as GPT-o3's deliberative alignment and Tülu 3 exemplify the shift from preference-based fine-tuning to more dynamic, reward-focused paradigms.

The Rise of Foundation Models and Preference Tuning...