Moving Past RLHF: In 2025 We Will Transition from Preference Tuning to Reward Optimization in Foundation Models

Models like GPT-o3 and Tülu 3 are showing the way.

Dec 29

A brief note: Given the limited market activity during the holiday season, we will replace our traditional Sunday edition for this week and next week with our popular 'The Sequence Chat,' in which we discuss some original ideas about the AI space. Now onto today’s subject:

In a recent essay in this newsletter, we explored the transition from an emphasis on pretraining to post-training in foundation models. The release of models like GPT-o1 and the initial details about GPT-o3, as well as frameworks such as Tülu 3, provide a glimpse of that trajectory. However, even within the post-training space, we are seeing intriguing changes in techniques. One of those is the transition from preference tuning with methods such as the famous RLHF to reward modeling. Today, I would like to explore how preference tuning paved the way for reward optimization, examine the impact and limitations of RLHF, and discuss the emergence of new reward models that aim to capture complex human values more effectively.

Modern artificial intelligence has reached a pivotal stage with the advent of foundation models—massive neural networks that can be adapted to an array of tasks through minimal fine-tuning. These models, which learn statistical patterns from sprawling corpora of text, possess an extraordinary ability to generate and interpret natural language. However, as they grow more powerful, the need to align their outputs with human goals, values, and preferences becomes both more urgent and more challenging.

Initially, preference tuning served as the de facto approach to alignment, relying on human-annotated datasets to guide model behavior. Although preference tuning yields significant benefits in terms of helpfulness and safety, it struggles to incorporate the full range of human intentions, values, and context-specific nuances. In response, researchers have turned to reward optimization, particularly approaches like Reinforcement Learning from Human Feedback (RLHF), to further refine model behavior based on explicit reward signals.

Within this rapidly evolving field, recent projects such as GPT-o3's deliberative alignment and Tülu 3 exemplify the shift from preference-based fine-tuning to more dynamic, reward-focused paradigms.

The Rise of Foundation Models and Preference Tuning...
