Reinforcement Learning from Human Feedback (RLHF), the post-training method that powered ChatGPT, Claude 2, and Gemini 1.0, has been functionally displaced at every major AI lab as of Q1 2026. Four automated techniques — GRPO, DAPO, RLVR, and synthetic self-play — now dominate the AI post-training stack, producing stronger models at lower cost with minimal human annotation. The transition took 18 months, accelerating from DeepSeek’s R1 paper in January 2025 through widespread adoption by OpenAI, Google, Anthropic, and ByteDance by the end of that year.
Human feedback has not disappeared entirely — it still informs early-stage instruction tuning and safety evaluations. But it no longer sits at the core of capability training. That seat belongs to verifiable rewards, group policy optimization, and self-generated reasoning data.
What RLHF Was — and Why It Dominated for Three Years
RLHF entered mainstream AI through OpenAI’s InstructGPT paper (January 2022), authored by Long Ouyang and colleagues. The method operates in three stages: supervised fine-tuning on human demonstrations, training a reward model on human preference comparisons (typically 10,000–100,000 pairwise labels), and then optimizing the language model against that reward model using Proximal Policy Optimization (PPO).
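The second stage's preference objective is a Bradley-Terry pairwise loss. The function below is a minimal illustration of that objective, not OpenAI's implementation:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    -log sigmoid(r_chosen - r_rejected). Minimizing it pushes the
    reward model to score the human-preferred completion higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the reward model already ranks a pair correctly the loss approaches zero; a reversed ranking produces a large loss. Summed over tens of thousands of human comparisons, this is what shapes the reward landscape that PPO then optimizes against.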
The results were immediately compelling. InstructGPT with 1.3 billion parameters consistently outperformed raw GPT-3 at 175 billion parameters on human evaluations. Human alignment required relatively little compute compared to pretraining — but it did require humans, continuously, at scale.
RLHF’s three-year dominance rested on one empirical fact: it worked better than anything else available. The alignment signal from human preferences was noisy and expensive, but it was real. No lab had a scalable alternative until DeepSeek published one.
The Bottleneck That Made RLHF Unscalable
Three compounding constraints broke RLHF as frontier models scaled beyond GPT-4 class capability:
- Cost: Expert-level annotation for math, law, and code tasks costs $50–$150 per hour. Generating 100,000 high-quality preference pairs for a single training run exceeds $5 million in annotation costs at the top of the quality spectrum.
- Competence ceiling: Human raters cannot reliably evaluate outputs that exceed their own expertise. Inter-rater agreement on complex mathematical reasoning tasks falls below 60% in practice, according to scalable oversight research published by DeepMind in 2023. You cannot align a model to human judgment when human judgment itself is inconsistent.
- Reward hacking: Models trained via RLHF systematically exploit reward model blind spots — producing verbose, confident-sounding responses that score well with raters regardless of factual accuracy. This is Goodhart’s Law applied to neural networks: when a measure becomes a target, it ceases to be a good measure.
By Q3 2024, every major lab’s alignment and post-training teams were running parallel research tracks on human-feedback alternatives. DeepSeek shipped the first production result.
GRPO: DeepSeek’s Critic-Free Breakthrough
Group Relative Policy Optimization (GRPO) was introduced in DeepSeek's DeepSeekMath paper (February 2024) and reached production prominence with the R1 technical report, published January 20, 2025. The core innovation is eliminating the critic (value) network that standard PPO requires.
In PPO, a separate value network trained in parallel estimates expected reward at each state to compute advantage estimates — effectively doubling GPU memory requirements during RL training. GRPO eliminates this entirely by computing advantage estimates relative to the group average across a batch of completions for the same prompt.
The mechanics: for a given input, the model generates G completions. Each receives a reward score from a verifiable reward function. Each completion’s advantage is its reward minus the group mean, normalized by standard deviation. No critic model. No separate value network.
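Those mechanics fit in a few lines. This is a schematic sketch of the advantage computation only; the surrounding PPO-style policy update is omitted:

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO advantage: each completion's reward minus the group mean,
    normalized by the group standard deviation. No learned critic,
    no separate value network."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 4 completions for one prompt, scored by a binary verifier:
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The advantages always sum to zero within a group: correct completions are reinforced exactly as much as incorrect ones are suppressed, which is what lets the group average stand in for a learned baseline.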
This yields roughly a 40–50% reduction in GPU memory requirements during RL training. More importantly, it enabled DeepSeek-R1-Zero to be trained with pure reinforcement learning directly from the base model: no supervised fine-tuning seed, no human preference labels of any kind. The model developed multi-step chain-of-thought reasoning, self-verification, and extended thinking traces entirely through binary reward signals on math and coding problems.
The R1-Zero result was the technical shock of early 2025: emergent reasoning without any human-generated reasoning data as a supervision signal.
DAPO: How ByteDance Fixed GRPO’s Instability Problem
Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) was published by ByteDance's Seed team and academic collaborators in March 2025 in response to a specific GRPO failure mode: entropy collapse on long-horizon reasoning tasks.
In extended reasoning chains, GRPO training causes the model to converge prematurely — it becomes overconfident in early solution strategies and stops exploring alternatives. DAPO addresses this with four targeted modifications:
- Clip-Higher: Asymmetric probability clipping that allows higher probability increases for high-reward actions than decreases for low-reward ones, preserving exploration throughout training.
- Token-Level Policy Gradient Loss: Averages the loss over tokens rather than over whole samples, and drops the KL divergence penalty entirely, so long reasoning chains contribute to the gradient in proportion to their length instead of being down-weighted.
- Dynamic Sampling: Filters training batches to exclude prompts where all completions uniformly succeed or fail, ensuring only high-variance (informative) training signals reach the optimizer.
- Overlong Reward Shaping: A soft length-aware penalty for responses that exceed the length budget, counteracting reward hacking through response padding and reducing the noise that truncated completions inject into training.
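Two of these modifications are simple enough to sketch directly. The ε values mirror the settings reported in the DAPO paper; the functions are illustrative, not the released implementation:

```python
def clip_higher_objective(ratio: float, advantage: float,
                          eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Clip-Higher: the PPO surrogate min(r*A, clip(r)*A), but with a
    wider upper bound (1 + eps_high) than lower bound (1 - eps_low),
    leaving headroom to raise the probability of rare high-reward tokens."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_prompt(rewards: list[float]) -> bool:
    """Dynamic Sampling: drop prompts whose completions all succeed or
    all fail -- zero reward variance yields zero group-relative
    advantage and therefore no gradient signal."""
    return len(set(rewards)) > 1
```

With symmetric clipping (standard PPO), a token at ratio 1.5 with positive advantage would be capped at 1.2; the wider upper bound lets it reach 1.28, which is the exploration-preserving asymmetry the paper targets.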
DAPO reached 50 points on AIME 2024 with a Qwen2.5-32B base model, surpassing the prior GRPO-based result on the same architecture (47 points) while using half the training steps, per the ByteDance technical report. It is currently the most technically refined variant of group policy optimization in the public literature.
RLVR: When Verifiable Rewards Replaced Human Opinion
Reinforcement Learning from Verifiable Rewards (RLVR) is a training paradigm rather than a single algorithm. The defining characteristic: reward signals come from programmatic verification rather than human annotation.
Applicable domains include:
- Mathematics: The final numerical answer is correct or incorrect — no human needed to score it.
- Code execution: Does it compile? Does it pass the provided test suite?
- Formal reasoning: Can an automated proof checker validate the solution?
- Structured outputs: Does the extraction match the ground-truth schema?
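A verifiable reward needs no learned model at all. The checker below is a toy example: the `Answer:` marker and exact string match are assumptions for illustration; production pipelines normalize mathematical expressions and run sandboxed test suites.

```python
def math_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: extract the text after the final
    'Answer:' marker and compare it to the ground truth.
    Returns 1.0 on a match, 0.0 otherwise -- no human rater involved."""
    marker = "Answer:"
    if marker not in completion:
        return 0.0  # malformed output earns no reward
    answer = completion.rsplit(marker, 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

The reward is cheap, deterministic, and immune to the rater-inconsistency problem, which is exactly why it scales where preference annotation does not.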
RLVR was central to both DeepSeek-R1-Zero and the subsequent R1 model. Training on binary correctness signals for math and code problems caused the models to spontaneously develop reasoning behaviors — extended internal monologue, self-correction, step verification — that previously required explicit supervised training on human-written chain-of-thought examples.
OpenAI’s o-series models use a variant called process reward models (PRMs), which apply RLVR at the step level rather than only at the final answer — rewarding correct intermediate reasoning steps. This produces more reliable multi-step reasoning and reduces the frequency of correct answers reached through flawed methods. Google’s technical documentation for Gemini 2.5 Flash and Pro, released in Q4 2025, describes similar RLVR-based post-training stacks.
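The outcome-versus-process distinction reduces to where the reward attaches. The aggregation below (a product of per-step scores) is one common choice in the PRM literature, not OpenAI's disclosed recipe:

```python
def process_reward(step_scores: list[float]) -> float:
    """PRM-style process reward: every intermediate step gets a
    correctness score in [0, 1], and the chain's reward is their
    product -- one flawed step drags the whole chain down even if
    the final answer happens to be right."""
    reward = 1.0
    for score in step_scores:
        reward *= score
    return reward

# A chain with one weak middle step scores low even if later steps recover:
flawed = process_reward([0.95, 0.1, 0.9])
clean = process_reward([0.95, 0.9, 0.9])
```

An outcome-only reward would score both chains identically if their final answers matched; the process reward separates them, which is the mechanism behind fewer right-answers-for-wrong-reasons.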
The commercial pressure driving RLVR adoption is visible in the deals it enables: OpenAI’s $1 billion enterprise partnerships are predicated on the reasoning capability advantages that RLVR-trained models provide over prior RLHF-trained counterparts.
Synthetic Self-Play: Models Teaching Models
Synthetic self-play closes the data loop entirely. Instead of requiring humans to generate training data, the model produces problems, solutions, critiques, and improved solutions in an iterative improvement cycle.
The standard implementation follows four stages:
- A strong base model generates candidate solutions for a defined problem set.
- A verifier — reward model, formal checker, or the model acting as critic — scores each candidate.
- High-scoring completions are retained for supervised fine-tuning or used as positive examples in preference optimization.
- The improved model generates the next problem set at higher difficulty, continuing the cycle.
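The four stages above can be condensed into one loop. Here `generate` and `verify` are stand-ins for the model and the verifier, and the candidate count of 4 is an arbitrary choice for illustration:

```python
def self_play_round(problems, generate, verify, candidates_per_problem=4):
    """One generation-verification cycle: sample candidate solutions,
    keep only those the verifier accepts, and return (problem, solution)
    pairs for the next supervised fine-tuning or preference round."""
    accepted = []
    for problem in problems:
        for _ in range(candidates_per_problem):
            candidate = generate(problem)
            if verify(problem, candidate):
                accepted.append((problem, candidate))
    return accepted
```

Each round's output becomes the next round's training data; difficulty scheduling and verifier quality, not the loop itself, are where the engineering effort concentrates.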
Anthropic’s Constitutional AI methodology, published December 2022, was an early implementation of this principle — using the model to critique and revise its own outputs against explicit principles. More recent implementations operate at substantially larger scale. DeepSeek’s R1 technical report describes curating roughly 800,000 synthetic samples (around 600,000 of them reasoning traces) for supervised fine-tuning, with a smaller synthetic cold-start set substantially stabilizing the subsequent GRPO phase compared to starting from raw pretraining weights.
The infrastructure required for continuous generation-verification loops at this scale is substantial. The $10 billion AI data center buildout by Nebius in Finland, announced in early 2025, is representative of the dedicated compute investment these self-play pipelines demand at frontier scale.
The 2026 Post-Training Stack by Lab
| Lab | Primary Method(s) | Key Models | Notable Detail |
|---|---|---|---|
| DeepSeek | GRPO + RLVR | R1, R1-Zero, V3 | Pioneered GRPO; R1-Zero trained with zero human preference labels |
| ByteDance / Qwen | DAPO + RLVR | Qwen2.5, Qwen3 | DAPO fixes entropy collapse; 50 pts on AIME 2024 with Qwen2.5-32B |
| OpenAI | RLVR + PRMs + Self-Play | o1, o3, o4-mini | Step-level process reward models; thousands of self-play iterations per model |
| Google DeepMind | RLVR + Self-Play | Gemini 2.5 Flash, Pro | RLVR post-training confirmed in Q4 2025 technical documentation |
| Anthropic | Constitutional AI → RLVR | Claude 3.5, 3.7, Claude 4 | Pioneered model self-critique; RLVR adopted for reasoning capability layers |
| Meta | DPO + RLVR variants | Llama 3.x, Llama 4 | Direct Preference Optimization combined with code and math RLVR |
The convergence of all six major labs on RLVR variants within a 12-month window is one of the fastest methodology shifts in AI research history — driven not by academic consensus but by benchmark performance pressure in a competitive market where every point on AIME or SWE-bench translates directly to commercial positioning.
What RLHF’s Replacement Means for AI Alignment
The removal of human feedback from the core training loop has direct, unresolved consequences for AI safety. RLHF maintained a continuous human feedback role: every preference comparison was a human judgment about appropriate model behavior. RLVR replaces this with programmatic verification — powerful for domains with unambiguous correct answers, undefined for the domains alignment cares most about: nuanced ethical reasoning, harm refusal calibration, and sensitive topic handling.
The field has acknowledged the gap without closing it. Paul Christiano, formerly of OpenAI and founder of the Alignment Research Center, has proposed scalable oversight frameworks that use AI to help humans verify AI outputs — effectively AI-assisted alignment at the evaluation layer. Anthropic’s Constitutional AI attempts to encode human values as explicit self-applicable principles. Neither approach has been empirically demonstrated to scale to the capability levels RLVR is now producing.
The pattern is documented in work from Anthropic’s own agent safety team: capability gains consistently outpace alignment methodology. The Humans First movement’s core argument — that automated training removes human oversight precisely when more oversight is needed — is technically accurate. It has not slowed the transition by a single quarter.
The incentive structure is decisive. RLVR produces models that score 15–20 percentage points higher on AIME 2024, 10–15 points higher on SWE-bench Verified, and substantially better on long-context reasoning benchmarks. No competitive lab can sustain RLHF while competitors ship demonstrably stronger models using automated rewards. The economics of capability competition have resolved the debate that alignment research has not.
RLHF is a historical milestone. The post-training stack is now automated, iterative, and largely self-supervised. The urgent unresolved work is determining whether automated reward signals can encode human values precisely enough to make this safe — and the industry has not yet demonstrated that they can at current capability levels.