LeetLLM
© 2026 LeetLLM. All rights reserved.

RLVR & Verifiable Rewards

Understand RLVR, a post-training approach that uses programmatic verification instead of learned human-preference rewards to improve checked outcomes in math, code, and other contract-driven tasks.


Constitutional AI showed how written policy can guide AI feedback for behavior that still needs judgment. RLVR moves to a narrower setting: tasks where a program can check a concrete outcome.

Reinforcement Learning with Verifiable Rewards (RLVR) trains models from rewards assigned by executable checks rather than a learned preference model.[1] This chapter focuses on tasks where an answer, program, or structured action can be tested against a precise contract.

Imagine evaluating two model outputs. One drafts brand copy for a product page; feedback is subjective. The other computes whether a warehouse can ship 1,200 orders before a carrier cutoff; a calculator can check its arithmetic once the constraints are specified. The first requires judgment. The second has a verifiable component. Tülu 3 introduced the term RLVR for this type of post-training stage, and DeepSeek-R1 used rule-based rewards for reasoning tasks inside a broader multi-stage pipeline.[1][2]

Figure: RLVR training pipeline with prompt, candidate rollouts, automatic verifier, and policy update.
The RLVR loop: sample candidate answers, grade them with an executable verifier, then update the policy. PPO, GRPO, and other RL algorithms can consume verifiable rewards.
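
The three stages of that loop can be sketched on a toy problem. This is a minimal illustration, not an LLM trainer: the "policy" is a softmax over three hypothetical candidate answers to 12 × 14, the verifier is an exact-match check, and the update is a plain policy-gradient step with the group mean as baseline. The names `ANSWERS`, `rlvr_step`, the group size, and the learning rate are all assumptions made for the example.

```python
import math
import random

ANSWERS = ["168", "148", "156"]  # hypothetical candidate answers to "12 * 14"

def verifier(answer: str) -> float:
    """Executable check: reward 1.0 only for the correct product."""
    return 1.0 if answer == str(12 * 14) else 0.0

def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rlvr_step(logits: list[float], rng: random.Random,
              group_size: int = 8, lr: float = 0.5) -> list[float]:
    probs = softmax(logits)
    # 1. Sample candidate rollouts from the current policy.
    rollouts = rng.choices(range(len(ANSWERS)), weights=probs, k=group_size)
    # 2. Grade each rollout with the verifier.
    rewards = [verifier(ANSWERS[i]) for i in rollouts]
    # 3. Policy-gradient update, using the group mean reward as baseline.
    baseline = sum(rewards) / group_size
    new_logits = list(logits)
    for idx, reward in zip(rollouts, rewards):
        advantage = reward - baseline
        for j in range(len(logits)):
            # Gradient of log softmax: one-hot(sampled answer) minus probabilities.
            grad_log_prob = (1.0 if j == idx else 0.0) - probs[j]
            new_logits[j] += lr * advantage * grad_log_prob / group_size
    return new_logits

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
p_before = softmax(logits)[0]
for _ in range(50):
    logits = rlvr_step(logits, rng)
p_after = softmax(logits)[0]
print(f"P(correct) before={p_before:.2f}, after={p_after:.2f}")
```

Groups in which every rollout earns the same reward leave the logits unchanged in this sketch, because every advantage is zero.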

The alignment space

To understand where RLVR fits, we need to look at how Large Language Models (LLMs) are traditionally trained to follow instructions.

Most instruction-following pipelines use Supervised Fine-Tuning (SFT), where the model imitates selected demonstrations. Think of SFT like showing a support model how an expert handles a damaged-item return, step by step. It teaches patterns present in those examples; it doesn't directly reward a newly sampled solution for satisfying an external checker.

One post-training route is Reinforcement Learning from Human Feedback (RLHF)[3], where a reward model is trained to predict preference labels. That signal fits qualities such as tone or helpfulness, but depends on label collection and reward-model validity outside its training examples.

Direct Preference Optimization (DPO)[4] removes the online reward-model optimization loop and trains directly from chosen/rejected pairs. Those pairs can come from humans, constitutions, or other pipelines, but they are still preference comparisons rather than executed correctness checks.

RLVR takes a different path. It specializes in tasks where success can be programmatically verified, such as final-answer math, code under tests, or formally checked proofs. It avoids collecting a preference label for each rollout, but its validity is only as strong as the specification, parser, test suite, and checking code that assign the reward.

| Method | Reward Signal | Scalability | Task Scope |
|---|---|---|---|
| SFT | Demonstration data | Limited by demonstration supply | Any task with demonstrations |
| RLHF | Learned reward from preference labels | Requires labels and reward-model checks | Open-ended behavior goals |
| DPO | Chosen/rejected pairs | Avoids online reward-model RL | Tasks expressible as preferences |
| RLVR | Executable verification | Repeatable if verifier is cheap | Outcomes covered by a verifier |

This is the central trade: RLVR gives up breadth to get a directly executable signal. In system design, the first question isn't "Which alignment method sounds modern?" It's "What behavior can this checker prove?" A logistics lab might verify route capacity and cutoff-time constraints by code. It still needs judgment for whether a delivery promise is acceptable to a customer.

Figure: decision router for choosing SFT, RLHF, DPO, or RLVR based on whether the task has demonstrations, preference comparisons, or a programmatic verifier.
The first design question isn't which post-training acronym is newest. It is what kind of supervision the task can honestly provide.

Defining verifiable rewards

In the simplest outcome-only setup, a verifiable reward function $r(x, y)$ takes a problem $x$ and a generated response $y$ and returns a binary signal:

$$r(x, y) = \begin{cases} 1 & \text{if } y \text{ is correct (verified)} \\ 0 & \text{if } y \text{ is incorrect} \end{cases}$$

To see why this matters, imagine a prompt that asks: "Is 97 a prime number? Format your answer as $\boxed{Yes}$ or $\boxed{No}$."

  • Iteration 1: The model says "Yes, because it ends in 7." The final answer happens to be correct, so an outcome-only verifier returns 1.0, even though the stated reason is invalid.
  • Iteration 2: The model says "No, 97/3 = 32.3." The answer is wrong, so the verifier returns 0.0.
  • Iteration 3: The model says "97 is prime because it's not divisible by 2, 3, 5, or 7." The answer is correct and the reasoning is solid, so the verifier returns 1.0.

The important limitation is visible immediately: for this prompt, Iterations 1 and 3 get the same reward. An outcome verifier does not tell the optimizer which correct answer used valid reasoning. Across many problems, reasoning patterns that correlate with correct outcomes may become more likely, but that is an empirical training result, not information contained in a single final-answer reward.

The following tiny check makes that limitation concrete. It verifies only the boxed final verdict, so an unsupported guess and a valid divisibility check receive identical reward.

outcome-reward-blind-spot.py
    import re

    def verdict_reward(response: str, expected: str) -> float:
        # Reward 1.0 only when exactly one boxed verdict is present and it matches.
        matches = re.findall(r"\\boxed\{(Yes|No)\}", response)
        return 1.0 if matches == [expected] else 0.0

    rollouts = {
        "lucky_reason": r"It ends in 7, so it must be prime. \boxed{Yes}",
        "valid_check": r"Check divisors at most sqrt(97): 2, 3, 5, 7 fail. \boxed{Yes}",
        "wrong_answer": r"97 is divisible by 3. \boxed{No}",
    }

    for name, response in rollouts.items():
        print(f"{name}: reward={verdict_reward(response, 'Yes')}")
Outcome reward blind spot
    lucky_reason: reward=1.0
    valid_check: reward=1.0
    wrong_answer: reward=0.0

Many RLVR systems start with an all-or-nothing outcome signal because a deterministic checker can compute it repeatedly. Others add rule-based terms, so reward remains executable without being purely binary. The trade-off is sparse credit assignment: if a response makes one late arithmetic error, a final-answer checker gives zero even if earlier steps were useful.

Key insight: Verifiable means checkable, not complete. A warehouse inventory checker can prove whether the reported total matches its inputs; it can't prove that a correct total came from a general strategy unless the evaluation tests that generalization.

Figure: verifier contract. Raw model output passes through extraction, normalization, deterministic checking, and fail-closed reward assignment.
A verifier is a production contract: extract answer, normalize it, check it deterministically, and fail closed on ambiguity.

In practice, systems often mix correctness rewards with other rule-based terms. DeepSeek-R1-Zero, for example, combined accuracy rewards with format rewards so outputs stayed easy to parse and verify.[2]
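
A combined reward of this kind might look like the sketch below. It is an illustration under stated assumptions, not DeepSeek's implementation: the `<think>...</think>` layout check, the single boxed answer, the function names, and the `format_weight` of 0.1 are all choices made for the example.

```python
import re

BOXED = re.compile(r"\\boxed\{([^{}]+)\}")
THINK_BLOCK = re.compile(r"\A\s*<think>.+?</think>", re.DOTALL)

def accuracy_reward(response: str, expected: str) -> float:
    """1.0 only when exactly one boxed answer is present and it matches."""
    answers = BOXED.findall(response)
    return 1.0 if answers == [expected] else 0.0

def format_reward(response: str) -> float:
    """1.0 when the response opens with a non-empty <think>...</think> block."""
    return 1.0 if THINK_BLOCK.match(response) else 0.0

def combined_reward(response: str, expected: str, format_weight: float = 0.1) -> float:
    # format_weight is an illustrative assumption, not a published value.
    return accuracy_reward(response, expected) + format_weight * format_reward(response)

samples = {
    "formatted_correct": r"<think>12 * 14 = 168</think> \boxed{168}",
    "unformatted_correct": r"\boxed{168}",
    "formatted_wrong": r"<think>12 * 14 = 148</think> \boxed{148}",
}
scores = {name: combined_reward(text, "168") for name, text in samples.items()}
for name, score in scores.items():
    print(f"{name}: {score:.1f}")
```

Keeping the format term small relative to the accuracy term means a well-formatted wrong answer still scores far below any correct one; the right ratio is a tuning decision.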

Outcome vs. process supervision

The simplest form is outcome supervision: score the final answer. An RLVR setup can do this with a rule-based checker. Related work has also trained learned verifiers to score final answers.[5] Either way, the feedback is sparse: one score per response.

Process supervision scores intermediate steps instead of only the final result. Lightman et al. compare outcome and process reward models trained with human labels on mathematical reasoning steps; that is adjacent to RLVR, not itself proof that each process score is programmatically verifiable.[6] A formal system can instead run a checker after each proof step. Either version offers more localized credit, but requires a trustworthy step-level signal.

This flowchart visualizes the difference between outcome and process supervision, highlighting the trade-off between scalability and signal quality:

Figure: outcome supervision path. Prompt + candidate response, outcome verifier, final answer reward 0 or 1; cheap and scalable.

Caption: Outcome supervision scores the final answer. Process supervision supplies step-level feedback, either from labels or from a checker when one exists, at higher construction cost.
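
When steps have a checkable form, a step-level checker can be very small. The sketch below assumes a hypothetical trace format in which every step is a line of the form `a op b = c`; it verifies each line's arithmetic on its own and compares that with an outcome-only score. It doesn't check that the steps are relevant to the problem, which a real process verifier would also need to address.

```python
import re

STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def step_rewards(steps: list[str]) -> list[float]:
    """Score each 'a op b = c' line; unparseable lines fail closed."""
    rewards = []
    for line in steps:
        match = STEP.match(line)
        if match is None:
            rewards.append(0.0)
            continue
        a, op, b, claimed = match.groups()
        rewards.append(1.0 if OPS[op](int(a), int(b)) == int(claimed) else 0.0)
    return rewards

def outcome_reward(steps: list[str], expected: int) -> float:
    """Score only the value claimed on the final line."""
    match = STEP.match(steps[-1]) if steps else None
    return 1.0 if match and int(match.group(4)) == expected else 0.0

# 12 * 14 computed as 12*10 + 12*4, with a slip in the last step.
trace = ["12 * 10 = 120", "12 * 4 = 48", "120 + 48 = 158"]
per_step = step_rewards(trace)
final_only = outcome_reward(trace, 168)
print(per_step)
print(final_only)
```

The outcome score is zero for the whole trace, while the step scores credit the two correct lines and isolate the late error.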

Examples of verifiers

Mathematics

We check if the final answer matches the ground truth, even if the reasoning path is different. The verifier below extracts exactly one boxed expression from the model's response and uses sympy to verify mathematical equivalence. This handles cases like $1/2$ versus $0.5$ correctly, while rejecting missing or ambiguous answer fields.

The following implementation is small enough to run locally, but it still captures the core production contract: extract the final answer, compare it symbolically, and fail closed on parsing errors. Production math verifiers add more normalization, timeouts, and adversarial parser tests.

mathematics.py
    import sympy as sp

    def extract_boxed_exprs(text: str) -> list[str]:
        """Return the contents of every \\boxed{...}; [] if any box is unclosed."""
        marker = "\\boxed{"
        expressions: list[str] = []
        offset = 0
        while (start := text.find(marker, offset)) != -1:
            depth = 0
            expr_chars: list[str] = []
            for ch in text[start + len(marker):]:
                if ch == "{":
                    depth += 1
                    expr_chars.append(ch)
                elif ch == "}":
                    if depth == 0:
                        expressions.append("".join(expr_chars).strip())
                        offset = start + len(marker) + len(expr_chars) + 1
                        break
                    depth -= 1
                    expr_chars.append(ch)
                else:
                    expr_chars.append(ch)
            else:
                # No closing brace was found: fail closed.
                return []
        return expressions

    def math_verifier(problem: str, answer: str, ground_truth: str) -> float:
        """Verify mathematical equivalence of the final answer."""
        predicted_fields = extract_boxed_exprs(answer)
        expected_fields = extract_boxed_exprs(ground_truth)
        if len(predicted_fields) != 1 or len(expected_fields) != 1:
            return 0.0
        try:
            predicted = sp.sympify(predicted_fields[0])
            expected = sp.sympify(expected_fields[0])
            return 1.0 if sp.simplify(predicted - expected) == 0 else 0.0
        except (TypeError, ValueError, sp.SympifyError):
            return 0.0

    equivalent_score = math_verifier("Compute half of one.", r"The answer is \boxed{0.5}.", r"\boxed{1/2}")
    wrong_score = math_verifier("Compute 12 times 14.", r"\boxed{148}", r"\boxed{168}")
    ambiguous_score = math_verifier("Compute 12 times 14.", r"Maybe \boxed{168} or \boxed{148}.", r"\boxed{168}")
    print(equivalent_score)
    print(wrong_score)
    print(ambiguous_score)
    print(f"equivalent_passes={equivalent_score == 1.0}")
    print(f"wrong_answer_rejected={wrong_score == 0.0}")
    print(f"ambiguous_answer_rejected={ambiguous_score == 0.0}")
Output
    1.0
    0.0
    0.0
    equivalent_passes=True
    wrong_answer_rejected=True
    ambiguous_answer_rejected=True

Code generation

Running generated code against a strong suite of hidden test cases checks concrete behavior. It doesn't prove full correctness unless the test suite is complete, but it is stronger than judging code by surface form. The local example below runs candidate code in a separate Python process and times out slow attempts. It isn't a security sandbox; production systems still need containers, seccomp, filesystem isolation, network controls, and resource limits.

code-generation.py
    import json
    import subprocess
    import sys
    from dataclasses import dataclass

    @dataclass
    class TestCase:
        inputs: tuple[int, ...]
        expected: int

    def code_verifier(problem: str, code: str, test_cases: list[TestCase]) -> float:
        """Execute candidate in a separate process and fail closed."""
        cases = [(case.inputs, case.expected) for case in test_cases]
        # The harness is appended after the candidate code. Candidate and harness
        # share stdout here, so a hostile candidate could print a fake result;
        # this is one more reason the example is not a sandbox.
        runner = (
            code
            + "\nimport json\n"
            + f"cases = {cases!r}\n"
            + "passed = all(solve(*inputs) == expected for inputs, expected in cases)\n"
            + "print(json.dumps({'passed': passed}))\n"
        )
        try:
            completed = subprocess.run(
                [sys.executable, "-I", "-c", runner],
                capture_output=True,
                text=True,
                timeout=1.0,
                check=False,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if completed.returncode != 0:
            return 0.0
        try:
            result = json.loads(completed.stdout)
        except json.JSONDecodeError:
            return 0.0
        return 1.0 if result == {"passed": True} else 0.0

    cases = [TestCase((14, 12), 168), TestCase((8, 12), 96)]
    good = "def solve(a, b):\n    return a * b\n"
    bad = "def solve(a, b):\n    return a + b\n"

    good_score = code_verifier("multiply two integers", good, cases)
    bad_score = code_verifier("multiply two integers", bad, cases)
    print(good_score)
    print(bad_score)
    print(f"good_passes={good_score == 1.0}")
    print(f"bad_rejected={bad_score == 0.0}")
Output
    1.0
    0.0
    good_passes=True
    bad_rejected=True

Visible tests aren't enough when the policy can memorize examples or infer shortcuts. In this miniature check, a candidate passes the two examples shown during development but fails a held-out case. A reward based only on visible tests would reinforce the wrong program.

held-out-code-tests.py
    from collections.abc import Callable

    def shortcut_multiplier(a: int, b: int) -> int:
        # Memorizes the visible examples instead of multiplying.
        known = {(14, 12): 168, (8, 12): 96}
        return known.get((a, b), 0)

    def pass_rate(fn: Callable[[int, int], int], cases: list[tuple[int, int, int]]) -> float:
        passed = sum(fn(a, b) == expected for a, b, expected in cases)
        return passed / len(cases)

    visible = [(14, 12, 168), (8, 12, 96)]
    held_out = [(7, 9, 63), (11, 13, 143)]

    print(f"visible_pass_rate={pass_rate(shortcut_multiplier, visible):.0%}")
    print(f"held_out_pass_rate={pass_rate(shortcut_multiplier, held_out):.0%}")
    print(f"ship={pass_rate(shortcut_multiplier, held_out) == 1.0}")
Held-out test check
    visible_pass_rate=100%
    held_out_pass_rate=0%
    ship=False

Formal logic

Proof assistants such as Lean or Isabelle can check that a candidate proof term satisfies a formal theorem under their kernel and imported definitions. In production, a verifier would wrap the theorem statement and candidate proof into a Lean script, then return 1.0 only if the prover accepts the full artifact.

The prover call itself is infrastructure-specific: it depends on your Lean version, package cache, sandbox, and timeout policy. Keep the reward conversion boring and testable, and keep the theorem-prover runner behind a clear boundary.

formal-logic.py
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProverResult:
        accepted: bool
        stderr: str

    def proof_reward(result: ProverResult) -> float:
        """Convert a theorem prover result into a verifiable reward."""
        return 1.0 if result.accepted else 0.0

    accepted = proof_reward(ProverResult(accepted=True, stderr=""))
    rejected = proof_reward(ProverResult(accepted=False, stderr="unknown identifier"))
    print(accepted)
    print(rejected)
    print(f"accepted_maps_to_one={accepted == 1.0}")
    print(f"rejected_maps_to_zero={rejected == 0.0}")
Output
    1.0
    0.0
    accepted_maps_to_one=True
    rejected_maps_to_zero=True
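
One way to keep the prover runner behind a boundary is to pass it in as a function. The sketch below is hypothetical throughout: `build_lean_source`, `verify_proof`, and `fake_prover` are illustrative names, the runner is a stand-in that never invokes Lean, and the wrapper assumes a single-line tactic proof. It also shows one fail-closed rule worth having: Lean reports `sorry` as a warning rather than an error, so a verifier that only looks at acceptance could reward an unfinished proof.

```python
from collections.abc import Callable
from dataclasses import dataclass

@dataclass(frozen=True)
class ProverResult:
    accepted: bool
    stderr: str

def build_lean_source(statement: str, proof: str) -> str:
    """Wrap a theorem statement and a one-line tactic proof into Lean source text."""
    return f"theorem candidate : {statement} := by\n  {proof}\n"

def verify_proof(statement: str, proof: str,
                 run_prover: Callable[[str], ProverResult]) -> float:
    # Crude fail-closed check: reject any proof text that mentions `sorry`.
    if "sorry" in proof:
        return 0.0
    try:
        result = run_prover(build_lean_source(statement, proof))
    except Exception:
        return 0.0  # a runner crash or timeout counts as rejection
    return 1.0 if result.accepted else 0.0

def fake_prover(source: str) -> ProverResult:
    """Stand-in for a real Lean invocation; accepts one hard-coded proof."""
    ok = "2 + 2 = 4" in source and "rfl" in source
    return ProverResult(accepted=ok, stderr="" if ok else "type mismatch")

def crashing_prover(source: str) -> ProverResult:
    raise TimeoutError("prover timed out")

print(verify_proof("2 + 2 = 4", "rfl", fake_prover))
print(verify_proof("2 + 2 = 4", "sorry", fake_prover))
print(verify_proof("2 + 2 = 4", "rfl", crashing_prover))
```

Because the runner is injected, the reward logic can be unit-tested without a Lean installation, and the real runner can change its version, cache, or sandbox without touching the reward code.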

What can't be verified?

RLVR is powerful but narrow. It works best when the task has an objective spec and a verifier that fails closed. That rules out many open-ended tasks, or at least forces you to break them into smaller verifiable sub-problems. It's not a clean fit for:

  • Open-ended writing: "Write three product-page headlines for a merchant catalog" has no objective truth.
  • Subjective analysis: "Explain the brand implications of a supply-chain delay" has many valid answers.
  • Safety alignment: Deciding if a response is harmful usually requires policy judgment or an AI proxy, which is closer to RLHF or RLAIF than a deterministic correctness check.

Group relative policy optimization (GRPO)

Group Relative Policy Optimization (GRPO) was introduced in DeepSeekMath.[7] It avoids a learned value model by estimating advantages from a group of samples for the same prompt. DeepSeek-R1 later used GRPO in its reasoning-training stages.[2]

The problem with PPO

In standard Proximal Policy Optimization (PPO)[8], we need a Critic model (or Value Function) that predicts "how good is this state?". The Critic estimates the expected future reward $V_\phi(s_t)$.

In the terminal-reward setting common to LLM training (where the reward is observed only at the end of a sequence), the advantage simplifies to:

$$A_t = R - V_\phi(s_t)$$

Where $R$ is the final reward and $V_\phi$ is the critic's value estimate. More generally, the advantage uses a temporal-difference error $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, often computed via GAE (Generalized Advantage Estimation) over the full sequence.

Training a critic alongside an LLM adds model memory and compute. With terminal rewards, it must also predict eventual success from partial generations, which becomes difficult when a long response can later revise an earlier mistake.[2]
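
The relationship between the general TD form and the terminal-reward shortcut can be checked numerically. This is a small sketch with made-up critic values: `gae_advantages` accumulates $\delta_t$ backward with discount $\gamma$ and GAE parameter $\lambda$, and with $\gamma = \lambda = 1$ and a single terminal reward it should reproduce $R - V_\phi(s_t)$ at every position. The value after the final token is assumed to be 0.

```python
def gae_advantages(rewards: list[float], values: list[float],
                   gamma: float = 1.0, lam: float = 1.0) -> list[float]:
    """values[t] is the critic estimate V(s_t); the value after the last token is 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Four generated tokens; the verifier reward arrives only at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.9]  # hypothetical critic estimates

full_gae = gae_advantages(rewards, values)           # gamma = lam = 1
shortcut = [1.0 - v for v in values]                 # R - V(s_t)
one_step = gae_advantages(rewards, values, lam=0.0)  # advantages equal the TD errors

print([round(a, 3) for a in full_gae])
print([round(a, 3) for a in shortcut])
print([round(a, 3) for a in one_step])
```

Every number here depends on the critic's estimates, which is the part GRPO removes.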

The GRPO solution

GRPO eliminates the separate Critic model entirely. Instead of learning a value function that tries to predict how good any reasoning trajectory is, GRPO computes advantages relative to the other samples generated for the exact same prompt in the current batch.

The intuition is direct for verifiable tasks. For a given math problem or code task, sample several outputs. Some reach the checked outcome and receive high reward; others fail and receive low reward. The successful outputs are better than average for this sampled group. You don't need a learned critic that estimates difficulty across prompts; you do need reward variation inside each group.

Mathematically, for a prompt $q$ you sample a group of $G$ outputs $\{o_1, \dots, o_G\}$. The advantage for output $i$ is the z-score of its reward within that group:

$$A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}$$

Where $r_i$ is sample $i$'s reward and normalization is done within the sampled group of size $G$ for the same prompt.

Positive advantages push the policy toward those trajectories; negative advantages push it away. Local normalization removes the separate critic-estimation problem and compares attempts at the same prompt difficulty. It doesn't guarantee stable training: if all sampled outputs receive the same reward, the group contributes no comparative signal.

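Here is a concrete example with $G = 8$ samples for a single math prompt. The verifier returns 1.0 for a correct final answer and 0.0 for an incorrect one:

| Sample | Answer | Reward $r_i$ | Group Mean | Group Std | Advantage $A_i$ |
|---|---|---|---|---|---|
| 1 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
| 2 | 148 | 0.0 | 0.5 | 0.5 | -1.0 |
| 3 | 156 | 0.0 | 0.5 | 0.5 | -1.0 |
| 4 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
| 5 | 170 | 0.0 | 0.5 | 0.5 | -1.0 |
| 6 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
| 7 | 144 | 0.0 | 0.5 | 0.5 | -1.0 |
| 8 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |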

How to read the table: Four samples got the right answer (168), so their rewards are 1.0. Four got it wrong, so their rewards are 0.0. The group mean is $4/8 = 0.5$, and the population standard deviation is 0.5. Sample 1's advantage is $(1.0 - 0.5) / 0.5 = +1.0$. That positive advantage tells the optimizer to increase the probability of the tokens that led to sample 1. Sample 2's advantage is $(0.0 - 0.5) / 0.5 = -1.0$, so the optimizer pushes down the probability of the tokens that led to sample 2.

Notice what the table hides as well as what it shows. Every correct answer gets the same positive advantage, and every wrong answer gets the same negative advantage. GRPO doesn't know why sample 1 was correct; it only knows that sample 1 was better than the average for this prompt. Over many prompts, updates shift probability toward whatever patterns tend to produce above-average rewards; whether those patterns are sound reasoning is an empirical question, not something the reward guarantees.
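
The worked table is a balanced group, which makes every advantage exactly ±1. With binary rewards and a pass rate $p$ inside the group, the z-score works out to $\sqrt{(1-p)/p}$ for a pass and $-\sqrt{p/(1-p)}$ for a failure, so a rare success receives a larger push than a common one. The sketch below checks this with the standard library, using the population standard deviation and returning zeros for a zero-variance group; the function name is an illustrative choice.

```python
import math
from statistics import fmean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Z-score each reward within its group; zero-variance groups get zeros."""
    std = pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    mean = fmean(rewards)
    return [(r - mean) / std for r in rewards]

balanced = group_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # 4 of 8 pass
rare_success = group_advantages([1.0] + [0.0] * 7)                     # 1 of 8 passes
all_fail = group_advantages([0.0] * 8)

print([round(a, 3) for a in balanced])
print([round(a, 3) for a in rare_success])
print(all_fail)
```

In the one-in-eight group the single pass gets an advantage of $\sqrt{7}$ while each failure gets $-1/\sqrt{7}$.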

Figure: GRPO group advantage. Eight sampled answers, binary verifier rewards, group mean 0.5, group standard deviation 0.5, and positive or negative advantages.
GRPO replaces a learned critic with group-relative statistics. Samples that beat group mean get positive advantage; samples below the mean get pushed down.

At a high level, the update looks like a PPO-clipped objective with group-relative advantage and a KL penalty. The published GRPO objective is applied at generated-token level; this sequence-level sketch keeps the two ideas visible without reproducing all implementation detail:

$$\begin{aligned} \mathcal{L}_{\text{GRPO}} = -\frac{1}{G} \sum_{i=1}^{G} \Big[ &\min\left(\rho_i A_i,\; \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i\right) \\ &- \beta D_{\text{KL}}(\pi_\theta \|\pi_{\text{ref}}) \Big] \end{aligned}$$

Where $\rho_i = \frac{\pi_\theta(o_i|q)}{\pi_{\text{old}}(o_i|q)}$ is a sequence-level policy-ratio shorthand, $A_i$ is the normalized group advantage, $\epsilon$ is the clipping threshold, and $\beta$ controls the drift penalty toward the reference policy.
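
To see what the min and clip terms do, it helps to evaluate the surrogate for one sample at a few policy ratios. This is a scalar sketch of the clipped term only, without the KL penalty; `clipped_surrogate` is an illustrative name and $\epsilon = 0.2$ is an assumed setting.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = {
    "good sample, ratio already high": clipped_surrogate(1.5, +1.0),  # capped at 1.2
    "good sample, ratio low": clipped_surrogate(0.5, +1.0),           # unclipped: 0.5
    "bad sample, ratio already low": clipped_surrogate(0.5, -1.0),    # capped at -0.8
    "bad sample, ratio high": clipped_surrogate(1.5, -1.0),           # unclipped: -1.5
}
for name, value in cases.items():
    print(f"{name}: {value:.1f}")
```

The surrogate stops rewarding further movement once the ratio has already moved in the direction the advantage favors by more than $\epsilon$, but it doesn't cap the penalty when the ratio has moved the wrong way.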

The Kullback-Leibler (KL) term penalizes drift from a reference policy. It can constrain movement toward a verifier exploit; it can't prove that the verifier represents the intended behavior.

One caveat falls straight out of the formula: if every sample in the group gets the same reward, the standard deviation is 0 and vanilla GRPO gets no learning signal from that group. That's one reason prompt difficulty, group size, and reward shaping matter in practice.

Common mistake: Saying that three equal rewards out of four collapse the GRPO signal. They don't: the one different outcome creates reward variance. Signal disappears when every sampled output gets the same reward, such as all failures on a prompt that is too hard or all passes on a prompt that is too easy. DeepSeek-R1 reports sampling 16 outputs per question in its reasoning RL setups.[2]

This diagnostic identifies which prompt groups supply comparative signal before an update. The mixed group is usable even though three out of four answers fail; the all-fail and all-pass groups carry no relative ranking information.

reward-spread-diagnostic.py
    import torch

    groups = {
        "mixed": torch.tensor([0.0, 0.0, 0.0, 1.0]),
        "all_fail": torch.tensor([0.0, 0.0, 0.0, 0.0]),
        "all_pass": torch.tensor([1.0, 1.0, 1.0, 1.0]),
    }

    for name, rewards in groups.items():
        # Population standard deviation; zero means no comparative signal.
        std = float(rewards.std(unbiased=False))
        informative = std > 0.0
        print(f"{name}: std={std:.3f}, informative={informative}")
Reward spread diagnostic
    mixed: std=0.433, informative=True
    all_fail: std=0.000, informative=False
    all_pass: std=0.000, informative=False
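
The diagnostic above inspects rewards after sampling. Before sampling, a rough estimate of how often a group will be uninformative follows from the pass rate and group size. Under the simplifying assumption that rollouts for a prompt pass independently with the same probability $p$, a group of size $G$ is all-pass or all-fail with probability $p^G + (1-p)^G$. The function name below is illustrative.

```python
def uninformative_probability(pass_rate: float, group_size: int) -> float:
    """P(all pass) + P(all fail), assuming independent rollouts with equal pass rate."""
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size

for pass_rate in (0.5, 0.05):
    for group_size in (4, 16):
        prob = uninformative_probability(pass_rate, group_size)
        print(f"pass_rate={pass_rate}, G={group_size}: P(no signal)={prob:.3f}")
```

For a hard prompt with a 5% pass rate, a group of 4 carries no signal about 81% of the time under this assumption, and a group of 16 about 44% of the time, which illustrates why both prompt difficulty and group size matter.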

Here is a runnable scalar version of the GRPO math. It treats each sampled trajectory as one log-probability value, which is enough to test the group-normalized advantage, PPO-style clipping, and KL penalty. A production trainer does this at token level, batches generation across all $G$ samples, and stores the old-policy log probabilities during rollout.

the-grpo-solution.py
    import torch

    def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
        std = rewards.std(unbiased=False)
        if float(std) == 0.0:
            # Every sample got the same reward: no comparative signal.
            return torch.zeros_like(rewards)
        return (rewards - rewards.mean()) / (std + 1e-8)

    def clipped_grpo_loss(
        log_probs_new: torch.Tensor,
        log_probs_old: torch.Tensor,
        log_probs_ref: torch.Tensor,
        advantages: torch.Tensor,
        epsilon: float = 0.2,
        beta: float = 0.01,
    ) -> torch.Tensor:
        # PPO-style clipped surrogate on the new/old policy ratio.
        ratios = torch.exp(log_probs_new - log_probs_old)
        clipped_ratios = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
        policy_loss = -torch.min(ratios * advantages, clipped_ratios * advantages).mean()

        # Non-negative penalty on drift from the reference policy.
        log_ratio_ref = log_probs_new - log_probs_ref
        ratio_ref = torch.exp(log_ratio_ref)
        kl = (ratio_ref * log_ratio_ref - ratio_ref + 1).mean()
        return policy_loss + beta * kl

    rewards = torch.tensor([1, 0, 0, 1, 0, 1, 0, 1], dtype=torch.float32)
    advantages = group_advantages(rewards)
    advantages_match = bool(torch.allclose(advantages, torch.tensor([1, -1, -1, 1, -1, 1, -1, 1], dtype=torch.float32)))

    trajectory_log_probs = torch.nn.Parameter(torch.zeros(8))
    old_log_probs = torch.zeros(8)
    reference_log_probs = torch.zeros(8)

    optimizer = torch.optim.SGD([trajectory_log_probs], lr=0.1)
    loss = clipped_grpo_loss(trajectory_log_probs, old_log_probs, reference_log_probs, advantages)
    loss.backward()

    # Positive-advantage trajectories should be pushed up; negative ones down.
    grad_signs_ok = (
        trajectory_log_probs.grad is not None
        and trajectory_log_probs.grad[0] < 0
        and trajectory_log_probs.grad[1] > 0
    )

    optimizer.step()
    step_direction_ok = trajectory_log_probs[0] > 0 and trajectory_log_probs[1] < 0

    print("advantages:", [round(float(x), 1) for x in advantages])
    print("updated log probs:", [round(float(x), 3) for x in trajectory_log_probs.detach()])
    print(f"advantages_match={advantages_match}")
    print(f"grad_signs_ok={bool(grad_signs_ok)}")
    print(f"step_direction_ok={bool(step_direction_ok)}")
Output
    advantages: [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
    updated log probs: [0.013, -0.013, -0.013, 0.013, -0.013, 0.013, -0.013, 0.013]
    advantages_match=True
    grad_signs_ok=True
    step_direction_ok=True

GRPO is an optimizer, not a synonym for RLVR. Tülu 3 trained its RLVR stage with PPO, while DeepSeekMath introduced GRPO as a PPO variant and DeepSeek-R1 used GRPO in its reasoning stages.[1][7][2] Both optimizers can consume rule-based verifier rewards.

| Feature | PPO-style policy optimization | GRPO |
|---|---|---|
| Baseline | Predicted by a separate Critic model | Mean reward of the sampled group |
| Additional value-model cost | Trains a critic/value model | Avoids a separate critic |
| Main estimation risk | Critic must predict returns | Groups with equal rewards give no relative signal |
| Compatible reward signals | Learned or rule-based reward | Learned or rule-based reward, including binary checks |

Self-check: In the worked table above, what would happen to the advantages if all 8 samples got the same reward? Why does that make prompt difficulty and group size important in practice?

Why GRPO often fits RLVR

Memory efficiency

GRPO doesn't need to hold a large critic model in GPU memory alongside the policy.

Comparative baseline

The group mean is local to the current prompt. GRPO avoids training a critic to predict returns from partial reasoning traces, though it still depends on reward spread, a sound verifier, and well-tuned optimization.

Binary rewards are a natural fit

Binary $\{0,1\}$ rewards make the group comparison easy to read: when a group contains both passes and failures, passes receive positive advantage and failures negative advantage. If all candidates pass or all fail, their advantages are zero. No value function is needed, but prompt selection and exploration must produce mixed outcomes often enough to learn.
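How often a group carries any signal can be estimated with a short calculation. The sketch below assumes each of the G rollouts for a prompt passes independently with the same probability p, which is a simplification (real rollouts share a prompt and sampling settings). Under that assumption the chance of a mixed group is 1 - p^G - (1 - p)^G. The function name `mixed_group_probability` is illustrative.

```python
def mixed_group_probability(pass_rate: float, group_size: int) -> float:
    """Probability that a group holds at least one pass and one failure.

    Assumes independent rollouts with an identical pass rate, which is an
    illustrative simplification rather than a model of real sampling.
    """
    if not 0.0 <= pass_rate <= 1.0 or group_size < 1:
        raise ValueError("pass_rate must be in [0, 1] and group_size >= 1")
    all_pass = pass_rate ** group_size
    all_fail = (1.0 - pass_rate) ** group_size
    return 1.0 - all_pass - all_fail


for p in (0.01, 0.10, 0.50, 0.90):
    row = ", ".join(
        f"G={g}: {mixed_group_probability(p, g):.2f}" for g in (4, 8, 16)
    )
    print(f"pass_rate={p:.2f} -> {row}")
```

With a 1% pass rate and 8 samples, only about 8% of groups are mixed under this model, so most sampled groups would contribute zero advantage. That is one way to see why prompt difficulty and group size matter.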

DeepSeek-R1 training pipeline

The success of RLVR depends on the pipeline, not only the algorithm. DeepSeek-R1[2] used a specific multi-stage process to bootstrap reasoning.

This diagram summarizes DeepSeek-R1's reported multi-stage pipeline, where supervised fine-tuning and reinforcement learning alternate to improve checked reasoning outcomes, readability, and broader assistant behavior:

[Diagram: Base model; Stage 1: cold-start SFT (readable reasoning); Stage 2: reasoning RL (GRPO + rule-based rewards); rule-based rewards: accuracy + format + language consistency.]

Caption: DeepSeek-R1's reported pipeline. Stage 1 targets readable cold-start behavior. Stage 2 applies GRPO with rule-based rewards on reasoning tasks, including a language-consistency reward. Stage 3 collects filtered traces into SFT data and mixes in broader instruction data. Stage 4 combines reasoning rewards with general-behavior reward models.

Stage 1: cold start (not mandatory, often useful)

DeepSeek-R1-Zero showed that DeepSeek-V3-Base could improve on reported reasoning evaluations through rule-based RL without a preliminary SFT phase.[2] That result doesn't mean every base model or verifier will train usefully from zero-reward-heavy rollouts.

DeepSeek still added cold-start data for DeepSeek-R1 because R1-Zero had poor readability and mixed languages. The paper describes collecting thousands of long-CoT examples, filtering for a readable response pattern, and using them to initialize the model before RL.[2] In other projects, cold start also helps when the base model's initial pass rate is so low that almost every rollout gets zero reward.

Stage 2: reasoning RL (GRPO)

This stage trains on math, code, and logic prompts using GRPO with rule-based rewards. In DeepSeek-R1-Zero, those rewards were accuracy plus format. In DeepSeek-R1, the first RL stage also added a language-consistency reward to reduce mixed-language chains of thought, with the paper reporting a slight performance trade-off in its ablation.[2]
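A rule-based reward of this shape can be sketched in a few lines. Everything specific here is an assumption for illustration: the `<think>...</think>` plus `\boxed{...}` response layout, the ASCII-token proxy for language consistency, and the `language_weight` value. None of it is taken from the DeepSeek-R1 implementation.

```python
import re

# Hypothetical layout: <think>reasoning</think> followed by one \boxed{integer}.
RESPONSE = re.compile(
    r"\A\s*<think>(.*?)</think>\s*\\boxed\{(-?\d+)\}\s*\Z", re.DOTALL
)


def language_consistency(reasoning: str) -> float:
    """Toy proxy: share of whitespace-separated tokens that are pure ASCII."""
    words = reasoning.split()
    if not words:
        return 0.0
    return sum(word.isascii() for word in words) / len(words)


def rule_reward(response: str, expected: int, language_weight: float = 0.1) -> dict[str, float]:
    match = RESPONSE.match(response)
    if match is None:
        # Fail closed: an unparseable structure earns nothing at all.
        return {"format": 0.0, "accuracy": 0.0, "language": 0.0, "total": 0.0}
    reasoning, answer = match.groups()
    accuracy = 1.0 if int(answer) == expected else 0.0
    language = language_consistency(reasoning)
    # Accuracy dominates; the language term is a small additive bonus.
    total = accuracy + language_weight * language
    return {"format": 1.0, "accuracy": accuracy, "language": language, "total": total}


clean = r"<think>14 times 12 is 168</think> \boxed{168}"
mixed = r"<think>14 times 12 等于 168</think> \boxed{168}"
print(rule_reward(clean, 168))
print(rule_reward(mixed, 168))
print(rule_reward("168", 168))
```

Both parseable responses earn the accuracy reward; the mixed-language trace earns a slightly smaller total. Keeping the language weight small relative to accuracy is a design choice in this sketch, so a fluent wrong answer can never outscore a correct one.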

For rule-graded prompts, a response that fails the checked outcome doesn't receive the accuracy reward no matter how persuasive its prose is. That still leaves coverage risk: a pass can reflect a shortcut if the verifier or evaluation set fails to test it.

Stage 3: rejection sampling and SFT

Once the RL policy is producing higher-scoring traces, they can become new demonstrations. DeepSeek-R1 used rejection sampling, retaining checked-correct responses for rule-gradable data and using DeepSeek-V3 judgments for some expanded reasoning data.[2]

In the paper, that stage produced about 600k reasoning samples plus about 200k non-reasoning samples, for roughly 800k total.[2] This turns exploratory RL behavior into a stable supervised dataset and helps recover capabilities that pure reasoning RL doesn't optimize for, such as writing quality and everyday instruction following.
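For rule-gradable prompts, the filtering step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the `verify_boxed` checker, the `max_keep` cap, the deduplication, and the prompt/completion row format are all assumptions.

```python
import re


def verify_boxed(response: str, expected: int) -> bool:
    """Fail-closed check: exactly one boxed integer, equal to the expected value."""
    found = re.findall(r"\\boxed\{(-?\d+)\}", response)
    return len(found) == 1 and int(found[0]) == expected


def rejection_sample(prompt: str, expected: int, candidates: list[str], max_keep: int = 2) -> list[dict[str, str]]:
    """Keep up to max_keep distinct candidates that pass the verifier."""
    kept: list[dict[str, str]] = []
    seen: set[str] = set()
    for response in candidates:
        if response in seen:
            continue  # skip verbatim duplicates so the SFT set stays varied
        seen.add(response)
        if verify_boxed(response, expected):
            kept.append({"prompt": prompt, "completion": response})
        if len(kept) == max_keep:
            break
    return kept


prompt = "A warehouse has 14 shelves with 12 boxes each. How many boxes total?"
candidates = [
    r"14 * 12 = 148, so \boxed{148}.",
    r"14 * 12 = 168, so \boxed{168}.",
    r"14 * 12 = 168, so \boxed{168}.",
    r"10 * 12 = 120 and 4 * 12 = 48, so \boxed{168}.",
]
sft_rows = rejection_sample(prompt, 168, candidates)
print(len(sft_rows))
for row in sft_rows:
    print(row["completion"])
```

The wrong trace and the duplicate are dropped, leaving two distinct checked-correct demonstrations.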

Stage 4: final RL (all scenarios)

Pure reasoning RL can produce a model that's strong on benchmarks but rough around the edges. In the final stage, DeepSeek runs RL again, this time on a broader mix of scenarios.[2]

The final stage mixes two reward sources:

  1. Rule-based rewards to maintain strong reasoning capabilities.
  2. Reward models (RLHF) for helpfulness and harmlessness in general conversation.

This final stage is intended to combine reasoning performance with broader helpfulness and harmlessness objectives; those outcomes must still be measured rather than assumed.
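One way to picture a mixed-reward stage is a router that picks the reward source per prompt. The sketch below is hypothetical: the `Rollout` type, the exact-match rule, and `fake_preference_score` (a stand-in for a trained reward model) are invented for illustration and do not describe DeepSeek's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rollout:
    prompt: str
    response: str
    expected: Optional[str] = None  # present only for rule-gradable prompts


def route_reward(
    rollout: Rollout, preference_score: Callable[[str, str], float]
) -> tuple[str, float]:
    """Pick the reward source by prompt type and return (source, reward)."""
    if rollout.expected is not None:
        # Rule-gradable: exact match against the checked answer, fail closed.
        matched = rollout.response.strip() == rollout.expected
        return "rule", (1.0 if matched else 0.0)
    # Open-ended: defer to a learned scorer (a hypothetical callable here).
    return "reward_model", preference_score(rollout.prompt, rollout.response)


def fake_preference_score(prompt: str, response: str) -> float:
    """Stand-in for a trained reward model; returns a fixed placeholder score."""
    return 0.5


print(route_reward(Rollout("What is 14 * 12?", "168", expected="168"), fake_preference_score))
print(route_reward(Rollout("What is 14 * 12?", "148", expected="168"), fake_preference_score))
print(route_reward(Rollout("Draft a polite delay notice.", "We're sorry..."), fake_preference_score))
```

Logging the source next to the reward makes it possible to track checked accuracy and preference scores separately during training.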

Observed reasoning patterns

Outcome-checked RL doesn't label each reasoning step. In DeepSeek-R1-Zero, the authors report longer responses and visible patterns such as verification, reflection, and exploration of alternatives during training.[2] Treat those as observed output behaviors, not proof of an internal reasoning mechanism.

Self-verification patterns

An output may pause and check its work.

"Wait, let me double-check that calculation. $14 \times 12$ is $168$, not $148$. I made a mistake."

In a logistics setting, the same behavior appears when a model verifies a shipping quote:

"The subtotal is $47.50 and tax is 6%. 47.50 × 0.06 = 2.85, so the total should be $50.35. Let me confirm: 47.50 + 2.85 = 50.35. Correct."

DeepSeek-R1-Zero wasn't trained from step-level labels saying "write let me check here." Its paper reports a sharp increase in the use of "wait" during reflection later in training.[2] The cautious interpretation is that RL changed the distribution of generated traces and that some reflective-looking traces co-occurred with improved checked outcomes. An outcome-only verifier doesn't establish why those traces improved.

Backtracking patterns

An output may abandon one approach and try another.

"This approach using geometry seems too complex. Let me try using coordinate algebra instead."

This can look search-like on the surface, but the training loop is still autoregressive generation plus reward-weighted updates, not an explicit tree-search controller. A verifier credits the final checked outcome; it doesn't separately establish that the pivot was necessary.

Extended thought

In DeepSeek-R1-Zero, average response length jumped sharply during training alongside reported accuracy gains.[2] This connects RLVR to test-time compute because trained policies may emit longer traces on reasoning tasks. Longer traces aren't automatically better, though; their value must be measured against correctness, latency, and token cost.
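A simple way to weigh trace length against correctness is to report tokens spent per checked-correct answer. The records below are made-up numbers for two hypothetical policies; the metric names are illustrative.

```python
def trace_cost_report(records: list[tuple[int, bool]]) -> dict[str, float]:
    """Summarize (completion_tokens, passed_verifier) pairs."""
    if not records:
        raise ValueError("need at least one record")
    total_tokens = sum(tokens for tokens, _ in records)
    passes = sum(1 for _, passed in records if passed)
    return {
        "accuracy": passes / len(records),
        "mean_tokens": total_tokens / len(records),
        # Infinite cost when nothing passes, so the metric never flatters a failing run.
        "tokens_per_correct": total_tokens / passes if passes else float("inf"),
    }


# Made-up evaluation records, for illustration only.
short_policy = [(200, True), (220, False), (180, True), (210, False)]
long_policy = [(900, True), (1100, True), (950, True), (1050, False)]

for name, records in (("short", short_policy), ("long", long_policy)):
    report = trace_cost_report(records)
    print(name, {key: round(value, 1) for key, value in report.items()})
```

In this toy data the longer policy is more accurate (0.75 versus 0.50) but spends roughly three times as many tokens per correct answer, which is the trade-off a latency or cost budget has to settle.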

A useful self-check: suppose a model trained with outcome rewards starts saying "Wait, let me double-check that" before finalizing answers. What can you conclude from the output pattern, and what would require separate evaluation? The reward confirms correct final answers, not the necessity or faithfulness of the written reflection.

Research caution: It remains open how much RLVR creates new problem-solving behavior versus eliciting behavior already likely under the base model. Shao et al. found that random and format-only rewards substantially improved MATH-500 results for Qwen2.5-Math in their setup, while comparable spurious rewards gave little benefit or harmed Llama3 and OLMo2 variants.[9] They connect this result to GRPO clipping and model-specific high-prior behaviors. Practical takeaway: compare against weak or spurious-reward baselines and validate across model families and held-out tasks.

Reward hacking and failure modes

Any time you optimize a metric, a policy can exploit gaps in that metric. RLVR is no exception. If the verifier checks only final answers, a memorized answer, leaked target, or weak parser can receive reward without demonstrating general solution ability. DeepSeek-R1 explicitly identifies reliable reward construction as a limitation once tasks cannot be graded by dependable rules.[2]

Don't train against a verifier before adversarially testing it. Any malformed output or shortcut that earns reward can be reinforced by the optimization loop.
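An adversarial audit can be as small as a table of outputs that must never earn reward. The sketch below uses a hypothetical regex-based `verify` and an assumed contract in which nested boxes count as malformed; both are illustrative choices, not a standard.

```python
import re


def verify(response: str, expected: int) -> float:
    """Hypothetical checker: exactly one boxed integer equal to expected."""
    found = re.findall(r"\\boxed\{(-?\d+)\}", response)
    return 1.0 if len(found) == 1 and int(found[0]) == expected else 0.0


# Outputs that the assumed contract says must score 0.0.
MUST_FAIL = {
    "empty output": "",
    "unboxed answer": "The answer is 168.",
    "contradictory boxes": r"\boxed{168} or \boxed{148}",
    "trailing junk in box": r"\boxed{168abc}",
    "nested box": r"\boxed{\boxed{168}}",
}


def audit(verifier, cases: dict[str, str], expected: int) -> list[str]:
    """Return the names of must-fail cases that still earned reward."""
    return [name for name, text in cases.items() if verifier(text, expected) > 0.0]


leaks = audit(verify, MUST_FAIL, 168)
print(f"leaks={leaks}")
```

The audit flags the nested-box case: the regex matches the inner `\boxed{168}` and pays out. Whether that output should pass is a contract decision, but it is better made deliberately before training than discovered in the reward curve.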

Format gaming

A format-heavy reward can favor output wrappers over correctness. For example, if the verifier gives substantial credit for \boxed{...} before checking the value, it can reinforce neatly formatted wrong answers. A parser that accepts any matching line also risks rewarding output that contains several contradictory answers.

The symptom is high format compliance but low task accuracy. The cause is that the reward function prizes formatting before correctness. The fix is to make correctness dominant and fail closed on wrong or ambiguous contents, even if the wrapper is present.

This example compares a broken format-first reward with a correctness-gated version. A wrong boxed answer should never get more reward than an unboxed correct answer just because it is easy to parse.

format-reward-trap.py
```python
import re

def boxed_value(response: str) -> int | None:
    matches = re.findall(r"\\boxed\{(\d+)\}", response)
    return int(matches[0]) if len(matches) == 1 else None

def broken_reward(response: str, expected: int) -> float:
    value = boxed_value(response)
    return 0.7 if value is not None else (0.3 if str(expected) in response else 0.0)

def gated_reward(response: str, expected: int) -> float:
    value = boxed_value(response)
    return 1.0 if value == expected else 0.0

wrong_boxed = r"The answer is \boxed{148}."
right_plain = "The answer is 168."

print(f"broken_prefers_wrong={broken_reward(wrong_boxed, 168) > broken_reward(right_plain, 168)}")
print(f"gated_wrong_boxed={gated_reward(wrong_boxed, 168)}")
```
Format reward trap
```
broken_prefers_wrong=True
gated_wrong_boxed=0.0
```

Shortcut exploitation

If a training set has shortcuts (for example, one multiple-choice position is correct much more often), optimizing checked training reward can favor that shortcut. A code verifier that treats execution errors as passing outcomes creates an even more direct loophole.
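The execution-error loophole is easy to reproduce. In the sketch below, the only difference between the leaky and the fail-closed checker is the reward returned when candidate code raises. The `solve` entry point and the test cases are hypothetical, and in-process `exec` is used only to keep the example short; it is not isolation and offers no protection against hangs or hostile code.

```python
def code_reward(source: str, tests: list[tuple[tuple, object]], on_error: float) -> float:
    """Score candidate source that should define a function named solve."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only: exec is NOT a sandbox
        solve = namespace["solve"]
        passed = all(solve(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, a missing solve, and runtime crashes all land here.
        return on_error
    return 1.0 if passed else 0.0


tests = [((2, 3), 5), ((-1, 1), 0)]
correct = "def solve(a, b):\n    return a + b\n"
crashing = "def solve(a, b):\n    raise RuntimeError('boom')\n"

print("leaky  :", code_reward(crashing, tests, on_error=1.0))
print("closed :", code_reward(crashing, tests, on_error=0.0))
print("correct:", code_reward(correct, tests, on_error=0.0))
```

With `on_error=1.0`, a policy can earn full reward by emitting code that crashes immediately; with `on_error=0.0`, only code that runs and passes every test is paid.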

Mitigation strategies

To prevent policies from exploiting the verifier, several defensive practices help during training:

  • Explicit verifier contracts: Use symbolic checking (such as sympy) and isolated execution environments instead of brittle regex parsing. Ensure the verifier fails closed (returns 0 on any parsing error).
  • Difficulty calibration: Include prompts where rollouts produce both passing and failing outputs often enough for group-relative learning; all-zero groups contribute no comparative advantage.
  • Held-out tests: Evaluate on independent problems and, when possible, an independently implemented checker so a parser shortcut isn't mistaken for generalization.
  • Length controls: Cap rollout length or add carefully tuned penalties so the model doesn't learn to think forever without improving correctness.
[Figure: RLVR verifier pressure map showing rollouts flowing through extraction, checking, and policy update, with scoped verification measuring checked outcomes while weak verification rewards shortcuts.]
RLVR pushes on whatever the verifier measures. A scoped checker can prove selected outcomes; a weak checker can reward shortcuts.
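As one concrete instance of the length-control item above, the sketch below combines a hard cap with a soft per-token penalty. All thresholds and the penalty rate are placeholder values chosen for illustration; they would need tuning against measured accuracy and cost.

```python
def length_controlled_reward(
    correct: bool,
    num_tokens: int,
    soft_limit: int = 2048,   # placeholder: penalty starts above this length
    max_tokens: int = 4096,   # placeholder: hard cap on rollout length
    penalty_per_token: float = 1e-4,
) -> float:
    """Correctness reward minus a mild penalty for tokens beyond soft_limit."""
    if num_tokens > max_tokens:
        return 0.0  # over the hard cap: treat the rollout as a failure
    base = 1.0 if correct else 0.0
    overflow = max(0, num_tokens - soft_limit)
    # Clamp at zero so wrong answers are never pushed below the failure reward.
    return max(0.0, base - penalty_per_token * overflow)


for tokens in (1000, 3048, 4096, 5000):
    print(tokens, round(length_controlled_reward(True, tokens), 4))
print("wrong:", length_controlled_reward(False, 1000))
```

With these placeholder values the penalty tops out near 0.2 at the cap, so a long correct answer still outscores any wrong answer. Correctness stays the dominant term.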

Common pitfalls

Assuming RLVR works for any task. The symptom is a reward function that secretly depends on taste, safety judgment, or fuzzy grader text. The cause is forcing an objective RL method onto a subjective task. The fix is to ask whether a deterministic checker, theorem prover, unit test suite, database constraint, or schema validator can grade the output. If not, use RLHF, DPO, Reinforcement Learning from AI Feedback (RLAIF), or decompose the task into smaller verifiable checks.

Treating cold-start SFT as either mandatory or useless. DeepSeek-R1-Zero improved reported reasoning-task results from DeepSeek-V3-Base without a preliminary SFT phase.[2] DeepSeek-R1 still used cold-start SFT because readability and language consistency mattered. The fix is to measure the base model's initial pass rate and output quality. If almost every rollout gets zero reward or the traces are unreadable, cold-start data may make RL more tractable.

Celebrating format accuracy. A verifier that rewards \boxed{} too strongly can teach box-writing instead of problem-solving. The fix is to keep correctness dominant, run adversarial parser tests, and track real task accuracy separately from format compliance.

Ignoring general capability drift. A model trained heavily on math or code RLVR can become better at those tasks while getting worse at normal instruction following. The fix is to use KL anchoring, mix in general instruction data during later SFT stages, and run broad evaluations such as writing, safety, and everyday chat alongside math or code metrics.

This release gate catches a reasoning gain that comes with an unacceptable support-quality regression. The numbers are illustrative; each product should define thresholds before training.

capability-regression-gate.py
```python
baseline = {"checked_reasoning": 0.61, "support_helpfulness": 0.92, "false_refusal": 0.04}
candidate = {"checked_reasoning": 0.74, "support_helpfulness": 0.81, "false_refusal": 0.13}

violations = []
if candidate["checked_reasoning"] <= baseline["checked_reasoning"]:
    violations.append("no reasoning gain")
if candidate["support_helpfulness"] < baseline["support_helpfulness"] - 0.03:
    violations.append("support helpfulness regressed")
if candidate["false_refusal"] > baseline["false_refusal"] + 0.02:
    violations.append("false refusals increased")

print(f"reasoning_gain={candidate['checked_reasoning'] - baseline['checked_reasoning']:+.0%}")
print(f"ship={not violations}")
print(violations)
```
Capability gate output
```
reasoning_gain=+13%
ship=False
['support helpfulness regressed', 'false refusals increased']
```

Treating RLVR and distillation as competing choices. RLVR can improve checked outcomes for a teacher; distillation transfers sampled teacher behavior into a smaller model. The fix is to decide which bottleneck you have: if the teacher fails verifiable evaluations, train against better checks; if a capable teacher is too expensive to serve, consider distillation.

A tiny verifier lab

The best way to understand RLVR is to build a verifier yourself. You don't need a GPU cluster or a billion-parameter model. A Python script and a small set of arithmetic problems are enough to see the mechanics.

Checkpoint: Before building the verifier, you should be able to state the exact contract: require one answer field, extract one number, compare with tolerance, and fail closed on missing, duplicated, or malformed fields. If that contract feels vague, re-read the math verifier example above.

Exercise: Write a verifier that takes a model's raw text output and returns 1.0 if the answer is correct and 0.0 otherwise. Use the following prompt and ground-truth pairs:

| Prompt | Ground Truth |
| --- | --- |
| "A warehouse has 14 shelves with 12 boxes each. How many boxes total?" | 168 |
| "A delivery truck carries 8 packages per trip. How many trips for 96 packages?" | 12 |
| "An order subtotal is $47.50 with a 6% tax. What's the total?" | 50.35 |

Step 1: Implement extract_answer(text: str) -> str | None that accepts exactly one \boxed{...} answer field. Reject missing or duplicated fields rather than guessing which number the model intended as final.

Step 2: Implement verify(prompt: str, response: str, ground_truth: float) -> float that extracts the predicted answer, compares it to the ground truth with a small tolerance (e.g., abs(predicted - expected) < 1e-3), and returns 1.0 or 0.0.

Step 3: Test your verifier against these three model outputs for the first prompt:

  1. "There are 14 shelves and 12 boxes each. 14 × 12 = 168. The answer is \boxed{168}."
  2. "The total is \boxed{148}."
  3. "It might be \boxed{168} or \boxed{148}."

Expected results: output 1 should return 1.0. Outputs 2 and 3 should return 0.0. The third response contains the correct value, but its final answer is ambiguous.

Here is a minimal solution you can extend:

a-tiny-verifier-lab.py
```python
import re

def extract_answer(text: str) -> str | None:
    boxed = re.findall(r"\\boxed\{(-?\d+(?:\.\d+)?)\}", text)
    return boxed[0] if len(boxed) == 1 else None

def verify(prompt: str, response: str, ground_truth: float) -> float:
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    try:
        predicted = float(answer)
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - ground_truth) < 1e-3 else 0.0

prompt = "A warehouse has 14 shelves with 12 boxes each. How many boxes total?"
correct_score = verify(prompt, r"There are 14 shelves and 12 boxes each. 14 * 12 = 168. The answer is \boxed{168}.", 168)
wrong_score = verify(prompt, r"The total is \boxed{148}.", 168)
ambiguous_score = verify(prompt, r"It might be \boxed{168} or \boxed{148}.", 168)

print(correct_score)
print(wrong_score)
print(ambiguous_score)
print(f"correct_passes={correct_score == 1.0}")
print(f"wrong_rejected={wrong_score == 0.0}")
print(f"ambiguous_rejected={ambiguous_score == 0.0}")
```
Output
```
1.0
0.0
0.0
correct_passes=True
wrong_rejected=True
ambiguous_rejected=True
```

Step 4 (optional): Add a format reward. Give +0.2 for using \boxed{} correctly, and +0.8 for a correct answer inside the box. What behavior does this mixed reward encourage? Does the model still get a positive reward if the box is present but the answer is wrong?

DeepSeek-R1-Zero combined accuracy rewards with format rewards targeting a parseable response structure.[2] Mixed rewards require careful testing so formatting doesn't dominate correctness.
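For readers who want to check their Step 4 answer, here is one possible reading of the +0.2/+0.8 split. The function name `mixed_reward` and the choice to pay the format bonus for any single well-formed box are assumptions of this sketch.

```python
import re


def mixed_reward(response: str, expected: float) -> float:
    """+0.2 for exactly one well-formed box, +0.8 more if its value is correct."""
    boxed = re.findall(r"\\boxed\{(-?\d+(?:\.\d+)?)\}", response)
    if len(boxed) != 1:
        return 0.0  # missing or duplicated box: no format bonus, no accuracy
    format_bonus = 0.2
    accuracy_bonus = 0.8 if abs(float(boxed[0]) - expected) < 1e-3 else 0.0
    return format_bonus + accuracy_bonus


print(mixed_reward(r"The answer is \boxed{168}.", 168))
print(mixed_reward(r"The total is \boxed{148}.", 168))
print(mixed_reward("The answer is 168.", 168))
```

Under this reading a wrong boxed answer still earns 0.2, while a correct unboxed answer earns nothing. That is the pressure to watch: if correct answers are rare early in training, the format bonus can become the main source of reward.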

RLVR and distillation

RLVR and distillation aren't mutually exclusive. DeepSeek-R1 used RL in its teacher pipeline, then fine-tuned smaller dense models on 800k generated training samples.[2] The systems distinction is clean: RLVR optimizes a policy against checks, while distillation trains a student from teacher outputs.

| Aspect | RLVR | Distillation |
| --- | --- | --- |
| Training signal | Online reward from a verifier | Offline supervision from teacher outputs |
| What it optimizes | Checked success under a specified contract | Imitation of teacher behavior |
| Compute profile | Online sampling, verification, and RL updates | Supervised training over collected traces |
| Dependency | Needs a reliable verifier | Needs a useful teacher and clean trace data |
| Best use | Improve checked outcomes for selected tasks | Transfer teacher behavior into smaller models |
| Main failure mode | Reward hacking or sparse-credit collapse | Student inherits teacher blind spots and data coverage limits |

DeepSeek-R1 makes this trade-off concrete. The paper reports that distillation into smaller Qwen and Llama models outperformed its reported RL experiment on smaller models in the evaluated setting.[2] That is evidence for testing distillation when a strong teacher exists, not a universal ranking of the two methods.

RLVR still matters because distillation doesn't optimize the student online against a verifier. If the teacher's checked outcomes are insufficient, improving that policy against well-tested verifiers is a different operation from transferring its sampled outputs.

Follow-up questions

Why might a lab train a teacher with RLVR before distilling it?

RLVR can raise the teacher's checked success rate through online sampling and reward. Once a teacher is strong on the relevant evaluations, distillation can turn sampled traces into supervised data for smaller models.

How would you design a verifier beyond math and code?

Start with structured outputs. A logistics model can emit JSON for route, capacity, cost, and cutoff time; the verifier can check schema validity, arithmetic, inventory constraints, and route feasibility. For retrieval or support tasks, a verifier might check cited order IDs against a database, confirm policy-section references, or reject answers that don't include required fields. The reward can be shaped, but every component should still fail closed.
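A structured-output verifier along these lines might look like the sketch below. The field names, the `KNOWN_STOPS` network, the capacity rule, and the one-cent tolerance are all hypothetical; a real system would take them from its own schema and data.

```python
import json

KNOWN_STOPS = {"DEPOT", "HUB-A", "HUB-B", "STORE-7"}  # hypothetical network
FIELDS = {"route", "packages", "unit_cost", "total_cost"}


def is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def verify_plan(raw: str, capacity: int) -> float:
    """Return 1.0 only if every schema, feasibility, and arithmetic check passes."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(plan, dict) or set(plan) != FIELDS:
        return 0.0
    route, packages = plan["route"], plan["packages"]
    if not (isinstance(route, list) and len(route) >= 2
            and all(isinstance(stop, str) and stop in KNOWN_STOPS for stop in route)):
        return 0.0
    if not (isinstance(packages, int) and not isinstance(packages, bool)
            and 0 < packages <= capacity):
        return 0.0
    if not (is_number(plan["unit_cost"]) and is_number(plan["total_cost"])):
        return 0.0
    expected_total = packages * plan["unit_cost"]
    return 1.0 if abs(plan["total_cost"] - expected_total) < 0.01 else 0.0


good = json.dumps({"route": ["DEPOT", "HUB-A", "STORE-7"], "packages": 96,
                   "unit_cost": 1.25, "total_cost": 120.0})
bad_math = json.dumps({"route": ["DEPOT", "STORE-7"], "packages": 96,
                       "unit_cost": 1.25, "total_cost": 100.0})
print(verify_plan(good, capacity=100))
print(verify_plan(bad_math, capacity=100))
print(verify_plan("route: DEPOT to STORE-7", capacity=100))
```

Every early return is a zero, so a missing field, an unknown stop, or a wrong total cannot be offset by getting the other parts right.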

What happens to general conversation ability during RLVR?

It can degrade if training only rewards narrow reasoning tasks. Mitigations include reference-policy KL, broad SFT data after rejection sampling, final preference-style alignment for helpfulness and harmlessness, and a regression suite that includes normal chat, writing, safety, and instruction following.

Could RLVR handle open-ended tasks with a strong enough verifier?

Sometimes. If you can reduce the task to objective properties, such as "all facts cite matching database rows" or "all required form fields are valid," RLVR can optimize those pieces. When the core quality judgment remains subjective, the setup becomes closer to RLHF or RLAIF than classic RLVR.

Mastery check

Key concepts

  • Define RLVR as reinforcement learning against programmatic verification, not learned human-preference scoring.
  • Decide whether a task belongs in RLVR, RLHF, DPO, Reinforcement Learning from AI Feedback (RLAIF), or a decomposed hybrid.
  • Explain why GRPO can remove the critic by comparing samples from the same prompt group.
  • Describe the self-verification, backtracking, and longer-response patterns observed during rule-reward RL without overstating what outcome reward proves.
  • Debug a verifier that creates format gaming, shortcut exploitation, sparse-reward collapse, or general-skill drift.
  • Explain why a training pipeline may use RLVR to improve checked teacher outcomes and distillation to make useful behavior cheaper to serve.

Evaluation rubric

  • Strong: clearly separates RLVR from RLHF, DPO, and distillation by naming the supervision source and bottleneck for each.
  • Strong: explains verifier design as a fail-closed contract, not a loose regex or grading heuristic.
  • Strong: can derive GRPO group-relative advantage and explain why weak reward spread or weak verifiers break learning.
  • Strong: treats visible self-verification and backtracking as observed generation patterns, not proof that an outcome verifier taught a general reasoning module.
  • Weak: assumes any subjective task can be forced into RLVR or confuses format accuracy with real task success.
  • Weak: treats RLVR and distillation as substitutes instead of teacher-improvement versus teacher-compression stages.

Common pitfalls

  • Rewarding output wrappers more than correctness. Symptom: nice \boxed{} formatting with bad answers. Fix: keep the verifier fail-closed and make correctness dominate format bonuses.
  • Using RLVR on tasks that still need taste or policy judgment. Symptom: noisy verifier rules that secretly act like a brittle human preference model. Fix: switch to RLHF, DPO, or Reinforcement Learning from AI Feedback (RLAIF), or split the task into smaller verifiable checks.
  • Ignoring broad capability drift after narrow reasoning RL. Symptom: math improves while normal chat or writing gets worse. Fix: keep KL anchoring, mix broader SFT data, and run non-reasoning regressions.

Verifiable rewards can improve checked outcomes in a teacher; distillation asks how to transfer useful teacher behavior into a smaller, cheaper model.

References

[1] Lambert, N., et al. Tülu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv preprint, 2024.

[2] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.

[3] Ouyang, L., et al. Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.

[4] Rafailov, R., et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023.

[5] Cobbe, K., et al. Training Verifiers to Solve Math Word Problems (GSM8K). 2021.

[6] Lightman, H., et al. Let's Verify Step by Step. ICLR, 2023.

[7] Shao, Z., et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.

[8] Schulman, J., et al. Proximal Policy Optimization Algorithms. 2017.

[9] Shao, R., et al. Spurious Rewards: Rethinking Training Signals in RLVR. 2025.