Understand RLVR, a post-training approach that uses programmatic verification instead of learned human-preference rewards to improve checked outcomes in math, code, and other contract-driven tasks.
Constitutional AI showed how written policy can guide AI feedback for behavior that still needs judgment. RLVR moves to a narrower setting: tasks where a program can check a concrete outcome.
Reinforcement Learning with Verifiable Rewards (RLVR) trains models from rewards assigned by executable checks rather than a learned preference model.[1] This chapter focuses on tasks where an answer, program, or structured action can be tested against a precise contract.
Imagine evaluating two model outputs. One drafts brand copy for a product page; feedback is subjective. The other computes whether a warehouse can ship 1,200 orders before a carrier cutoff; a calculator can check its arithmetic once the constraints are specified. The first requires judgment. The second has a verifiable component. Tulu 3 introduced the term RLVR for this type of post-training stage, and DeepSeek-R1 used rule-based rewards for reasoning tasks inside a broader multi-stage pipeline.[1][2]
To understand where RLVR fits, we need to look at how Large Language Models (LLMs) are traditionally trained to follow instructions.
Most instruction-following pipelines use Supervised Fine-Tuning (SFT), where the model imitates selected demonstrations. Think of SFT like showing a support model how an expert handles a damaged-item return, step by step. It teaches patterns present in those examples; it doesn't directly reward a newly sampled solution for satisfying an external checker.
One post-training route is Reinforcement Learning from Human Feedback (RLHF)[3], where a reward model is trained to predict preference labels. That signal fits qualities such as tone or helpfulness, but depends on label collection and reward-model validity outside its training examples.
Direct Preference Optimization (DPO)[4] removes the online reward-model optimization loop and trains directly from chosen/rejected pairs. Those pairs can come from humans, constitutions, or other pipelines, but they are still preference comparisons rather than executed correctness checks.
RLVR takes a different path. It specializes in tasks where success can be programmatically verified, such as final-answer math, code under tests, or formally checked proofs. It avoids collecting a preference label for each rollout, but its validity is only as strong as the specification, parser, test suite, and that assign reward.
| Method | Reward Signal | Scalability | Task Scope |
|---|---|---|---|
| SFT | Demonstration data | Limited by demonstration supply | Any task with demonstrations |
| RLHF | Learned reward from preference labels | Requires labels and reward-model checks | Open-ended behavior goals |
| DPO | Chosen/rejected pairs | Avoids online reward-model RL | Tasks expressible as preferences |
| RLVR | Executable verification | Repeatable if verifier is cheap | Outcomes covered by a verifier |
This is the central trade: RLVR gives up breadth to get a directly executable signal. In system design, the first question isn't "Which alignment method sounds modern?" It's "What behavior can this checker prove?" A logistics lab might verify route capacity and cutoff-time constraints by code. It still needs judgment for whether a delivery promise is acceptable to a customer.
In the simplest outcome-only setup, a verifiable reward function takes a problem and a generated response and returns a binary signal:
To see why this matters, imagine a prompt that asks: "Is 97 a prime number? Format your answer as or ."
The important limitation is visible immediately: for this prompt, Iterations 1 and 3 get the same reward. An outcome verifier does not tell the optimizer which correct answer used valid reasoning. Across many problems, reasoning patterns that correlate with correct outcomes may become more likely, but that is an empirical training result, not information contained in a single final-answer reward.
The following tiny check makes that limitation concrete. It verifies only the boxed final verdict, so an unsupported guess and a valid divisibility check receive identical reward.
1import re
2
3def verdict_reward(response: str, expected: str) -> float:
4 matches = re.findall(r"\\boxed\{(Yes|No)\}", response)
5 return 1.0 if matches == [expected] else 0.0
6
7rollouts = {
8 "lucky_reason": r"It ends in 7, so it must be prime. \boxed{Yes}",
9 "valid_check": r"Check divisors at most sqrt(97): 2, 3, 5, 7 fail. \boxed{Yes}",
10 "wrong_answer": r"97 is divisible by 3. \boxed{No}",
11}
12
13for name, response in rollouts.items():
14 print(f"{name}: reward={verdict_reward(response, 'Yes')}")1lucky_reason: reward=1.0
2valid_check: reward=1.0
3wrong_answer: reward=0.0Many RLVR systems start with an all-or-nothing outcome signal because a deterministic checker can compute it repeatedly. Others add rule-based terms, so reward remains executable without being purely binary. The trade-off is sparse credit assignment: if a response makes one late arithmetic error, a final-answer checker gives zero even if earlier steps were useful.
Key insight: Verifiable means checkable, not complete. A warehouse inventory checker can prove whether the reported total matches its inputs; it can't prove that a correct total came from a general strategy unless the evaluation tests that generalization.
In practice, systems often mix correctness rewards with other rule-based terms. DeepSeek-R1-Zero, for example, combined accuracy rewards with format rewards so outputs stayed easy to parse and verify.[2]
The simplest form is outcome supervision: score the final answer. An RLVR setup can do this with a rule-based checker. Related work has also trained learned verifiers to score final answers.[5] This gives sparse feedback.
Process supervision scores intermediate steps instead of only the final result. Lightman et al. compare outcome and process reward models trained with human labels on mathematical reasoning steps; that is adjacent to RLVR, not itself proof that each process score is programmatically verifiable.[6] A formal system can instead run a checker after each proof step. Either version offers more localized credit, but requires a trustworthy step-level signal.
This flowchart visualizes the difference between outcome and process supervision, highlighting the trade-off between scalability and signal quality:
Caption: Outcome supervision scores the final answer. Process supervision supplies step-level feedback, either from labels or from a checker when one exists, at higher construction cost.
We check if the final answer matches the ground truth, even if the reasoning path is different. The verifier below extracts exactly one boxed expression from the model's response and uses sympy to verify mathematical equivalence. This handles cases like versus correctly, while rejecting missing or ambiguous answer fields.
The following implementation is small enough to run locally, but it still captures the core production contract: extract the final answer, compare it symbolically, and fail closed on parsing errors. Production math verifiers add more normalization, timeouts, and adversarial parser tests.
1import sympy as sp
2
3def extract_boxed_exprs(text: str) -> list[str]:
4 marker = "\\boxed{"
5 expressions: list[str] = []
6 offset = 0
7 while (start := text.find(marker, offset)) != -1:
8 depth = 0
9 expr_chars: list[str] = []
10 for ch in text[start + len(marker):]:
11 if ch == "{":
12 depth += 1
13 expr_chars.append(ch)
14 elif ch == "}":
15 if depth == 0:
16 expressions.append("".join(expr_chars).strip())
17 offset = start + len(marker) + len(expr_chars) + 1
18 break
19 depth -= 1
20 expr_chars.append(ch)
21 else:
22 expr_chars.append(ch)
23 else:
24 return []
25 return expressions
26
27def math_verifier(problem: str, answer: str, ground_truth: str) -> float:
28 """Verify mathematical equivalence of the final answer."""
29 predicted_fields = extract_boxed_exprs(answer)
30 expected_fields = extract_boxed_exprs(ground_truth)
31 if len(predicted_fields) != 1 or len(expected_fields) != 1:
32 return 0.0
33 try:
34 predicted = sp.sympify(predicted_fields[0])
35 expected = sp.sympify(expected_fields[0])
36 return 1.0 if sp.simplify(predicted - expected) == 0 else 0.0
37 except (TypeError, ValueError, sp.SympifyError):
38 return 0.0
39
40equivalent_score = math_verifier("Compute half of one.", r"The answer is \boxed{0.5}.", r"\boxed{1/2}")
41wrong_score = math_verifier("Compute 12 times 14.", r"\boxed{148}", r"\boxed{168}")
42ambiguous_score = math_verifier("Compute 12 times 14.", r"Maybe \boxed{168} or \boxed{148}.", r"\boxed{168}")
43print(equivalent_score)
44print(wrong_score)
45print(ambiguous_score)
46print(f"equivalent_passes={equivalent_score == 1.0}")
47print(f"wrong_answer_rejected={wrong_score == 0.0}")
48print(f"ambiguous_answer_rejected={ambiguous_score == 0.0}")11.0
20.0
30.0
4equivalent_passes=True
5wrong_answer_rejected=True
6ambiguous_answer_rejected=TrueRunning generated code against a strong suite of hidden test cases checks concrete behavior. It doesn't prove full correctness unless the test suite is complete, but it is stronger than judging code by surface form. The local example below runs candidate code in a separate Python process and times out slow attempts. It isn't a security sandbox; production systems still need containers, seccomp, filesystem isolation, network controls, and resource limits.
1import json
2import subprocess
3import sys
4from dataclasses import dataclass
5
6@dataclass
7class TestCase:
8 inputs: tuple[int, ...]
9 expected: int
10
11def code_verifier(problem: str, code: str, test_cases: list[TestCase]) -> float:
12 """Execute candidate in a separate process and fail closed."""
13 cases = [(case.inputs, case.expected) for case in test_cases]
14 runner = (
15 code
16 + "\nimport json\n"
17 + f"cases = {cases!r}\n"
18 + "passed = all(solve(*inputs) == expected for inputs, expected in cases)\n"
19 + "print(json.dumps({'passed': passed}))\n"
20 )
21 try:
22 completed = subprocess.run(
23 [sys.executable, "-I", "-c", runner],
24 capture_output=True,
25 text=True,
26 timeout=1.0,
27 check=False,
28 )
29 except subprocess.TimeoutExpired:
30 return 0.0
31 if completed.returncode != 0:
32 return 0.0
33 try:
34 result = json.loads(completed.stdout)
35 except json.JSONDecodeError:
36 return 0.0
37 return 1.0 if result == {"passed": True} else 0.0
38
39cases = [TestCase((14, 12), 168), TestCase((8, 12), 96)]
40good = "def solve(a, b):\n return a * b\n"
41bad = "def solve(a, b):\n return a + b\n"
42
43good_score = code_verifier("multiply two integers", good, cases)
44bad_score = code_verifier("multiply two integers", bad, cases)
45print(good_score)
46print(bad_score)
47print(f"good_passes={good_score == 1.0}")
48print(f"bad_rejected={bad_score == 0.0}")11.0
20.0
3good_passes=True
4bad_rejected=TrueVisible tests aren't enough when the policy can memorize examples or infer shortcuts. In this miniature check, a candidate passes the two examples shown during development but fails a held-out case. A reward based only on visible tests would reinforce the wrong program.
1from collections.abc import Callable
2
3def shortcut_multiplier(a: int, b: int) -> int:
4 known = {(14, 12): 168, (8, 12): 96}
5 return known.get((a, b), 0)
6
7def pass_rate(fn: Callable[[int, int], int], cases: list[tuple[int, int, int]]) -> float:
8 passed = sum(fn(a, b) == expected for a, b, expected in cases)
9 return passed / len(cases)
10
11visible = [(14, 12, 168), (8, 12, 96)]
12held_out = [(7, 9, 63), (11, 13, 143)]
13
14print(f"visible_pass_rate={pass_rate(shortcut_multiplier, visible):.0%}")
15print(f"held_out_pass_rate={pass_rate(shortcut_multiplier, held_out):.0%}")
16print(f"ship={pass_rate(shortcut_multiplier, held_out) == 1.0}")1visible_pass_rate=100%
2held_out_pass_rate=0%
3ship=FalseProof assistants such as Lean or Isabelle can check that a candidate proof term satisfies a formal theorem under their kernel and imported definitions. In production, a verifier would wrap the theorem statement and candidate proof into a Lean script, then return 1.0 only if the prover accepts the full artifact.
The prover call itself is infrastructure-specific: it depends on your Lean version, package cache, sandbox, and timeout policy. Keep the reward conversion boring and testable, and keep the theorem-prover runner behind a clear boundary.
1from dataclasses import dataclass
2
3@dataclass(frozen=True)
4class ProverResult:
5 accepted: bool
6 stderr: str
7
8def proof_reward(result: ProverResult) -> float:
9 """Convert a theorem prover result into a verifiable reward."""
10 return 1.0 if result.accepted else 0.0
11
12accepted = proof_reward(ProverResult(accepted=True, stderr=""))
13rejected = proof_reward(ProverResult(accepted=False, stderr="unknown identifier"))
14print(accepted)
15print(rejected)
16print(f"accepted_maps_to_one={accepted == 1.0}")
17print(f"rejected_maps_to_zero={rejected == 0.0}")11.0
20.0
3accepted_maps_to_one=True
4rejected_maps_to_zero=TrueRLVR is powerful but narrow. It works best when the task has an objective spec and a verifier that fails closed. That rules out many open-ended tasks, or at least forces you to break them into smaller verifiable sub-problems. It's not a clean fit for:
DeepSeek-R1 used Group Relative Policy Optimization (GRPO), an algorithm introduced in DeepSeekMath[7] that avoids a learned value model by estimating advantages from a group of samples for the same prompt. DeepSeek-R1 later used GRPO in its reasoning-training stages.[2]
In standard Proximal Policy Optimization (PPO)[8], we need a Critic model (or Value Function) that predicts "how good is this state?". The Critic estimates the expected future reward .
In the terminal-reward setting common to LLM training (where the reward is observed only at the end of a sequence), the advantage simplifies to:
Where is the final reward and is the critic's value estimate. More generally, the advantage uses a temporal-difference error , often computed via GAE (Generalized Advantage Estimation) over the full sequence.
Training a critic alongside an LLM adds model memory and compute. With terminal rewards, it must also predict eventual success from partial generations, which becomes difficult when a long response can later revise an earlier mistake.[2]
GRPO eliminates the separate Critic model entirely. Instead of learning a value function that tries to predict how good any reasoning trajectory is, GRPO computes advantages relative to the other samples generated for the exact same prompt in the current batch.
The intuition is direct for verifiable tasks. For a given math problem or code task, sample several outputs. Some reach the checked outcome and receive high reward; others fail and receive low reward. The successful outputs are better than average for this sampled group. You don't need a learned critic that estimates difficulty across prompts; you do need reward variation inside each group.
Mathematically, for a prompt you sample a group of outputs . The advantage for output is the z-score of its reward within that group:
Positive advantages push the policy toward those trajectories; negative advantages push it away. Local normalization removes the separate critic-estimation problem and compares attempts at the same prompt difficulty. It doesn't guarantee stable training: if all sampled outputs receive the same reward, the group contributes no comparative signal.
Where is sample 's reward and normalization is done within the sampled group of size for the same prompt.
Here is a concrete example with samples for a single math prompt. The verifier returns 1.0 for a correct final answer and 0.0 for an incorrect one:
| Sample | Answer | Reward | Group Mean | Group Std | Advantage |
|---|---|---|---|---|---|
| 1 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
| 2 | 148 | 0.0 | 0.5 | 0.5 | -1.0 |
| 3 | 156 | 0.0 | 0.5 | 0.5 | -1.0 |
| 4 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
| 5 | 170 | 0.0 | 0.5 | 0.5 | -1.0 |
| 6 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
| 7 | 144 | 0.0 | 0.5 | 0.5 | -1.0 |
| 8 | 168 | 1.0 | 0.5 | 0.5 | +1.0 |
How to read the table: Four samples got the right answer (168), so their rewards are 1.0. Four got it wrong, so their rewards are 0.0. The group mean is , and the standard deviation is 0.5. Sample 1's advantage is . That positive advantage tells the optimizer to increase the probability of the that led to sample 1. Sample 2's advantage is , so the optimizer pushes down the probability of the tokens that led to sample 2.
Notice what the table hides as well as what it shows. Every correct answer gets the same positive advantage, and every wrong answer gets the same negative advantage. GRPO doesn't know why sample 1 was correct; it only knows that sample 1 was better than the average for this prompt. Over many prompts, the policy learns which reasoning patterns reliably produce above-average rewards.
At a high level, the update looks like a PPO-clipped objective with group-relative advantage and a KL penalty. The published GRPO objective is applied at generated-token level; this sequence-level sketch keeps the two ideas visible without reproducing all implementation detail:
Where is a sequence-level policy-ratio shorthand, is the normalized group advantage, is the clipping threshold, and controls the drift penalty toward the reference policy.
The Kullback-Leibler (KL) term penalizes drift from a reference policy. It can constrain movement toward a verifier exploit; it can't prove that the verifier represents the intended behavior.
One caveat falls straight out of the formula: if every sample in the group gets the same reward, the standard deviation is 0 and vanilla GRPO gets no learning signal from that group. That's one reason prompt difficulty, group size, and reward shaping matter in practice.
Common mistake: Saying that three equal rewards out of four collapse the GRPO signal. They don't: the one different outcome creates reward variance. Signal disappears when every sampled output gets the same reward, such as all failures on a prompt that is too hard or all passes on a prompt that is too easy. DeepSeek-R1 reports sampling 16 outputs per question in its reasoning RL setups.[2]
This diagnostic identifies which prompt groups supply comparative signal before an update. The mixed group is usable even though three out of four answers fail; the all-fail and all-pass groups carry no relative ranking information.
1import torch
2
3groups = {
4 "mixed": torch.tensor([0.0, 0.0, 0.0, 1.0]),
5 "all_fail": torch.tensor([0.0, 0.0, 0.0, 0.0]),
6 "all_pass": torch.tensor([1.0, 1.0, 1.0, 1.0]),
7}
8
9for name, rewards in groups.items():
10 std = float(rewards.std(unbiased=False))
11 informative = std > 0.0
12 print(f"{name}: std={std:.3f}, informative={informative}")1mixed: std=0.433, informative=True
2all_fail: std=0.000, informative=False
3all_pass: std=0.000, informative=FalseHere is a runnable scalar version of the GRPO math. It treats each sampled trajectory as one log-probability value, which is enough to test the group-normalized advantage, PPO-style clipping, and KL penalty. A production trainer does this at token level, batches generation across all samples, and stores the old-policy log probabilities during rollout.
1import torch
2
3def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
4 std = rewards.std(unbiased=False)
5 if float(std) == 0.0:
6 return torch.zeros_like(rewards)
7 return (rewards - rewards.mean()) / (std + 1e-8)
8
9def clipped_grpo_loss(
10 log_probs_new: torch.Tensor,
11 log_probs_old: torch.Tensor,
12 log_probs_ref: torch.Tensor,
13 advantages: torch.Tensor,
14 epsilon: float = 0.2,
15 beta: float = 0.01,
16) -> torch.Tensor:
17 ratios = torch.exp(log_probs_new - log_probs_old)
18 clipped_ratios = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
19 policy_loss = -torch.min(ratios * advantages, clipped_ratios * advantages).mean()
20
21 log_ratio_ref = log_probs_new - log_probs_ref
22 ratio_ref = torch.exp(log_ratio_ref)
23 kl = (ratio_ref * log_ratio_ref - ratio_ref + 1).mean()
24 return policy_loss + beta * kl
25
26rewards = torch.tensor([1, 0, 0, 1, 0, 1, 0, 1], dtype=torch.float32)
27advantages = group_advantages(rewards)
28advantages_match = bool(torch.allclose(advantages, torch.tensor([1, -1, -1, 1, -1, 1, -1, 1], dtype=torch.float32)))
29
30trajectory_log_probs = torch.nn.Parameter(torch.zeros(8))
31old_log_probs = torch.zeros(8)
32reference_log_probs = torch.zeros(8)
33
34optimizer = torch.optim.SGD([trajectory_log_probs], lr=0.1)
35loss = clipped_grpo_loss(trajectory_log_probs, old_log_probs, reference_log_probs, advantages)
36loss.backward()
37
38# Positive-advantage trajectories should be pushed up; negative ones down.
39grad_signs_ok = (
40 trajectory_log_probs.grad is not None
41 and trajectory_log_probs.grad[0] < 0
42 and trajectory_log_probs.grad[1] > 0
43)
44
45optimizer.step()
46step_direction_ok = trajectory_log_probs[0] > 0 and trajectory_log_probs[1] < 0
47
48print("advantages:", [round(float(x), 1) for x in advantages])
49print("updated log probs:", [round(float(x), 3) for x in trajectory_log_probs.detach()])
50print(f"advantages_match={advantages_match}")
51print(f"grad_signs_ok={bool(grad_signs_ok)}")
52print(f"step_direction_ok={bool(step_direction_ok)}")1advantages: [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
2updated log probs: [0.013, -0.013, -0.013, 0.013, -0.013, 0.013, -0.013, 0.013]
3advantages_match=True
4grad_signs_ok=True
5step_direction_ok=TrueGRPO is an optimizer, not a synonym for RLVR. Tülu 3 trained its RLVR stage with PPO, while DeepSeekMath introduced GRPO as a PPO variant and DeepSeek-R1 used GRPO in its reasoning stages.[1][7][2] Both optimizers can consume rule-based verifier rewards.
| Feature | PPO-style policy optimization | GRPO |
|---|---|---|
| Baseline | Predicted by a separate Critic model | Mean reward of the sampled group |
| Additional value-model cost | Trains a critic/value model | Avoids a separate critic |
| Main estimation risk | Critic must predict returns | Groups with equal rewards give no relative signal |
| Compatible reward signals | Learned or rule-based reward | Learned or rule-based reward, including binary checks |
Self-check: In the worked table above, what would happen to the advantages if all 8 samples got the same reward? Why does that make prompt difficulty and group size important in practice?
No need to hold a large Critic model in GPU memory.
The group mean is local to the current prompt. GRPO avoids training a critic to predict returns from partial reasoning traces, though it still depends on reward spread, a sound verifier, and well-tuned optimization.
Binary rewards make the group comparison easy to read: when a group contains both passes and failures, passes receive positive advantage and failures negative advantage. If all candidates pass or all fail, their advantages are zero. No value function is needed, but prompt selection and exploration must produce mixed outcomes often enough to learn.
The success of RLVR depends on the pipeline, not only the algorithm. DeepSeek-R1[2] used a specific multi-stage process to bootstrap reasoning.
This diagram summarizes DeepSeek-R1's reported multi-stage pipeline, where supervised fine-tuning and reinforcement learning alternate to improve checked reasoning outcomes, readability, and broader assistant behavior:
Caption: DeepSeek-R1's reported pipeline. Stage 1 targets readable cold-start behavior. Stage 2 applies GRPO with rule-based rewards on reasoning tasks, including a language-consistency reward. Stage 3 collects filtered traces into SFT data and mixes in broader instruction data. Stage 4 combines reasoning rewards with general-behavior reward models.
DeepSeek-R1-Zero showed that DeepSeek-V3-Base could improve on reported reasoning evaluations through rule-based RL without a preliminary SFT phase.[2] That result doesn't mean every base model or verifier will train usefully from zero-reward-heavy rollouts.
DeepSeek still added cold-start data for DeepSeek-R1 because R1-Zero had poor readability and mixed languages. The paper describes collecting thousands of long-CoT examples, filtering for a readable response pattern, and using them to initialize the model before RL.[2] In other projects, cold start also helps when the base model's initial pass rate is so low that almost every rollout gets zero reward.
This stage trains on math, code, and logic prompts using GRPO with rule-based rewards. In DeepSeek-R1-Zero, those rewards were accuracy plus format. In DeepSeek-R1, the first RL stage also added a language-consistency reward to reduce mixed-language chains of thought, with the paper reporting a slight performance trade-off in its ablation.[2]
For rule-graded prompts, a response that fails the checked outcome doesn't receive the accuracy reward no matter how persuasive its prose is. That still leaves coverage risk: a pass can reflect a shortcut if the verifier or evaluation set fails to test it.
Once the RL policy is producing higher-scoring traces, they can become new demonstrations. DeepSeek-R1 used rejection sampling, retaining checked-correct responses for rule-gradable data and using DeepSeek-V3 judgments for some expanded reasoning data.[2]
In the paper, that stage produced about 600k reasoning samples plus about 200k non-reasoning samples, for roughly 800k total.[2] This turns exploratory RL behavior into a stable supervised dataset and helps recover capabilities that pure reasoning RL doesn't optimize for, such as writing quality and everyday instruction following.
Pure reasoning RL can produce a model that's strong on benchmarks but rough around the edges. In the final stage, DeepSeek runs RL again, this time on a broader mix of scenarios.[2]
This uses a mix of:
This final stage is intended to combine reasoning performance with broader helpfulness and harmlessness objectives; those outcomes must still be measured rather than assumed.
Outcome-checked RL doesn't label each reasoning step. In DeepSeek-R1-Zero, the authors report longer responses and visible patterns such as verification, reflection, and exploration of alternatives during training.[2] Treat those as observed output behaviors, not proof of an internal reasoning mechanism.
An output may pause and check its work.
"Wait, let me double-check that calculation. is , not . I made a mistake."
In a logistics setting, the same behavior appears when a model verifies a shipping quote:
"The subtotal is $47.50 and tax is 6%. 47.50 × 0.06 = 2.85, so the total should be $50.35. Let me confirm: 47.50 + 2.85 = 50.35. Correct."
DeepSeek-R1-Zero wasn't trained from step-level labels saying "write let me check here." Its paper reports a sharp increase in the use of "wait" during reflection later in training.[2] The cautious interpretation is that RL changed the distribution of generated traces and that some reflective-looking traces co-occurred with improved checked outcomes. An outcome-only verifier doesn't establish why those traces improved.
An output may abandon one approach and try another.
"This approach using geometry seems too complex. Let me try using coordinate algebra instead."
This can look search-like on the surface, but the training loop is still autoregressive generation plus reward-weighted updates, not an explicit tree-search controller. A verifier credits the final checked outcome; it doesn't separately establish that the pivot was necessary.
In DeepSeek-R1-Zero, average response length jumped sharply during training alongside reported accuracy gains.[2] This connects RLVR to test-time compute because trained policies may emit longer traces on reasoning tasks. Longer traces aren't automatically better, though; their value must be measured against correctness, latency, and token cost.
A useful self-check: suppose a model trained with outcome rewards starts saying "Wait, let me double-check that" before finalizing answers. What can you conclude from the output pattern, and what would require separate evaluation? The reward confirms correct final answers, not the necessity or faithfulness of the written reflection.
Research caution: It remains open how much RLVR creates new problem-solving behavior versus eliciting behavior already likely under the base model. Shao et al. found that random and format-only rewards substantially improved MATH-500 results for Qwen2.5-Math in their setup, while comparable spurious rewards gave little benefit or harmed Llama3 and OLMo2 variants.[9] They connect this result to GRPO clipping and model-specific high-prior behaviors. Practical takeaway: compare against weak or spurious-reward baselines and validate across model families and held-out tasks.
Any time you optimize a metric, a policy can exploit gaps in that metric. RLVR is no exception. If the verifier checks only final answers, a memorized answer, leaked target, or weak parser can receive reward without demonstrating general solution ability. DeepSeek-R1 explicitly identifies reliable reward construction as a limitation once tasks cannot be graded by dependable rules.[2]
Don't train against a verifier before adversarially testing it. Any malformed output or shortcut that earns reward can be reinforced by the optimization loop.
A format-heavy reward can favor output wrappers over correctness. For example, if the verifier gives substantial credit for \boxed{...} before checking the value, it can reinforce neatly formatted wrong answers. A parser that accepts any matching line also risks rewarding output that contains several contradictory answers.
The symptom is high format compliance but low task accuracy. The cause is that the reward function prizes formatting before correctness. The fix is to make correctness dominant and fail closed on wrong or ambiguous contents, even if the wrapper is present.
This example compares a broken format-first reward with a correctness-gated version. A wrong boxed answer should never get more reward than an unboxed correct answer just because it is easy to parse.
1import re
2
3def boxed_value(response: str) -> int | None:
4 matches = re.findall(r"\\boxed\{(\d+)\}", response)
5 return int(matches[0]) if len(matches) == 1 else None
6
7def broken_reward(response: str, expected: int) -> float:
8 value = boxed_value(response)
9 return 0.7 if value is not None else (0.3 if str(expected) in response else 0.0)
10
11def gated_reward(response: str, expected: int) -> float:
12 value = boxed_value(response)
13 return 1.0 if value == expected else 0.0
14
15wrong_boxed = r"The answer is \boxed{148}."
16right_plain = "The answer is 168."
17
18print(f"broken_prefers_wrong={broken_reward(wrong_boxed, 168) > broken_reward(right_plain, 168)}")
19print(f"gated_wrong_boxed={gated_reward(wrong_boxed, 168)}")1broken_prefers_wrong=True
2gated_wrong_boxed=0.0If a training set has shortcuts (for example, one multiple-choice position is correct much more often), optimizing checked training reward can favor that shortcut. A code verifier that treats execution errors as passing outcomes creates an even more direct loophole.
To prevent policies from exploiting the verifier, several defensive practices help during training:
sympy) and isolated execution environments instead of brittle regex parsing. Ensure the verifier fails closed (returns 0 on any parsing error).
Assuming RLVR works for any task. The symptom is a reward function that secretly depends on taste, safety judgment, or fuzzy grader text. The cause is forcing an objective RL method onto a subjective task. The fix is to ask whether a deterministic checker, theorem prover, unit test suite, database constraint, or schema validator can grade the output. If not, use RLHF, DPO, Reinforcement Learning from AI Feedback (RLAIF), or decompose the task into smaller verifiable checks.
Treating cold-start SFT as either mandatory or useless. DeepSeek-R1-Zero improved reported reasoning-task results from DeepSeek-V3-Base without a preliminary SFT phase.[2] DeepSeek-R1 still used cold-start SFT because readability and language consistency mattered. The fix is to measure the base model's initial pass rate and output quality. If almost every rollout gets zero reward or the traces are unreadable, cold-start data may make RL more tractable.
Celebrating format accuracy. A verifier that rewards \boxed{} too strongly can teach box-writing instead of problem-solving. The fix is to keep correctness dominant, run adversarial parser tests, and track real task accuracy separately from format compliance.
Ignoring general capability drift. A model trained heavily on math or code RLVR can become better at those tasks while getting worse at normal instruction following. The fix is to use KL anchoring, mix in general instruction data during later SFT stages, and run broad evaluations such as writing, safety, and everyday chat alongside math or code metrics.
This release gate catches a reasoning gain that comes with an unacceptable support-quality regression. The numbers are illustrative; each product should define thresholds before training.
1baseline = {"checked_reasoning": 0.61, "support_helpfulness": 0.92, "false_refusal": 0.04}
2candidate = {"checked_reasoning": 0.74, "support_helpfulness": 0.81, "false_refusal": 0.13}
3
4violations = []
5if candidate["checked_reasoning"] <= baseline["checked_reasoning"]:
6 violations.append("no reasoning gain")
7if candidate["support_helpfulness"] < baseline["support_helpfulness"] - 0.03:
8 violations.append("support helpfulness regressed")
9if candidate["false_refusal"] > baseline["false_refusal"] + 0.02:
10 violations.append("false refusals increased")
11
12print(f"reasoning_gain={candidate['checked_reasoning'] - baseline['checked_reasoning']:+.0%}")
13print(f"ship={not violations}")
14print(violations)1reasoning_gain=+13%
2ship=False
3['support helpfulness regressed', 'false refusals increased']Treating RLVR and distillation as competing choices. RLVR can improve checked outcomes for a teacher; distillation transfers sampled teacher behavior into a smaller model. The fix is to decide which bottleneck you have: if the teacher fails verifiable evaluations, train against better checks; if a capable teacher is too expensive to serve, consider distillation.
The best way to understand RLVR is to build a verifier yourself. You don't need a GPU cluster or a billion-parameter model. A Python script and a small set of arithmetic problems are enough to see the mechanics.
Checkpoint: Before building the verifier, you should be able to state the exact contract: require one answer field, extract one number, compare with tolerance, and fail closed on missing, duplicated, or malformed fields. If that contract feels vague, re-read the math verifier example above.
Exercise: Write a verifier that takes a model's raw text output and returns 1.0 if the answer is correct and 0.0 otherwise. Use the following prompt and ground-truth pairs:
| Prompt | Ground Truth |
|---|---|
| "A warehouse has 14 shelves with 12 boxes each. How many boxes total?" | 168 |
| "A delivery truck carries 8 packages per trip. How many trips for 96 packages?" | 12 |
| "An order subtotal is $47.50 with a 6% tax. What's the total?" | 50.35 |
Step 1: Implement extract_answer(text: str) -> str | None that accepts exactly one \boxed{...} answer field. Reject missing or duplicated fields rather than guessing which number the model intended as final.
Step 2: Implement verify(prompt: str, response: str, ground_truth: float) -> float that extracts the predicted answer, compares it to the ground truth with a small tolerance (e.g., abs(predicted - expected) < 1e-3), and returns 1.0 or 0.0.
Step 3: Test your verifier against these three model outputs for the first prompt:
Expected results: output 1 should return 1.0. Outputs 2 and 3 should return 0.0. The third response contains the correct value, but its final answer is ambiguous.
Here is a minimal solution you can extend:
1import re
2
3def extract_answer(text: str) -> str | None:
4 boxed = re.findall(r"\\boxed\{(-?\d+(?:\.\d+)?)\}", text)
5 return boxed[0] if len(boxed) == 1 else None
6
7def verify(prompt: str, response: str, ground_truth: float) -> float:
8 answer = extract_answer(response)
9 if answer is None:
10 return 0.0
11 try:
12 predicted = float(answer)
13 except ValueError:
14 return 0.0
15 return 1.0 if abs(predicted - ground_truth) < 1e-3 else 0.0
16
17prompt = "A warehouse has 14 shelves with 12 boxes each. How many boxes total?"
18correct_score = verify(prompt, r"There are 14 shelves and 12 boxes each. 14 * 12 = 168. The answer is \boxed{168}.", 168)
19wrong_score = verify(prompt, r"The total is \boxed{148}.", 168)
20ambiguous_score = verify(prompt, r"It might be \boxed{168} or \boxed{148}.", 168)
21
22print(correct_score)
23print(wrong_score)
24print(ambiguous_score)
25print(f"correct_passes={correct_score == 1.0}")
26print(f"wrong_rejected={wrong_score == 0.0}")
27print(f"ambiguous_rejected={ambiguous_score == 0.0}")11.0
20.0
30.0
4correct_passes=True
5wrong_rejected=True
6ambiguous_rejected=TrueStep 4 (optional): Add a format reward. Give +0.2 for using \boxed{} correctly, and +0.8 for a correct answer inside the box. What behavior does this mixed reward encourage? Does the model still get a positive reward if the box is present but the answer is wrong?
DeepSeek-R1-Zero combined accuracy rewards with format rewards targeting a parseable response structure.[2] Mixed rewards require careful testing so formatting doesn't dominate correctness.
RLVR and distillation aren't mutually exclusive. DeepSeek-R1 used RL in its teacher pipeline, then fine-tuned smaller dense models on 800k generated training samples.[2] The systems distinction is clean: RLVR optimizes a policy against checks, while distillation trains a student from teacher outputs.
| Aspect | RLVR | Distillation |
|---|---|---|
| Training signal | Online reward from a verifier | Offline supervision from teacher outputs |
| What it optimizes | Checked success under a specified contract | Imitation of teacher behavior |
| Compute profile | Online sampling, verification, and RL updates | Supervised training over collected traces |
| Dependency | Needs a reliable verifier | Needs a useful teacher and clean trace data |
| Best use | Improve checked outcomes for selected tasks | Transfer teacher behavior into smaller models |
| Main failure mode | Reward hacking or sparse-credit collapse | Student inherits teacher blind spots and data coverage limits |
DeepSeek-R1 makes this trade-off concrete. The paper reports that distillation into smaller Qwen and Llama models outperformed its reported RL experiment on smaller models in the evaluated setting.[2] That is evidence for testing distillation when a strong teacher exists, not a universal ranking of the two methods.
RLVR still matters because distillation doesn't optimize the student online against a verifier. If the teacher's checked outcomes are insufficient, improving that policy against well-tested verifiers is a different operation from transferring its sampled outputs.
RLVR can raise the teacher's checked success rate through online sampling and reward. Once a teacher is strong on the relevant evaluations, distillation can turn sampled traces into supervised data for smaller models.
Start with structured outputs. A logistics model can emit JSON for route, capacity, cost, and cutoff time; the verifier can check schema validity, arithmetic, inventory constraints, and route feasibility. For retrieval or support tasks, a verifier might check cited order IDs against a database, confirm policy-section references, or reject answers that don't include required fields. The reward can be shaped, but every component should still fail closed.
It can degrade if training only rewards narrow reasoning tasks. Mitigations include reference-policy KL, broad SFT data after rejection sampling, final preference-style alignment for helpfulness and harmlessness, and a regression suite that includes normal chat, writing, safety, and instruction following.
Sometimes. If you can reduce the task to objective properties, such as "all facts cite matching database rows" or "all required form fields are valid," RLVR can optimize those pieces. When the core quality judgment remains subjective, the setup becomes closer to RLHF or RLAIF than classic RLVR.
\boxed{} formatting with bad answers. Fix: keep verifier fail-closed and make correctness dominate format bonuses.Tülu 3: Pushing Frontiers in Open Language Model Post-Training
Lambert, N., et al. · 2024 · arXiv preprint
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI · 2025
Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
Ouyang, L., et al. · 2022 · NeurIPS 2022
Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Rafailov, R., et al. · 2023
Training Verifiers to Solve Math Word Problems (GSM8K).
Cobbe, K., et al. · 2021
Let's Verify Step by Step.
Lightman, H., et al. · 2023 · ICLR
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., et al. · 2024
Proximal Policy Optimization Algorithms.
Schulman, J., et al. · 2017
Spurious Rewards: Rethinking Training Signals in RLVR
Shao, R., et al. · 2025