LearnAdvanced Training & AdaptationRLVR & Verifiable Rewards

⚡HardFine-Tuning & Training

RLVR & Verifiable Rewards

Understand RLVR, a post-training approach that uses programmatic verification instead of learned human-preference rewards to improve checked outcomes in math, code, and other contract-driven tasks.

40 min read

Learning path

Step 108 of 158 in the full curriculum

Constitutional AI & Red Teaming Knowledge Distillation for LLMs

Some post-training targets still need judgment against written policy. RLVR moves to a narrower setting: tasks where a program can check a concrete outcome.

Reinforcement Learning with Verifiable Rewards (RLVR) trains models from rewards assigned by executable checks rather than a learned preference model.^{[1]Reference 1Tülu 3: Pushing Frontiers in Open Language Model Post-Traininghttps://arxiv.org/abs/2411.15124} Focus on tasks where an answer, program, or structured action can be tested against a precise contract.

Two model outputs show the boundary. One drafts an incident update, where feedback is partly subjective. The other writes a Python function that either passes unit tests or fails them. The incident update requires judgment; the function has a verifiable component. Tulu 3 introduced the term RLVR for this type of post-training stage, and DeepSeek-R1 used rule-based rewards for reasoning tasks inside a broader multi-stage pipeline.^{[1]Reference 1Tülu 3: Pushing Frontiers in Open Language Model Post-Traininghttps://arxiv.org/abs/2411.15124}^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

RLVR training pipeline with prompt, candidate rollouts, automatic verifier, and policy update. — The RLVR loop: sample candidate answers, grade them with an executable verifier, then update the policy. PPO, GRPO, and other RL optimizers can consume verifiable rewards.

The alignment space

RLVR sits in the post-training stack next to SFT, RLHF, and DPO. Large Language Models (LLMs) usually learn instruction following through demonstration and preference signals first.

Most instruction-following pipelines use Supervised Fine-Tuning (SFT), where the model imitates selected demonstrations. In a support workflow, SFT shows a model how an expert handles a damaged-item return step by step. It teaches patterns present in those examples; it doesn't directly reward a newly sampled solution for satisfying an external checker.

One post-training route is Reinforcement Learning from Human Feedback (RLHF)^{[3]Reference 3Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}, where a reward model is trained to predict preference labels. That signal fits qualities such as tone or helpfulness, but depends on label collection and reward-model validity outside its training examples.

Direct Preference Optimization (DPO)^{[4]Reference 4Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290} removes the online reward-model optimization loop and trains directly from chosen/rejected pairs. Those pairs can come from humans, constitutions, or other pipelines, but they are still preference comparisons rather than executed correctness checks.

RLVR takes a different path. It specializes in tasks where success can be programmatically verified, such as final-answer math, code under tests, or formally checked proofs. It avoids collecting a preference label for each rollout, but its validity is only as strong as the specification, parser, test suite, and sandbox that assign reward.

Method	Reward Signal	Scalability	Task Scope
SFT	Demonstration data	Limited by demonstration supply	Any task with demonstrations
RLHF	Learned reward from preference labels	Requires labels and reward-model checks	Open-ended behavior goals
DPO	Chosen/rejected pairs	Avoids online reward-model RL	Tasks expressible as preferences
RLVR	Executable verification	Repeatable if verifier is cheap	Outcomes covered by a verifier

This is the central trade: RLVR gives up breadth to get a directly executable signal. In system design, the first question isn't "Which alignment method sounds modern?" It's "What behavior can this checker prove?" A coding lab might verify unit tests, type checks, and exact-output constraints by code. It still needs judgment for whether the solution is maintainable, secure, or clear.

Decision router for choosing SFT, RLHF, DPO, or RLVR based on whether the task has demonstrations, preference comparisons, or a programmatic verifier. — The first design question isn't which post-training acronym is newest. It's what kind of supervision the task can honestly provide.

Defining verifiable rewards

In the simplest outcome-only setup, a verifiable reward function $r(x, y)$ takes a problem $x$ and a generated response $y$ and returns a binary signal:

r(x, y) = \begin{cases} 1 & \text{if } y \text{ is correct (verified)} \\ 0 & \text{if } y \text{ is incorrect} \end{cases}

Use a prompt with an unambiguous final answer: "Is 97 a prime number? Format your answer as $\boxed{Yes}$ or $\boxed{No}$ ."

Iteration 1: The model says "Yes, because it ends in 7." The final answer happens to be correct, so an outcome-only verifier returns 1.0, even though the stated reason is invalid.
Iteration 2: The model says "No, 97/3 = 32.3." The answer is wrong, so the verifier returns 0.0.
Iteration 3: The model says "97 is prime because it's not divisible by 2, 3, 5, or 7." The answer is correct and the reasoning is solid, so the verifier returns 1.0.

The important limitation is visible immediately: for this prompt, Iterations 1 and 3 get the same reward. An outcome verifier does not tell the optimizer which correct answer used valid reasoning. Across many problems, reasoning patterns that correlate with correct outcomes may become more likely, but that's an empirical training result, not information contained in a single final-answer reward.

This tiny check makes that limitation concrete. It verifies only the boxed final verdict, so an unsupported guess and a valid divisibility check receive identical reward.

outcome-reward-blind-spot.py

import re

def verdict_reward(response: str, expected: str) -> float:
    matches = re.findall(r"\\boxed\{(Yes|No)\}", response)
    return 1.0 if matches == [expected] else 0.0

rollouts = {
    "lucky_reason": r"It ends in 7, so it must be prime. \boxed{Yes}",
    "valid_check": r"Check divisors at most sqrt(97): 2, 3, 5, 7 fail. \boxed{Yes}",
    "wrong_answer": r"97 is divisible by 3. \boxed{No}",
}

for name, response in rollouts.items():
    print(f"{name}: reward={verdict_reward(response, 'Yes')}")

Outcome reward blind spot

lucky_reason: reward=1.0
valid_check: reward=1.0
wrong_answer: reward=0.0

Many RLVR systems start with an all-or-nothing outcome signal because a deterministic checker can compute it repeatedly. Others add rule-based terms, so reward remains executable without being purely binary. Sparse credit assignment is the cost: if a response makes one late arithmetic error, a final-answer checker gives zero even if earlier steps were useful.

Checkable isn't complete: Verifiable means checkable, not complete. A unit-test verifier can prove whether one generated function matches the tested cases; it can't prove that a correct answer came from a general strategy unless the evaluation tests that generalization.

Verifier contract diagram showing raw model output passing through extraction, normalization, deterministic checking, and fail-closed reward assignment. — A verifier is a production contract: extract answer, normalize it, check it deterministically, and fail closed on ambiguity.

In practice, systems often mix correctness rewards with other rule-based terms. DeepSeek-R1-Zero, for example, combined accuracy rewards with format rewards so outputs stayed easy to parse and verify.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

Outcome vs. process supervision

The simplest form is outcome supervision: score the final answer. An RLVR setup can do this with a rule-based checker. Related work has also trained learned verifiers to score final answers.^{[5]Reference 5Training Verifiers to Solve Math Word Problems (GSM8K).https://arxiv.org/abs/2110.14168} This gives sparse feedback.

Process supervision scores intermediate steps instead of only the final result. Lightman et al. compare outcome and process reward models trained with human labels on mathematical reasoning steps; that's adjacent to RLVR, not itself proof that each process score is programmatically verifiable.^{[6]Reference 6Let's Verify Step by Step.https://arxiv.org/abs/2305.20050} A formal system can instead run a checker after each proof step. Either version offers more localized credit, but requires a trustworthy step-level signal.

Outcome and process supervision differ in where feedback lands and how expensive the signal is to build:

Diagram showing Prompt + candidate response, Outcome verifier, Final answer reward 0 or 1, and Cheap and scalable. — Prompt + candidate response, Outcome verifier, Final answer reward 0 or 1, and Cheap and scalable.

Caption: Outcome supervision scores the final answer. Process supervision supplies step-level feedback, either from labels or from a checker when one exists, at higher construction cost.

Examples of verifiers

Mathematics

We check if the final answer matches the ground truth, even if the reasoning path is different. The verifier below extracts exactly one boxed expression from the model's response and uses sympy to verify mathematical equivalence. This handles cases like $1/2$ versus $0.5$ correctly, while rejecting missing or ambiguous answer fields.

This implementation is small enough to run locally, but it still captures the core production contract: extract the final answer, compare it symbolically, and fail closed on parsing errors. Production math verifiers add more normalization, timeouts, and adversarial parser tests.

mathematics.py

import sympy as sp

def extract_boxed_exprs(text: str) -> list[str]:
    marker = "\\boxed{"
    expressions: list[str] = []
    offset = 0
    while (start := text.find(marker, offset)) != -1:
        depth = 0
        expr_chars: list[str] = []
        for ch in text[start + len(marker):]:
            if ch == "{":
                depth += 1
                expr_chars.append(ch)
            elif ch == "}":
                if depth == 0:
                    expressions.append("".join(expr_chars).strip())
                    offset = start + len(marker) + len(expr_chars) + 1
                    break
                depth -= 1
                expr_chars.append(ch)
            else:
                expr_chars.append(ch)
        else:
            return []
    return expressions

def math_verifier(problem: str, answer: str, ground_truth: str) -> float:
    """Verify mathematical equivalence of the final answer."""
    predicted_fields = extract_boxed_exprs(answer)
    expected_fields = extract_boxed_exprs(ground_truth)
    if len(predicted_fields) != 1 or len(expected_fields) != 1:
        return 0.0
    try:
        predicted = sp.sympify(predicted_fields[0])
        expected = sp.sympify(expected_fields[0])
        return 1.0 if sp.simplify(predicted - expected) == 0 else 0.0
    except (TypeError, ValueError, sp.SympifyError):
        return 0.0

equivalent_score = math_verifier("Compute half of one.", r"The answer is \boxed{0.5}.", r"\boxed{1/2}")
wrong_score = math_verifier("Compute 12 times 14.", r"\boxed{148}", r"\boxed{168}")
ambiguous_score = math_verifier("Compute 12 times 14.", r"Maybe \boxed{168} or \boxed{148}.", r"\boxed{168}")
print(equivalent_score)
print(wrong_score)
print(ambiguous_score)
print(f"equivalent_passes={equivalent_score == 1.0}")
print(f"wrong_answer_rejected={wrong_score == 0.0}")
print(f"ambiguous_answer_rejected={ambiguous_score == 0.0}")

Output

1.0
0.0
0.0
equivalent_passes=True
wrong_answer_rejected=True
ambiguous_answer_rejected=True

Code generation

Running generated code against a strong suite of hidden test cases checks concrete behavior. It doesn't prove full correctness unless the test suite is complete, but it's stronger than judging code by surface form. The local example below runs candidate code in a separate Python process and times out slow attempts. It isn't a security sandbox; production systems still need containers, seccomp, filesystem isolation, network controls, and resource limits.

code-generation.py

import json
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class TestCase:
    inputs: tuple[int, ...]
    expected: int

def code_verifier(problem: str, code: str, test_cases: list[TestCase]) -> float:
    """Execute candidate in a separate process and fail closed."""
    cases = [(case.inputs, case.expected) for case in test_cases]
    runner = (
        code
        + "\nimport json\n"
        + f"cases = {cases!r}\n"
        + "passed = all(solve(*inputs) == expected for inputs, expected in cases)\n"
        + "print(json.dumps({'passed': passed}))\n"
    )
    try:
        completed = subprocess.run(
            [sys.executable, "-I", "-c", runner],
            capture_output=True,
            text=True,
            timeout=1.0,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if completed.returncode != 0:
        return 0.0
    try:
        result = json.loads(completed.stdout)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if result == {"passed": True} else 0.0

cases = [TestCase((14, 12), 168), TestCase((8, 12), 96)]
good = "def solve(a, b):\n    return a * b\n"
bad = "def solve(a, b):\n    return a + b\n"

good_score = code_verifier("multiply two integers", good, cases)
bad_score = code_verifier("multiply two integers", bad, cases)
print(good_score)
print(bad_score)
print(f"good_passes={good_score == 1.0}")
print(f"bad_rejected={bad_score == 0.0}")

Output

1.0
0.0
good_passes=True
bad_rejected=True

Visible tests aren't enough when the policy can memorize examples or infer shortcuts. In this miniature check, a candidate passes the two examples shown during development but fails a held-out case. A reward based only on visible tests would reinforce the wrong program.

held-out-code-tests.py

from collections.abc import Callable

def shortcut_multiplier(a: int, b: int) -> int:
    known = {(14, 12): 168, (8, 12): 96}
    return known.get((a, b), 0)

def pass_rate(fn: Callable[[int, int], int], cases: list[tuple[int, int, int]]) -> float:
    passed = sum(fn(a, b) == expected for a, b, expected in cases)
    return passed / len(cases)

visible = [(14, 12, 168), (8, 12, 96)]
held_out = [(7, 9, 63), (11, 13, 143)]

print(f"visible_pass_rate={pass_rate(shortcut_multiplier, visible):.0%}")
print(f"held_out_pass_rate={pass_rate(shortcut_multiplier, held_out):.0%}")
print(f"ship={pass_rate(shortcut_multiplier, held_out) == 1.0}")

Held-out test check

visible_pass_rate=100%
held_out_pass_rate=0%
ship=False

Formal logic

Proof assistants such as Lean or Isabelle can check that a candidate proof term satisfies a formal theorem under their kernel and imported definitions. In production, a verifier would wrap the theorem statement and candidate proof into a Lean script, then return 1.0 only if the prover accepts the full artifact.

The prover call itself is infrastructure-specific: it depends on your Lean version, package cache, sandbox, and timeout policy. Keep the reward conversion boring and testable, and keep the theorem-prover runner behind a clear boundary.

formal-logic.py

from dataclasses import dataclass

@dataclass(frozen=True)
class ProverResult:
    accepted: bool
    stderr: str

def proof_reward(result: ProverResult) -> float:
    """Convert a theorem prover result into a verifiable reward."""
    return 1.0 if result.accepted else 0.0

accepted = proof_reward(ProverResult(accepted=True, stderr=""))
rejected = proof_reward(ProverResult(accepted=False, stderr="unknown identifier"))
print(accepted)
print(rejected)
print(f"accepted_maps_to_one={accepted == 1.0}")
print(f"rejected_maps_to_zero={rejected == 0.0}")

Output

1.0
0.0
accepted_maps_to_one=True
rejected_maps_to_zero=True

What can't be verified?

RLVR is effective but narrow. It works best when the task has an objective spec and a verifier that fails closed. That rules out many open-ended tasks, or at least forces you to break them into smaller verifiable sub-problems. It's not a clean fit for:

Open-ended writing: "Write three status-page headlines for an outage" has no objective truth.
Subjective analysis: "Explain the trust impact of a release delay" has many valid answers.
Safety alignment: Deciding if a response is harmful usually requires policy judgment or an AI proxy, which is closer to RLHF or RLAIF than a deterministic correctness check.

Group relative policy optimization (GRPO)

DeepSeek-R1 used Group Relative Policy Optimization (GRPO), an algorithm introduced in DeepSeekMath^{[7]Reference 7DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Modelshttps://arxiv.org/abs/2402.03300} that avoids a learned value model by estimating advantages from a group of samples for the same prompt. DeepSeek-R1 later used GRPO in its reasoning-training stages.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

The problem with PPO

In standard Proximal Policy Optimization (PPO)^{[8]Reference 8Proximal Policy Optimization Algorithms.https://arxiv.org/abs/1707.06347}, we need a Critic model (or Value Function) that predicts "how good is this state?". The Critic estimates the expected future reward $V_\phi(s_t)$ .

In the terminal-reward setting common to LLM training (where the reward is observed only at the end of a sequence), the advantage simplifies to:

A_t = R - V_\phi(s_t)

Where $R$ is the final reward and $V_\phi$ is the critic's value estimate. More generally, the advantage uses a temporal-difference error $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ , often computed via GAE (Generalized Advantage Estimation) over the full sequence.

Training a critic alongside an LLM adds model memory and compute. With terminal rewards, it must also predict eventual success from partial generations, which becomes difficult when a long response can later revise an earlier mistake.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

The GRPO solution

GRPO eliminates the separate Critic model entirely. Instead of learning a value function that tries to predict how good any reasoning trajectory is, GRPO computes advantages relative to the other samples generated for the exact same prompt in the current batch.

For verifiable tasks, the comparison is local to the prompt. For a given math problem or code task, sample several outputs. Some reach the checked outcome and receive high reward; others fail and receive low reward. The successful outputs are better than average for this sampled group. You don't need a learned critic that estimates difficulty across prompts; you do need reward variation inside each group.

Mathematically, for a prompt $q$ you sample a group of $G$ outputs $\{o_1, \dots, o_G\}$ . The advantage for output $i$ is the z-score of its reward within that group:

A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}

Positive advantages push the policy toward those trajectories; negative advantages push it away. Local normalization removes the separate critic-estimation problem and compares attempts at the same prompt difficulty. It doesn't guarantee stable training: if all sampled outputs receive the same reward, the group contributes no comparative signal.

Where $r_i$ is sample $i$ 's reward and normalization is done within the sampled group of size $G$ for the same prompt.

This concrete example uses $G = 8$ samples for a single math prompt. The verifier returns 1.0 for a correct final answer and 0.0 for an incorrect one:

Sample	Answer	Reward $r_i$	Group Mean	Group Std	Advantage $A_i$
1	168	1.0	0.5	0.5	+1.0
2	148	0.0	0.5	0.5	-1.0
3	156	0.0	0.5	0.5	-1.0
4	168	1.0	0.5	0.5	+1.0
5	170	0.0	0.5	0.5	-1.0
6	168	1.0	0.5	0.5	+1.0
7	144	0.0	0.5	0.5	-1.0
8	168	1.0	0.5	0.5	+1.0

How to read the table: Four samples got the right answer (168), so their rewards are 1.0. Four got it wrong, so their rewards are 0.0. The group mean is $4/8 = 0.5$ , and the standard deviation is 0.5. Sample 1's advantage is $(1.0 - 0.5) / 0.5 = +1.0$ . That positive advantage tells the optimizer to increase the probability of the tokens that led to sample 1. Sample 2's advantage is $(0.0 - 0.5) / 0.5 = -1.0$ , so the optimizer pushes down the probability of the tokens that led to sample 2.

Notice what the table hides as well as what it shows. Every correct answer gets the same positive advantage, and every wrong answer gets the same negative advantage. GRPO doesn't know why sample 1 was correct; it only knows that sample 1 was better than the average for this prompt. Over many prompts, the policy learns which reasoning patterns reliably produce above-average rewards.

GRPO group advantage diagram showing eight sampled answers, binary verifier rewards, group mean 0.5, group standard deviation 0.5, and positive or negative advantages. — GRPO replaces a learned critic with group-relative statistics. Samples that beat group mean get positive advantage; samples below the mean get pushed down.

The update resembles a PPO-clipped objective with group-relative advantage and a KL penalty. The published GRPO objective is applied at generated-token level; this sequence-level sketch keeps the two ideas visible without reproducing all implementation detail:

\begin{aligned} \mathcal{L}_{\text{GRPO}} = -\frac{1}{G} \sum_{i=1}^{G} \Big[ &\min\left(\rho_i A_i,\; \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i\right) \\ &- \beta D_{\text{KL}}(\pi_\theta \|\pi_{\text{ref}}) \Big] \end{aligned}

Where $\rho_i = \frac{\pi_\theta(o_i|q)}{\pi_{\text{old}}(o_i|q)}$ is a sequence-level policy-ratio shorthand, $A_i$ is the normalized group advantage, $\epsilon$ is the clipping threshold, and $\beta$ controls the drift penalty toward the reference policy.

The Kullback-Leibler (KL) term penalizes drift from a reference policy. It can constrain movement toward a verifier exploit; it can't prove that the verifier represents the intended behavior.

One caveat falls straight out of the formula: if every sample in the group gets the same reward, the standard deviation is 0 and vanilla GRPO gets no learning signal from that group. That's one reason prompt difficulty, group size, and reward shaping matter in practice.

Common mistake: Saying that three equal rewards out of four collapse the GRPO signal. They don't: the one different outcome creates reward variance. Signal disappears when every sampled output gets the same reward, such as all failures on a prompt that's too hard or all passes on a prompt that's too easy. DeepSeek-R1 reports sampling 16 outputs per question in its reasoning RL setups.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

This diagnostic identifies which prompt groups supply comparative signal before an update. The mixed group is usable even though three out of four answers fail; the all-fail and all-pass groups carry no relative ranking information.

reward-spread-diagnostic.py

import torch

groups = {
    "mixed": torch.tensor([0.0, 0.0, 0.0, 1.0]),
    "all_fail": torch.tensor([0.0, 0.0, 0.0, 0.0]),
    "all_pass": torch.tensor([1.0, 1.0, 1.0, 1.0]),
}

for name, rewards in groups.items():
    std = float(rewards.std(unbiased=False))
    informative = std > 0.0
    print(f"{name}: std={std:.3f}, informative={informative}")

Reward spread diagnostic

mixed: std=0.433, informative=True
all_fail: std=0.000, informative=False
all_pass: std=0.000, informative=False

This runnable scalar version of the GRPO math treats each sampled trajectory as one log-probability value, which is enough to test the group-normalized advantage, PPO-style clipping, and KL penalty. The KL term uses the positive estimator published with GRPO: $\exp(\log \pi_{\text{ref}} - \log \pi_\theta) - (\log \pi_{\text{ref}} - \log \pi_\theta) - 1$ . A production trainer applies the objective at token level, batches generation across all $G$ samples, and stores the old-policy log probabilities during rollout.^{[7]Reference 7DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Modelshttps://arxiv.org/abs/2402.03300}

the-grpo-solution.py

import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    std = rewards.std(unbiased=False)
    if float(std) == 0.0:
        return torch.zeros_like(rewards)
    return (rewards - rewards.mean()) / (std + 1e-8)

def clipped_grpo_loss(
    log_probs_new: torch.Tensor,
    log_probs_old: torch.Tensor,
    log_probs_ref: torch.Tensor,
    advantages: torch.Tensor,
    epsilon: float = 0.2,
    beta: float = 0.01,
) -> torch.Tensor:
    ratios = torch.exp(log_probs_new - log_probs_old)
    clipped_ratios = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
    policy_loss = -torch.min(ratios * advantages, clipped_ratios * advantages).mean()

    log_ratio_ref = log_probs_ref - log_probs_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1).mean()
    return policy_loss + beta * kl

rewards = torch.tensor([1, 0, 0, 1, 0, 1, 0, 1], dtype=torch.float32)
advantages = group_advantages(rewards)
advantages_match = bool(torch.allclose(advantages, torch.tensor([1, -1, -1, 1, -1, 1, -1, 1], dtype=torch.float32)))

trajectory_log_probs = torch.nn.Parameter(torch.zeros(8))
old_log_probs = torch.zeros(8)
reference_log_probs = torch.zeros(8)

optimizer = torch.optim.SGD([trajectory_log_probs], lr=0.1)
loss = clipped_grpo_loss(trajectory_log_probs, old_log_probs, reference_log_probs, advantages)
loss.backward()

# Positive-advantage trajectories should be pushed up; negative ones down.
grad_signs_ok = (
    trajectory_log_probs.grad is not None
    and trajectory_log_probs.grad[0] < 0
    and trajectory_log_probs.grad[1] > 0
)

optimizer.step()
step_direction_ok = trajectory_log_probs[0] > 0 and trajectory_log_probs[1] < 0

print("advantages:", [round(float(x), 1) for x in advantages])
print("updated log probs:", [round(float(x), 3) for x in trajectory_log_probs.detach()])
print(f"advantages_match={advantages_match}")
print(f"grad_signs_ok={bool(grad_signs_ok)}")
print(f"step_direction_ok={bool(step_direction_ok)}")

Output

advantages: [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
updated log probs: [0.013, -0.013, -0.013, 0.013, -0.013, 0.013, -0.013, 0.013]
advantages_match=True
grad_signs_ok=True
step_direction_ok=True

GRPO is an optimizer, not a synonym for RLVR. Tülu 3 trained its RLVR stage with PPO, while DeepSeekMath introduced GRPO as a PPO variant and DeepSeek-R1 used GRPO in its reasoning stages.^{[1]Reference 1Tülu 3: Pushing Frontiers in Open Language Model Post-Traininghttps://arxiv.org/abs/2411.15124}^{[7]Reference 7DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Modelshttps://arxiv.org/abs/2402.03300}^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} Both optimizers can consume rule-based verifier rewards.

Feature	PPO-style policy optimization	GRPO
Baseline	Predicted by a separate Critic model	Mean reward of the sampled group
Additional value-model cost	Trains a critic/value model	Avoids a separate critic
Main estimation risk	Critic must predict returns	Groups with equal rewards give no relative signal
Compatible reward signals	Learned or rule-based reward	Learned or rule-based reward, including binary checks

Self-check: In the worked table above, what would happen to the advantages if all 8 samples got the same reward? Why does that make prompt difficulty and group size important in practice?

Why GRPO often fits RLVR

Memory efficiency

No need to hold a large Critic model in GPU memory.

Comparative baseline

The group mean is local to the current prompt. GRPO avoids training a critic to predict returns from partial reasoning traces, though it still depends on reward spread, a sound verifier, and well-tuned optimization.

Binary rewards are a natural fit

Binary $\{0,1\}$ rewards make the group comparison easy to read: when a group contains both passes and failures, passes receive positive advantage and failures negative advantage. If all candidates pass or all fail, their advantages are zero. No value function is needed, but prompt selection and exploration must produce mixed outcomes often enough to learn.

DeepSeek-R1 training pipeline

RLVR success depends on the pipeline as well as the algorithm. DeepSeek-R1^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} used a specific multi-stage process to bootstrap reasoning.

DeepSeek-R1's reported pipeline alternates supervised fine-tuning and reinforcement learning to improve checked reasoning, readability, and broader assistant behavior:

Diagram showing Base model, Stage 1: cold-start SFT readable reasoning, Stage 2: reasoning RL GRPO + rule-based rewards, and Rule-based rewards accuracy + format + language consistency. — Base model, Stage 1: cold-start SFT readable reasoning, Stage 2: reasoning RL GRPO + rule-based rewards, and Rule-based rewards accuracy + format + language consistency.

Caption: DeepSeek-R1's reported pipeline starts with readable cold-start behavior, then applies GRPO with rule-based rewards on reasoning tasks, including a language-consistency reward. Later stages collect filtered traces into SFT data, mix in broader instruction data, and combine reasoning rewards with general-behavior reward models.

Stage 1: cold start (not mandatory, often useful)

DeepSeek-R1-Zero showed that DeepSeek-V3-Base could improve on reported reasoning evaluations through rule-based RL without a preliminary SFT phase.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} That result doesn't mean every base model or verifier will train usefully from zero-reward-heavy rollouts.

DeepSeek still added cold-start data for DeepSeek-R1 because R1-Zero had poor readability and mixed languages. The paper describes collecting thousands of long-CoT examples, filtering for a readable response pattern, and using them to initialize the model before RL.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} In other projects, cold start also helps when the base model's initial pass rate is so low that almost every rollout gets zero reward.

Stage 2: reasoning RL (GRPO)

This stage trains on math, code, and logic prompts using GRPO with rule-based rewards. In DeepSeek-R1-Zero, those rewards were accuracy plus format. In DeepSeek-R1, the first RL stage also added a language-consistency reward to reduce mixed-language chains of thought, with the paper reporting a slight performance trade-off in its ablation.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

For rule-graded prompts, a response that fails the checked outcome doesn't receive the accuracy reward no matter how persuasive its text is. That still leaves coverage risk: a pass can reflect a shortcut if the verifier or evaluation set fails to test it.

Stage 3: rejection sampling and SFT

Once the RL policy is producing higher-scoring traces, they can become new demonstrations. DeepSeek-R1 used rejection sampling, retaining checked-correct responses for rule-gradable data and using DeepSeek-V3 judgments for some expanded reasoning data.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

In the paper, that stage produced about 600k reasoning samples plus about 200k non-reasoning samples, for roughly 800k total.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} This turns exploratory RL behavior into a stable supervised dataset and helps recover capabilities that pure reasoning RL doesn't optimize for, such as writing quality and everyday instruction following.

Stage 4: final RL (All scenarios)

Pure reasoning RL can produce a model that's strong on benchmarks but rough around the edges. In the final stage, DeepSeek runs RL again, this time on a broader mix of scenarios.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

This uses a mix of:

Rule-based rewards to maintain strong reasoning capabilities.
Reward models (RLHF) for helpfulness and harmlessness in general conversation.

This final stage is intended to combine reasoning performance with broader helpfulness and harmlessness objectives; those outcomes must still be measured rather than assumed.

Observed reasoning patterns

Outcome-checked RL doesn't label each reasoning step. In DeepSeek-R1-Zero, the authors report longer responses and visible patterns such as verification, reflection, and exploration of alternatives during training.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} Treat those as observed output behaviors, not proof of an internal reasoning mechanism.

Self-verification patterns

An output may pause and check its work.

"Wait, let me double-check that calculation. $14 \times 12$ is $168$ , not $148$ . I made a mistake."

The same behavior appears when a model verifies a rate-limit calculation:

"The subtotal is $47.50 and tax is 6%. 47.50 × 0.06 = 2.85, so the total should be $50.35. Let me confirm: 47.50 + 2.85 = 50.35. Correct."

DeepSeek-R1-Zero wasn't trained from step-level labels saying "write let me check here." Its paper reports a sharp increase in the use of "wait" during reflection later in training.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} The cautious interpretation is that RL changed the distribution of generated traces and that some reflective-looking traces co-occurred with improved checked outcomes. An outcome-only verifier doesn't establish why those traces improved.

Backtracking patterns

An output may abandon one approach and try another.

"This approach using geometry seems too complex. Let me try using coordinate algebra instead."

This can look search-like on the surface, but the training loop is still autoregressive generation plus reward-weighted updates, not an explicit tree-search controller. A verifier credits the final checked outcome; it doesn't separately establish that the pivot was necessary.

Extended thought

In DeepSeek-R1-Zero, average response length jumped sharply during training alongside reported accuracy gains.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} This connects RLVR to test-time compute because trained policies may emit longer traces on reasoning tasks. Longer traces aren't automatically better, though; their value must be measured against correctness, latency, and token cost.

A useful self-check: suppose a model trained with outcome rewards starts saying "Wait, let me double-check that" before finalizing answers. What can you conclude from the output pattern, and what would require separate evaluation? The reward confirms correct final answers, not the necessity or faithfulness of the written reflection.

Research caution: It remains open how much RLVR creates new problem-solving behavior versus eliciting behavior already likely under the base model. Shao et al. found that random and format-only rewards substantially improved MATH-500 results for Qwen2.5-Math in their setup, while comparable spurious rewards gave little benefit or harmed Llama3 and OLMo2 variants.^{[9]Reference 9Spurious Rewards: Rethinking Training Signals in RLVRhttps://arxiv.org/abs/2506.10947} They connect this result to GRPO clipping and model-specific high-prior behaviors. Practical takeaway: compare against weak or spurious-reward baselines and validate across model families and held-out tasks.

Reward hacking and failure modes

Any time you optimize a metric, a policy can exploit gaps in that metric. RLVR is no exception. If the verifier checks only final answers, a memorized answer, leaked target, or weak parser can receive reward without demonstrating general solution ability. DeepSeek-R1 explicitly identifies reliable reward construction as a limitation once tasks can't be graded by dependable rules.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

Don't train against a verifier before adversarially testing it. Any malformed output or shortcut that earns reward can be reinforced by the optimization loop.

Format gaming

A format-heavy reward can favor output wrappers over correctness. For example, if the verifier gives substantial credit for \boxed{...} before checking the value, it can reinforce neatly formatted wrong answers. A parser that accepts any matching line also risks rewarding output that contains several contradictory answers.

High format compliance with low task accuracy means the reward function prizes formatting before correctness. Make correctness dominant and fail closed on wrong or ambiguous contents, even if the wrapper is present.

This example compares a broken format-first reward with a correctness-gated version. A wrong boxed answer should never get more reward than an unboxed correct answer just because it's easy to parse.

format-reward-trap.py

import re

def boxed_value(response: str) -> int | None:
    matches = re.findall(r"\\boxed\{(\d+)\}", response)
    return int(matches[0]) if len(matches) == 1 else None

def broken_reward(response: str, expected: int) -> float:
    value = boxed_value(response)
    return 0.7 if value is not None else (0.3 if str(expected) in response else 0.0)

def gated_reward(response: str, expected: int) -> float:
    value = boxed_value(response)
    return 1.0 if value == expected else 0.0

wrong_boxed = r"The answer is \boxed{148}."
right_plain = "The answer is 168."

print(f"broken_prefers_wrong={broken_reward(wrong_boxed, 168) > broken_reward(right_plain, 168)}")
print(f"gated_wrong_boxed={gated_reward(wrong_boxed, 168)}")

Format reward trap

broken_prefers_wrong=True
gated_wrong_boxed=0.0

Shortcut exploitation

If a training set has shortcuts (for example, one multiple-choice position is correct much more often), optimizing checked training reward can favor that shortcut. A code verifier that treats execution errors as passing outcomes creates an even more direct loophole.

Mitigation strategies

To prevent policies from exploiting the verifier, several defensive practices help during training:

Explicit verifier contracts: Use symbolic checking (such as sympy) and isolated execution environments instead of brittle regex parsing. Ensure the verifier fails closed (returns 0 on any parsing error).
Difficulty calibration: Include prompts where rollouts produce both passing and failing outputs often enough for group-relative learning; all-zero groups contribute no comparative advantage.
Held-out tests: Evaluate on independent problems and, when possible, an independently implemented checker so a parser shortcut isn't mistaken for generalization.
Length controls: Cap rollout length or add carefully tuned penalties so the model doesn't learn to think forever without improving correctness.

RLVR verifier pressure map showing rollouts flowing through extraction, checking, and policy update, with scoped verification measuring checked outcomes while weak verification rewards shortcuts. — RLVR pushes on whatever the verifier measures. A scoped checker can prove selected outcomes; a weak checker can reward shortcuts.

When RLVR breaks

Assuming RLVR works for any task. A reward function that secretly depends on taste, safety judgment, or fuzzy grader text is forcing an objective RL method onto a subjective task. Ask whether a deterministic checker, theorem prover, unit test suite, database constraint, or schema validator can grade the output. If not, use RLHF, DPO, Reinforcement Learning from AI Feedback (RLAIF), or decompose the task into smaller verifiable checks.

Treating cold-start SFT as either mandatory or useless. DeepSeek-R1-Zero improved reported reasoning-task results from DeepSeek-V3-Base without a preliminary SFT phase.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} DeepSeek-R1 still used cold-start SFT because readability and language consistency mattered. Measure the base model's initial pass rate and output quality. If almost every rollout gets zero reward or the traces are unreadable, cold-start data may make RL more tractable.

Celebrating format accuracy. A verifier that rewards \boxed{} too strongly can teach box-writing instead of problem-solving. Keep correctness dominant, run adversarial parser tests, and track real task accuracy separately from format compliance.

Ignoring general capability drift. A model trained heavily on math or code RLVR can become better at those tasks while getting worse at normal instruction following. Use KL anchoring, mix in general instruction data during later SFT stages, and run broad evaluations such as writing, safety, and everyday chat alongside math or code metrics.

This release gate catches a reasoning gain that comes with an unacceptable instruction-following regression. The numbers are illustrative; each team should define thresholds before training.

capability-regression-gate.py

baseline = {"checked_reasoning": 0.61, "instruction_following": 0.92, "false_refusal": 0.04}
candidate = {"checked_reasoning": 0.74, "instruction_following": 0.81, "false_refusal": 0.13}

violations = []
if candidate["checked_reasoning"] <= baseline["checked_reasoning"]:
    violations.append("no reasoning gain")
if candidate["instruction_following"] < baseline["instruction_following"] - 0.03:
    violations.append("instruction following regressed")
if candidate["false_refusal"] > baseline["false_refusal"] + 0.02:
    violations.append("false refusals increased")

print(f"reasoning_gain={candidate['checked_reasoning'] - baseline['checked_reasoning']:+.0%}")
print(f"ship={not violations}")
print(violations)

Capability gate output

reasoning_gain=+13%
ship=False
['instruction following regressed', 'false refusals increased']

Treating RLVR and distillation as competing choices. RLVR can improve checked outcomes for a teacher; distillation transfers sampled teacher behavior into a smaller model. Decide which bottleneck you have: if the teacher fails verifiable evaluations, train against better checks; if a capable teacher is too expensive to serve, consider distillation.

A tiny verifier lab

To understand RLVR, build a verifier yourself. You don't need a GPU cluster or a billion-parameter model. A Python script and a small set of arithmetic problems are enough to see the mechanics.

Checkpoint: Before building the verifier, state the exact contract: require one answer field, extract one number, compare with tolerance, and fail closed on missing, duplicated, or malformed fields. If that contract feels vague, re-read the math verifier example above.

Exercise: Write a verifier that takes a model's raw text output and returns 1.0 if the answer is correct and 0.0 otherwise. Use the following prompt and ground-truth pairs:

Prompt	Ground Truth
"A GPU batch runner has 14 workers with 12 slots each. How many concurrent jobs fit?"	168
"A scheduler runs 8 jobs per wave. How many waves are needed for 96 jobs?"	12
"A request budget is 47.50 units with a 6% overhead. What's the total?"	50.35

Step 1: Implement extract_answer(text: str) -> str | None that accepts exactly one \boxed{...} answer field. Reject missing or duplicated fields rather than guessing which number the model intended as final.

Step 2: Implement verify(prompt: str, response: str, ground_truth: float) -> float that extracts the predicted answer, compares it to the ground truth with a small tolerance (e.g., abs(predicted - expected) < 1e-3), and returns 1.0 or 0.0.

Step 3: Test your verifier against these three model outputs for the first prompt:

"There are 14 shelves and 12 boxes each. 14 × 12 = 168. The answer is $\boxed{168}$ ."
"The total is $\boxed{148}$ ."
"It might be $\boxed{168}$ or $\boxed{148}$ ."

Expected results: output 1 should return 1.0. Outputs 2 and 3 should return 0.0. The third response contains the correct value, but its final answer is ambiguous.

Start from this minimal solution:

a-tiny-verifier-lab.py

import re

def extract_answer(text: str) -> str | None:
    boxed = re.findall(r"\\boxed\{(-?\d+(?:\.\d+)?)\}", text)
    return boxed[0] if len(boxed) == 1 else None

def verify(prompt: str, response: str, ground_truth: float) -> float:
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    try:
        predicted = float(answer)
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - ground_truth) < 1e-3 else 0.0

prompt = "A GPU batch runner has 14 workers with 12 slots each. How many concurrent jobs fit?"
correct_score = verify(prompt, r"There are 14 shelves and 12 boxes each. 14 * 12 = 168. The answer is \boxed{168}.", 168)
wrong_score = verify(prompt, r"The total is \boxed{148}.", 168)
ambiguous_score = verify(prompt, r"It might be \boxed{168} or \boxed{148}.", 168)

print(correct_score)
print(wrong_score)
print(ambiguous_score)
print(f"correct_passes={correct_score == 1.0}")
print(f"wrong_rejected={wrong_score == 0.0}")
print(f"ambiguous_rejected={ambiguous_score == 0.0}")

Output

1.0
0.0
0.0
correct_passes=True
wrong_rejected=True
ambiguous_rejected=True

Step 4 (optional): Add a format reward. Give +0.2 for using \boxed{} correctly, and +0.8 for a correct answer inside the box. What behavior does this mixed reward encourage? Does the model still get a positive reward if the box is present but the answer is wrong? For a strict production verifier, decide whether that partial credit is worth the gaming risk or whether correctness should gate every positive reward.

DeepSeek-R1-Zero combined accuracy rewards with format rewards targeting a parseable response structure.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} Mixed rewards require careful testing so formatting doesn't dominate correctness.

RLVR and distillation

RLVR and distillation aren't mutually exclusive. DeepSeek-R1 used RL in its teacher pipeline, then fine-tuned smaller dense models on 800k generated training samples.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} The systems distinction is clean: RLVR optimizes a policy against checks, while distillation trains a student from teacher outputs.

Aspect	RLVR	Distillation
Training signal	Online reward from a verifier	Offline supervision from teacher outputs
What it optimizes	Checked success under a specified contract	Imitation of teacher behavior
Compute profile	Online sampling, verification, and RL updates	Supervised training over collected traces
Dependency	Needs a reliable verifier	Needs a useful teacher and clean trace data
Best use	Improve checked outcomes for selected tasks	Transfer teacher behavior into smaller models
Main failure mode	Reward hacking or sparse-credit collapse	Student inherits teacher blind spots and data coverage limits

DeepSeek-R1 makes this trade-off concrete. The paper reports that distillation into smaller Qwen and Llama models outperformed its reported RL experiment on smaller models in the evaluated setting.^{[2]Reference 2DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} That's evidence for testing distillation when a strong teacher exists, not a universal ranking of the two methods.

RLVR still matters because distillation doesn't optimize the student online against a verifier. If the teacher's checked outcomes are insufficient, improving that policy against well-tested verifiers is a different operation from transferring its sampled outputs.

Follow-up questions

Why might a lab train a teacher with RLVR before distilling it?

RLVR can raise the teacher's checked success rate through online sampling and reward. Once a teacher is strong on the relevant evaluations, distillation can turn sampled traces into supervised data for smaller models.

How would you design a verifier beyond math and code?

Start with structured outputs. A deployment planner can emit JSON for region, capacity, cost, and cutoff time; the verifier can check schema validity, arithmetic, quota constraints, and rollout-window feasibility. For retrieval or support tasks, a verifier might check cited document IDs against an index, confirm policy-section references, or reject answers that don't include required fields. The reward can be shaped, but every component should still fail closed.

What happens to general conversation ability during RLVR?

It can degrade if training only rewards narrow reasoning tasks. Mitigations include reference-policy KL, broad SFT data after rejection sampling, final preference-style alignment for helpfulness and harmlessness, and a regression suite that includes normal chat, writing, safety, and instruction following.

Could RLVR handle open-ended tasks with a strong enough verifier?

Sometimes. If you can reduce the task to objective properties, such as "all facts cite matching database rows" or "all required form fields are valid," RLVR can optimize those pieces. When the core quality judgment remains subjective, the setup becomes closer to RLHF or RLAIF than classic RLVR.

Mastery check

Key concepts

Define RLVR as reinforcement learning against programmatic verification, not learned human-preference scoring.
Decide whether a task belongs in RLVR, RLHF, DPO, Reinforcement Learning from AI Feedback (RLAIF), or a decomposed hybrid.
Explain why GRPO can remove the critic by comparing samples from the same prompt group.
Describe the self-verification, backtracking, and longer-response patterns observed during rule-reward RL without overstating what outcome reward proves.
Debug a verifier that creates format gaming, shortcut exploitation, sparse-reward collapse, or general-skill drift.
Explain why a training pipeline may use RLVR to improve checked teacher outcomes and distillation to make useful behavior cheaper to serve.

What strong answers show

Strong: separates RLVR from RLHF, DPO, and distillation by naming the supervision source and bottleneck for each.
Strong: explains verifier design as a fail-closed contract, not a loose regex or grading heuristic.
Strong: can derive GRPO group-relative advantage and explain why weak reward spread or weak verifiers break learning.
Strong: treats visible self-verification and backtracking as observed generation patterns, not proof that an outcome verifier taught a general reasoning module.
Weak: assumes any subjective task can be forced into RLVR or confuses format accuracy with real task success.
Weak: treats RLVR and distillation as substitutes instead of teacher-improvement versus teacher-compression stages.

When RLVR breaks

Rewarding output wrappers more than correctness. Symptom: nice \boxed{} formatting with bad answers. Fix: keep verifier fail-closed and make correctness dominate format bonuses.
Using RLVR on tasks that still need taste or policy judgment. Symptom: noisy verifier rules that secretly act like a brittle human preference model. Fix: switch to RLHF, DPO, or Reinforcement Learning from AI Feedback (RLAIF), or split task into smaller verifiable checks.
Ignoring broad capability drift after narrow reasoning RL. Symptom: math improves while normal chat or writing gets worse. Fix: keep KL anchoring, mix broader SFT data, and run non-reasoning regressions.

Next Step

Continue to Knowledge Distillation for LLMs

Verifiable rewards can improve checked outcomes in a teacher; distillation asks how to transfer useful teacher behavior into a smaller, cheaper model.

PreviousConstitutional AI & Red Teaming

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Tülu 3: Pushing Frontiers in Open Language Model Post-Training

Lambert, N., et al. · 2024 · arXiv preprint

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI · 2025

Training Language Models to Follow Instructions with Human Feedback (InstructGPT).

Ouyang, L., et al. · 2022 · NeurIPS 2022

Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Rafailov, R., et al. · 2023

Training Verifiers to Solve Math Word Problems (GSM8K).

Cobbe, K., et al. · 2021

Let's Verify Step by Step.

Lightman, H., et al. · 2023 · ICLR

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., et al. · 2024

Proximal Policy Optimization Algorithms.

Schulman, J., et al. · 2017

Spurious Rewards: Rethinking Training Signals in RLVR

Shao, R., et al. · 2025

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

RLVR & Verifiable Rewards

The alignment space

Defining verifiable rewards

Why is a binary verifier attractive for RLVR?

Outcome vs. process supervision

Examples of verifiers

Mathematics

Code generation

Formal logic

What can't be verified?

Why must an RLVR verifier fail closed?

Group relative policy optimization (GRPO)

The problem with PPO

The GRPO solution

What does a positive GRPO advantage mean?

Why GRPO often fits RLVR

Memory efficiency

Comparative baseline

Binary rewards are a natural fit

DeepSeek-R1 training pipeline

Stage 1: cold start (not mandatory, often useful)

Stage 2: reasoning RL (GRPO)

Stage 3: rejection sampling and SFT

Stage 4: final RL (All scenarios)

Observed reasoning patterns

Self-verification patterns

Backtracking patterns

Extended thought

Reward hacking and failure modes

Format gaming

Shortcut exploitation

Mitigation strategies

When RLVR breaks

A tiny verifier lab

RLVR and distillation

Follow-up questions

Why might a lab train a teacher with RLVR before distilling it?

How would you design a verifier beyond math and code?

What happens to general conversation ability during RLVR?

Could RLVR handle open-ended tasks with a strong enough verifier?

Mastery check

Key concepts

What strong answers show

When RLVR breaks

Mastery Check

Discussion