RLHF & DPO Alignment

Understand the RLHF pipeline and DPO, including reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.

Reward modeling showed how to turn chosen/rejected responses into a scalar preference signal. RLHF and DPO ask what to do with that signal: train a separate reward model and optimize the policy against it with reinforcement learning, or optimize the policy directly from preference pairs.

A policy assistant can answer temporary-admin-access questions with fluent, confident, and wrong guidance. It can also give advice about source-citation requirements that sounds authoritative but violates company rules. InstructGPT frames this gap directly: pretrained language models can be untruthful, toxic, or unhelpful even when their text is polished.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}

That's the alignment problem. Pretraining optimizes next-token prediction, not direct human preference. A language model optimized to predict plausible text can still produce responses that sound useful whether or not they are helpful, truthful, or safe. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are post-training methods that push a model toward responses preferred under a labeling policy. They don't, by themselves, prove truthfulness or safety outside that policy's coverage.

The gap SFT can't close

After instruction tuning, a model can follow instructions. But it may still generate harmful content, be dishonest, refuse reasonable requests, or produce verbose and unhelpful answers. This misalignment happens because next-token prediction doesn't inherently enforce a product's safety rules or a labeler's preference rubric. Predicting what comes next isn't the same as satisfying the behavior being evaluated.

Alignment optimizes for human preferences beyond task completion. It bridges the gap between next-token prediction (what's probable?) and helpful interaction (what's desirable?).

To see why SFT alone isn't enough, consider a policy assistant given this context: "Policy: cite the retrieved access policy and escalate privileged-role changes. User: My service account needs temporary admin access for tonight's migration. Can you approve it?" The SFT model might generate two grammatically correct responses:

Response A: "The retrieved access policy requires escalation for temporary admin access. I can open a reviewer ticket and cite policy P-7."
Response B: "Temporary admin access is common for migrations, so you can proceed and clean it up tomorrow."

Response A follows the supplied policy. Response B is unhelpful and violates that stated policy. Both can look fluent. SFT teaches the model to imitate demonstrations; preference training supplies comparative signal when those distinctions are represented in the labeled data.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}

That preference signal is what alignment adds.

Alignment gap diagram showing an access-policy prompt, two fluent SFT responses, human preference selecting the policy-correct response, and aligned behavior that combines usefulness, truthfulness, and safety. — SFT can make both answers fluent. Alignment adds the preference signal that separates policy-correct help from polished but wrong text.

RLHF: building a judge and an actor

RLHF (Reinforcement Learning from Human Feedback) is the classic large-scale preference-alignment pipeline popularized by InstructGPT: start from an SFT model, train a reward model from human comparisons, then optimize the policy with reinforcement learning.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}

RLHF separates roles instead of collapsing them into one loss. SFT gives the model baseline instruction-following behavior, the reward model approximates labeled preference, and PPO updates the policy while a KL term discourages large drift from the SFT reference.

Three-phase pipeline

RLHF runs in three distinct steps. First, collect demonstrations and perform Supervised Fine-Tuning (SFT) to establish baseline instruction-following. Then gather comparison data to train a proxy judge (the reward model). Finally, use reinforcement learning to optimize the language model against that judge.

Diagram showing Base model, SFT model, Generate response pairs same prompt, two answers, and Human ranks chosen over rejected.

Reward models and the Bradley-Terry trick

The reward model acts as a support-policy auditor comparing two draft replies. The auditor doesn't need to generate the reply; it needs to score the policy-correct, helpful answer above the vague or unsafe one.

The mathematical foundation is the Bradley-Terry model, which assumes the probability that a human prefers response $y_w$ over $y_l$ depends on their underlying "reward" scores:^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}^{[2]Reference 2Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290}

$P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$

This model captures the intuition that a larger quality gap gives us more confidence about which response a human would prefer.

A concrete example makes the preference probability visible. Suppose the reward model scores Response A at 2.0 and Response B at 0.5 for the same access-policy prompt. The probability that a human prefers A over B is:

$P(A \succ B) = \sigma(2.0 - 0.5) = \sigma(1.5) \approx 0.818$

That means the model predicts an 81.8% chance the human picks A. If the human picked A, the log-likelihood of that observation is $\log(0.818) \approx -0.201$ . If the human had instead picked B, the log-likelihood under the same scores would be $\log(1 - 0.818) = \log(0.182) \approx -1.704$ . Training maximizes this log-likelihood, so the loss pushes the reward model to assign higher probability to the observed choice.

reward-models-and-the-bradley-terry-trick.py

import math

def sigmoid(value: float) -> float:
    return 1 / (1 + math.exp(-value))

reward_a = 2.0
reward_b = 0.5
preference_probability = sigmoid(reward_a - reward_b)
negative_log_likelihood = -math.log(preference_probability)

print(f"P(A preferred over B) = {preference_probability:.1%}")
print(f"NLL = {negative_log_likelihood:.3f}")

Bradley-Terry output

P(A preferred over B) = 81.8%
NLL = 0.201

The reward model is trained to predict which of two responses a human would prefer. Minimizing the loss below maximizes the log-likelihood of the observed preferences:

$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(r(x, y_w) - r(x, y_l)\right)\right]$

where $y_w$ is the preferred (winning) response and $y_l$ is the rejected (losing) response for the same prompt $x$ .
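
The single-pair calculation extends to a batch by averaging the per-pair negative log-likelihoods. The sketch below assumes three hypothetical pairs of reward scores (the first reuses the 2.0 and 0.5 values from the worked example); it is an illustration of the loss, not a reward-model trainer.

```python
import math

def log_sigmoid(value: float) -> float:
    # Numerically stable log(sigmoid(value)).
    if value >= 0:
        return -math.log1p(math.exp(-value))
    return value - math.log1p(math.exp(value))

# Hypothetical (chosen_reward, rejected_reward) scores for three labeled pairs.
scored_pairs = [(2.0, 0.5), (0.3, 0.1), (-0.4, 0.9)]

# Per-pair loss: -log sigmoid(r(x, y_w) - r(x, y_l)).
pair_losses = [-log_sigmoid(r_w - r_l) for r_w, r_l in scored_pairs]
reward_model_loss = sum(pair_losses) / len(pair_losses)

for (r_w, r_l), pair_loss in zip(scored_pairs, pair_losses):
    print(f"r_w={r_w:+.1f} r_l={r_l:+.1f} pair_loss={pair_loss:.3f}")
print(f"mean_loss={reward_model_loss:.3f}")
```

The third pair, where the reward model scores the rejected response higher, contributes the largest loss, which is the pressure that reorders the scores during training.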

PPO and KL drift control

Once we have a reward model, we treat the language model as a policy $\pi_\theta$ . We use Proximal Policy Optimization (PPO), a reinforcement learning algorithm designed to make clipped, conservative policy updates^{[3]Reference 3Proximal Policy Optimization Algorithms.https://arxiv.org/abs/1707.06347}, to maximize the expected reward. The objective maximizes reward while penalizing deviation from the reference policy:

$\begin{aligned} \max_\theta\;& \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[r(x, y)\right] \\ &- \beta \cdot \mathbb{E}_{x \sim D} \left[\text{KL}(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x))\right] \end{aligned}$

The Kullback-Leibler (KL) term is a drift budget. The policy may seek higher learned reward, while deviations from the reference cost reward. This can reduce exposure to regions where the reward model is unreliable; it can't prove the reward is correct or prevent every exploitable shortcut. InstructGPT applies a per-token KL penalty to mitigate reward-model overoptimization.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}

To see why drift control matters, suppose the policy starts generating access-policy responses that begin with "Policy-safe answer: access guidance follows." If the reward model accidentally scores that opening higher because it correlates with politeness in training data, the policy can amplify it. A KL term charges departures from the reference, while held-out human checks still determine whether behavior improved.

Conceptually, one PPO-style RLHF step coordinates four roles. Real systems may shard or share them, but compute still has to cover all four:

Actor/Policy ( $\pi_\theta$ ): The model being trained
Reference ( $\pi_{\text{ref}}$ ): Frozen SFT model for KL penalty
Reward Model ( $r_\phi$ ): Frozen model that scores outputs
Critic/Value ( $V_\psi$ ): Predicts expected returns for advantage estimation

PPO-style RLHF diagram showing actor generating a sampled answer, with reference, reward model, and critic feeding KL, score, and value signals into same PPO update. — One PPO step is costly because rollout, KL anchor, reward score, and value estimate all feed the same update.

The calculation below isolates the shaped reward for sampled responses. It isn't a PPO trainer: a full implementation also estimates advantages, clips updates, and applies token-level masks.

kl_regularized_reward.py

samples = [
    {"name": "near_reference", "reward": 1.20, "policy_logp": -2.1, "ref_logp": -2.2},
    {"name": "high_drift", "reward": 1.35, "policy_logp": -1.2, "ref_logp": -2.8},
]
beta = 0.20

for sample in samples:
    sampled_log_ratio = sample["policy_logp"] - sample["ref_logp"]
    shaped_reward = sample["reward"] - beta * sampled_log_ratio
    print(f"{sample['name']}: raw={sample['reward']:.2f} shaped={shaped_reward:.2f}")

KL-shaped reward output

near_reference: raw=1.20 shaped=1.18
high_drift: raw=1.35 shaped=1.03

The higher raw score isn't automatically the preferred optimization target once policy drift is charged. In production, PPO implementations compute token-level KL penalties, estimate returns with a value model and often GAE, and separate rollout collection from policy updates. The system shape matters: PPO-style RLHF generates fresh responses and coordinates policy, reference, reward, and value roles.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}^{[3]Reference 3Proximal Policy Optimization Algorithms.https://arxiv.org/abs/1707.06347}
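
To make the token-level version concrete, the sketch below applies a per-token KL penalty, adds the sequence-level reward at the final token, and runs GAE backwards over one response. The log-probs, critic values, and the choice to place the scalar reward on the last token are assumptions for illustration; real trainers differ in these details.

```python
beta, gamma, lam = 0.2, 1.0, 0.95

# Hypothetical per-token log-probs for one 4-token sampled response.
policy_logps = [-1.0, -0.8, -1.5, -0.6]
ref_logps = [-1.1, -1.2, -1.4, -0.9]
values = [0.50, 0.60, 0.70, 0.90]  # assumed critic estimates V(s_t)
sequence_reward = 1.35             # reward-model score for the whole response

# Per-token KL penalty; the scalar reward is added only at the final token.
token_rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
token_rewards[-1] += sequence_reward

# Generalized advantage estimation, computed from the last token backwards.
advantages = [0.0] * len(token_rewards)
running = 0.0
for t in reversed(range(len(token_rewards))):
    next_value = values[t + 1] if t + 1 < len(values) else 0.0
    delta = token_rewards[t] + gamma * next_value - values[t]
    running = delta + gamma * lam * running
    advantages[t] = running

print("token_rewards:", [round(r, 3) for r in token_rewards])
print("advantages:", [round(a, 3) for a in advantages])
```

Summing the token rewards recovers the sequence-level shaped reward (raw score minus beta times the summed log-ratio), so this is the same quantity as before, spread across tokens.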

Why RLHF is hard to run

Despite its power, the RLHF pipeline introduces practical challenges that make it expensive to run and hard to debug at scale.

The first problem is systems overhead. Standard PPO-style RLHF needs actor, reference, reward, and critic/value roles during a training cycle. Sharding and offload change resident memory, but not the computation and coordination burden. Each rollout batch also begins with current-policy generation before gradient updates, so throughput becomes partly an inference problem.

Reinforcement learning also adds more knobs than supervised fine-tuning. PPO is sensitive to reward scale, the KL coefficient (beta), clipping threshold (epsilon), value-loss weighting, advantage estimation, and rollout quality. Weak drift control increases exposure to reward-model exploitation; overly strong control can suppress useful updates.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}^{[3]Reference 3Proximal Policy Optimization Algorithms.https://arxiv.org/abs/1707.06347}
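
The clipping threshold is easiest to see on a single sample. The sketch below evaluates the per-sample clipped surrogate from the PPO paper for hypothetical log-probs and advantages; the function name and inputs are illustrative, and a real trainer averages this over tokens and batches.

```python
import math

def clipped_surrogate(new_logp: float, old_logp: float, advantage: float,
                      epsilon: float = 0.2) -> float:
    # Probability ratio between the updated policy and the rollout policy.
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1 - epsilon, min(1 + epsilon, ratio))
    # PPO maximizes the more pessimistic of the two terms.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio well above 1 + epsilon with a positive advantage: the gain is capped.
capped_gain = clipped_surrogate(new_logp=-1.0, old_logp=-1.5, advantage=1.0)
# Same ratio with a negative advantage: the full penalty is kept.
full_penalty = clipped_surrogate(new_logp=-1.0, old_logp=-1.5, advantage=-1.0)

print(f"capped_gain={capped_gain:.3f}")
print(f"full_penalty={full_penalty:.3f}")
```

The asymmetry is the point: the objective stops rewarding moves beyond the clip range but never hides a move that made things worse.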

The reward model itself is a silent failure point. It was trained on a finite collection of human comparisons. When the improving policy starts generating responses outside that distribution, the proxy scores become unreliable. That's when overoptimization appears: the numeric reward keeps climbing while real human preference scores plateau or drop. In a policy-assistant setting, you might see the model start adding long compliance preambles or repetitive caveats because those patterns happened to score high in the limited preference data.

Issue	Description
Complexity	Four logical model roles (policy, reference, reward, value), with memory affected by sharding and offload
Tuning surface	PPO depends on clip ratio, learning rate, value-loss coefficients, reward scale, and sample quality
Reward hacking	Policy exploits reward model weaknesses when it generates out-of-distribution outputs
Coverage gaps	Policy can regress on behaviors that are weakly represented in the reward data
Cost	Human preference annotation at scale is expensive; PPO training is compute-intensive

DPO: the mathematical shortcut

PPO-style RLHF has a multi-role online loop. DPO instead targets fixed preference pairs with a simpler objective. It removes engineering components; it doesn't supply online exploration or guarantee the same outcome as every RLHF run.

Why DPO can train offline

At first glance, DPO looks like a completely different alignment recipe. The central mathematical result in the DPO paper is narrower: under a KL-regularized reward-maximization objective and a Bradley-Terry-style preference model, an implicit reward can be parameterized through policy-to-reference log-probability ratios.^{[2]Reference 2Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290} That relationship yields an offline pairwise loss without training a standalone reward model or running PPO during DPO training.

Plain English makes the move clearer. Under those modeling assumptions, an implicit reward is encoded by how policy log-probabilities change relative to a reference model. DPO can therefore train directly on labeled pairs. It learns only from the coverage and quality of those comparisons unless you build a separate data-refresh loop.

The implicit reward is:

$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$

Where $\pi^*$ is the optimal policy for the stated objective, $\pi_{\text{ref}}$ is the reference model, $\beta$ is tied to the KL-regularization trade-off and scales the DPO logit, and $Z(x)$ is a normalization term that depends only on prompt $x$ . DPO then trains a policy $\pi_\theta$ from pairwise preferences under this parameterization.^{[2]Reference 2Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290}
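
A small numeric check can make the role of $Z(x)$ visible. The sketch below assumes a hypothetical three-response space with made-up rewards and reference probabilities, builds the closed-form optimum $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ of the KL-regularized objective, and confirms that reward differences are recovered from log-ratios alone.

```python
import math

beta = 0.5
# Hypothetical response space for a single prompt.
ref_probs = {"A": 0.5, "B": 0.3, "C": 0.2}
rewards = {"A": 2.0, "B": 0.5, "C": -1.0}

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x).
unnormalized = {y: ref_probs[y] * math.exp(rewards[y] / beta) for y in ref_probs}
Z = sum(unnormalized.values())
optimal = {y: weight / Z for y, weight in unnormalized.items()}

def implicit_reward_term(y: str) -> float:
    # beta * log(pi*(y|x) / pi_ref(y|x)), i.e. the reward without beta * log Z(x).
    return beta * math.log(optimal[y] / ref_probs[y])

for y in ref_probs:
    recovered = implicit_reward_term(y) + beta * math.log(Z)
    print(f"{y}: reward={rewards[y]:+.2f} recovered={recovered:+.2f}")

difference = implicit_reward_term("A") - implicit_reward_term("B")
print(f"log-ratio reward difference A-B = {difference:.2f}")
```

Because $\beta \log Z(x)$ is the same for every response to a prompt, it drops out of any chosen-minus-rejected difference, which is why the pairwise loss never needs to compute it.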

DPO loss with a worked example

By substituting this relationship back into the preference loss, DPO derives a loss function that depends only on the policy and reference model:

$\begin{aligned} \mathcal{L}_{\text{DPO}} = -\mathbb{E}\Bigg[ \log \sigma\Bigg( &\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} \\ &- \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \Bigg) \Bigg] \end{aligned}$

Where $y_w$ is the preferred response and $y_l$ is the rejected response for the same prompt $x$ , and $\pi_{\text{ref}}$ is the reference model that provides the baseline distribution. The reference model defines the relative log-probability baseline and shapes the divergence trade-off; it isn't proof that outputs stay fluent, calibrated, or safe.

Trace the loss with actual numbers. Suppose for our access-policy prompt we have:

Chosen response (A): policy log-prob = -8.2, reference log-prob = -8.5
Rejected response (B): policy log-prob = -9.5, reference log-prob = -9.1
DPO scale / regularization parameter: $\beta = 0.1$

Step by step:

Log-ratio for chosen (policy minus reference): $-8.2 - (-8.5) = 0.3$
Log-ratio for rejected (policy minus reference): $-9.5 - (-9.1) = -0.4$
Relative margin: $0.3 - (-0.4) = 0.7$
Scaled by beta (the logit inside the sigmoid): $0.1 \times 0.7 = 0.07$
Sigmoid probability: $\sigma(0.07) \approx 0.517$
Loss: $-\log(0.517) \approx 0.659$

The loss pushes the policy to increase that 0.7 margin. If the policy becomes more confident on the chosen response (say, its log-prob rises to -7.5 while the rejected response stays put), the margin grows to 1.4, the scaled logit to 0.14, the sigmoid probability to about 0.535, and the loss falls to about 0.626. With $\beta = 0.1$ the loss approaches 0 only once the margin is in the tens. In gradient terms, the update increases the likelihood of the chosen answer relative to the reference while decreasing the likelihood of the rejected answer, weighted by $\sigma(-\text{logit})$: pairs that the implicit reward still ranks poorly receive larger updates.

dpo_margin_from_logprobs.py

from math import exp, log

chosen_policy, chosen_reference = -8.2, -8.5
rejected_policy, rejected_reference = -9.5, -9.1
beta = 0.1
margin = (chosen_policy - chosen_reference) - (rejected_policy - rejected_reference)
logit = beta * margin
loss = -log(1 / (1 + exp(-logit)))

print(f"relative_margin={margin:.1f}")
print(f"scaled_logit={logit:.2f}")
print(f"dpo_loss={loss:.3f}")

DPO margin output

relative_margin=0.7
scaled_logit=0.07
dpo_loss=0.659
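
The same arithmetic can be wrapped in a helper to see how the loss responds to a more confident policy or a larger beta. The function name and the alternative inputs below are illustrative assumptions; only the baseline numbers come from the worked example.

```python
from math import exp, log

def dpo_pair_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref, beta):
    margin = (chosen_policy - chosen_ref) - (rejected_policy - rejected_ref)
    logit = beta * margin
    loss = -log(1 / (1 + exp(-logit)))
    # Magnitude of d(loss)/d(margin): beta * sigmoid(-logit).
    grad_scale = beta / (1 + exp(logit))
    return margin, loss, grad_scale

baseline = dpo_pair_loss(-8.2, -8.5, -9.5, -9.1, beta=0.1)
more_confident = dpo_pair_loss(-7.5, -8.5, -9.5, -9.1, beta=0.1)
larger_beta = dpo_pair_loss(-8.2, -8.5, -9.5, -9.1, beta=1.0)

for name, (margin, loss, grad_scale) in [
    ("baseline", baseline),
    ("more_confident", more_confident),
    ("larger_beta", larger_beta),
]:
    print(f"{name}: margin={margin:.1f} loss={loss:.3f} grad_scale={grad_scale:.3f}")
```

With a small beta, large changes in log-probability move the loss only slightly, which is one reason beta is tuned together with the learning rate rather than in isolation.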

The implementation of the DPO loss is straightforward. The runnable example below uses a tiny causal language model so the tensor shapes are real without downloading an external checkpoint. A production implementation would add padding masks, attention masks, distributed training, and tokenizer-specific details.

dpo-loss-with-a-worked-example.py

import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

class TinyCausalLM(nn.Module):
    def __init__(self, vocab_size: int = 16, hidden_size: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> SimpleNamespace:
        hidden = self.embedding(input_ids)
        return SimpleNamespace(logits=self.projection(hidden))

def compute_log_probs(model, prompt_ids, response_ids) -> torch.Tensor:
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits
    response_logits = logits[:, prompt_ids.shape[-1] - 1:-1, :]
    log_probs = F.log_softmax(response_logits, dim=-1)
    token_log_probs = torch.gather(
        log_probs, dim=-1, index=response_ids.unsqueeze(-1)
    ).squeeze(-1)
    return token_log_probs.sum(dim=-1)

def dpo_loss(
    policy_model: torch.nn.Module,
    ref_model: torch.nn.Module,
    batch: dict[str, torch.Tensor],
    beta: float = 0.1
) -> torch.Tensor:
    pi_w = compute_log_probs(policy_model, batch["prompt"], batch["chosen"])
    pi_l = compute_log_probs(policy_model, batch["prompt"], batch["rejected"])

    with torch.no_grad():
        ref_w = compute_log_probs(ref_model, batch["prompt"], batch["chosen"])
        ref_l = compute_log_probs(ref_model, batch["prompt"], batch["rejected"])

    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

torch.manual_seed(0)
policy = TinyCausalLM()
reference = TinyCausalLM()
reference.load_state_dict(policy.state_dict())
for param in reference.parameters():
    param.requires_grad_(False)

batch = {
    "prompt": torch.tensor([[1, 2], [1, 3]]),
    "chosen": torch.tensor([[4, 5], [4, 6]]),
    "rejected": torch.tensor([[7, 8], [7, 9]]),
}

with torch.no_grad():
    policy.projection.bias[5] += 1.0
    policy.projection.bias[6] += 1.0

loss = dpo_loss(policy, reference, batch)
loss_is_scalar = loss.ndim == 0
loss_is_finite = bool(torch.isfinite(loss))

loss.backward()
# Sum of absolute gradient entries (an L1 norm), used here only as a nonzero-gradient check.
grad_norm = sum(
    param.grad.abs().sum().item()
    for param in policy.parameters()
    if param.grad is not None
)

print(f"DPO loss: {loss.item():.3f}")
print(f"policy_grad_norm: {grad_norm:.3f}")
print(f"loss_is_scalar={loss_is_scalar}")
print(f"loss_is_finite={loss_is_finite}")
print(f"grad_norm_positive={grad_norm > 0}")

DPO loss output

DPO loss: 0.648
policy_grad_norm: 1.784
loss_is_scalar=True
loss_is_finite=True
grad_norm_positive=True

Visible feedback: At initialization, a policy copied from its reference produces zero logits and a loss of about 0.693. The example above prints 0.648 only because two policy bias entries were nudged before the loss was computed. After training begins, monitor margins together with held-out preference quality and output regressions. Near-zero logits alone don't diagnose weak pairs or a bad beta.

In production, DPO implementations also mask padding tokens and explicitly track the chosen-vs-rejected log-probability margin during training. Teams usually monitor response length too, because whole-sequence log-probability sums can create length bias when chosen and rejected responses have systematically different lengths.
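
The length effect can be seen without a model. In the sketch below, the per-token log-ratios, padding values, and masks are all hypothetical: the policy shifts every real token by the same amount relative to the reference, yet the summed margin is positive purely because the chosen response is longer.

```python
def masked_sum(token_values, mask):
    # Sum only the positions that hold real tokens.
    return sum(value for value, keep in zip(token_values, mask) if keep)

def masked_mean(token_values, mask):
    return masked_sum(token_values, mask) / sum(mask)

# Per-token log-ratios (policy minus reference), padded to length 6.
chosen_log_ratios = [0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
chosen_mask = [1, 1, 1, 1, 1, 1]
rejected_log_ratios = [0.02, 0.02, 0.5, 0.5, 0.5, 0.5]  # last four are padding junk
rejected_mask = [1, 1, 0, 0, 0, 0]

summed_margin = masked_sum(chosen_log_ratios, chosen_mask) - masked_sum(
    rejected_log_ratios, rejected_mask
)
mean_margin = masked_mean(chosen_log_ratios, chosen_mask) - masked_mean(
    rejected_log_ratios, rejected_mask
)
unmasked_rejected = sum(rejected_log_ratios)

print(f"summed_margin={summed_margin:.2f}")
print(f"per_token_mean_margin={mean_margin:.2f}")
print(f"unmasked_rejected_sum={unmasked_rejected:.2f}")
```

The unmasked sum also shows why padding masks matter: without them, padding positions would dominate the rejected log-ratio. Whether to normalize by length is a design choice to validate, not a default fix.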

DPO training flow

Vanilla DPO training is operationally simpler than PPO-style RLHF. It applies an offline pairwise loss instead of coordinating rollouts, a learned reward model, and a value model. A reference model provides baseline log-probabilities while the active policy is updated from preference data.

Diagram showing Preference Data (prompt, chosen, rejected), DPO Training (standard SFT-like), Reference Model (frozen SFT), and Aligned Model.

DPO turns preference alignment into a supervised pairwise classification loss. You still need a frozen reference model and high-quality chosen/rejected pairs, but you no longer train a separate reward model or value function.

Comparing RLHF and DPO

To better understand the trade-offs between the traditional RLHF pipeline and DPO, compare what each method optimizes and what it has to keep alive during training.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}^{[2]Reference 2Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290}

Comparison of RLHF and DPO alignment loops: RLHF keeps policy rollouts, reward scoring, and policy updates in a live loop, while DPO trains from fixed prompt, chosen, and rejected pairs against a frozen reference model. — Compare what stays live during training: RLHF keeps sampling, reward scoring, and policy updates in a loop, while DPO turns fixed preference pairs plus a frozen reference model into one pairwise loss.

Feature	RLHF (PPO)	DPO
Reward model	Train a separate reward model	No separate reward model; reward is implicit in the objective
Training loop	On-policy RL with rollouts	Offline pairwise loss on fixed comparisons
Logical model roles	Policy + ref + reward + critic	Policy + reference
Most expensive step	Sampling and scoring fresh responses	Forward/backward passes on fixed preference pairs
Stability	Sensitive to reward scale, KL weight, and PPO settings	Fewer interacting loops, but still sensitive to data and `beta`
Online exploration	Yes	No, not in vanilla DPO
Data dependence	Can keep collecting fresh comparisons during training	Limited by current preference coverage unless data is refreshed
Monitoring need	Reward drift, KL drift, value loss, rollout quality	Preference loss, margin growth, length bias

DPO lowers system complexity and is a practical candidate baseline for offline pairwise preferences. That doesn't mean PPO is obsolete. If a team needs online exploration, fresh model-generated negatives, or a reward signal that changes as the model improves, RL-style alignment can handle cases fixed-dataset DPO can't.

Preference data: the foundation

Whether a team uses RLHF or DPO, preference-signal quality sets a major performance ceiling. In practice, collecting and cleaning that signal is often the slowest and most expensive part of the pipeline. In the pairwise setup taught here, both methods consume chosen/rejected comparisons, but they use them differently: RLHF trains a reward model on those comparisons, while DPO optimizes the policy directly from them.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}^{[2]Reference 2Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290}

Data requirements

Regardless of the specific algorithm chosen, the foundation of preference optimization relies on high-quality preference data. A usable row needs more than chosen and rejected: it needs the same rendered prompt context, a clear non-tie label, and split provenance that prevents related comparisons leaking into evaluation.

data-requirements.py

preference_example = {
    "prompt_id": "access-review-17",
    "prompt": "A user requests temporary admin access for a migration.",
    "chosen": "Open an access-review ticket and cite policy P-7 before approval.",
    "rejected": "Grant admin for tonight and ask the user to clean it up tomorrow.",
    "label": "chosen"
}

required_fields = {"prompt_id", "prompt", "chosen", "rejected", "label"}
fields_present = required_fields <= preference_example.keys()
distinct_responses = preference_example["chosen"] != preference_example["rejected"]
binary_label = preference_example["label"] in {"chosen", "rejected"}

print(f"fields_present={fields_present}")
print(f"distinct_responses={distinct_responses}")
print(f"binary_label={binary_label}")
print("next_gate=group prompt_id before train/eval split")

Preference triple output

fields_present=True
distinct_responses=True
binary_label=True
next_gate=group prompt_id before train/eval split

Annotation quality guidelines

Poorly annotated data can lead to undesirable behavior or failure to generalize the intended policy. That's why preference annotation needs a clear rubric, disagreement handling, and held-out checks.

Factor	Good Practice	Bad Practice
Annotator agreement	Clear rubric, adjudication on disagreements	Ambiguous rubric with unresolved disagreement
Preference clarity	Keep labeler-consistent judgments, including hard pairs	Treat unresolved ties or disagreement as clean binary labels
Diversity	Cover edge cases, refusals, creativity	Only easy instructions
Source and split provenance	Record generators; group related pairs by prompt before splitting	Mix related comparisons across train and evaluation

Preference data quality flow showing raw pairs reviewed through rubric clarity, agreement, diversity, and drift checks before splitting into training signal versus annotation artifacts. — Preference data quality is a filtering problem. Clear rubrics, grouped splits, diverse cases, and fresh-policy checks separate usable signal from annotation artifacts.

Preference data from another generator isn't automatically invalid. The risk is coverage mismatch: labels may compare styles the target policy rarely produces while missing its real failure modes. Record each generator, evaluate on fresh target-policy outputs, and include current-policy samples when shift matters.

prompt_group_split.py

pairs = [
    {"prompt_id": "p1", "pair_id": "a-b"},
    {"prompt_id": "p1", "pair_id": "a-c"},
    {"prompt_id": "p2", "pair_id": "d-e"},
]
eval_prompts = {"p1"}
train_prompts = {row["prompt_id"] for row in pairs if row["prompt_id"] not in eval_prompts}
overlap = train_prompts & eval_prompts

assert not overlap
print(f"eval_pairs={sum(row['prompt_id'] in eval_prompts for row in pairs)}")
print(f"train_pairs={sum(row['prompt_id'] in train_prompts for row in pairs)}")
print(f"prompt_overlap={sorted(overlap)}")

Prompt-group split output

eval_pairs=2
train_pairs=1
prompt_overlap=[]

While data quality matters more than raw count, dataset size still changes what the model can learn.

What scale changes

A narrow domain can improve from a relatively small, carefully curated preference set.
A general assistant needs much broader coverage across refusals, tool use, reasoning, style, and failure cases.
If annotators keep disagreeing, more labels can quantify uncertainty, but they won't turn an ambiguous rubric into clear training direction.
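
One simple way to quantify that disagreement is a per-pair majority agreement rate. The votes, pair IDs, and the 0.75 threshold below are hypothetical; the sketch only shows how unresolved pairs can be routed to adjudication instead of being treated as clean binary labels.

```python
from collections import Counter

# Hypothetical annotator votes for three preference pairs.
votes = {
    "pair-1": ["chosen", "chosen", "chosen"],
    "pair-2": ["chosen", "rejected", "chosen"],
    "pair-3": ["chosen", "rejected", "rejected", "chosen"],
}
min_agreement = 0.75  # assumed threshold for keeping a label without review

decisions = {}
for pair_id, labels in votes.items():
    majority_label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    status = "keep" if agreement >= min_agreement else "adjudicate"
    decisions[pair_id] = (majority_label, agreement, status)
    print(f"{pair_id}: majority={majority_label} agreement={agreement:.2f} -> {status}")
```

Raw agreement ignores chance agreement and annotator identity, so a production rubric review would likely use a stronger statistic alongside it.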

When alignment backfires

The most subtle failure mode in alignment is reward hacking: the model learns to exploit reward model weaknesses rather than genuinely improving. This happens because the reward model is merely a proxy for human preference, not a perfect representation of it.

When the language model is optimized against this proxy, it can discover edge cases where the proxy assigns high scores to degenerate outputs. The proxy becomes the target, and the target can stop representing true preference.

For example, if the reward model overweights length or surface formatting, policy optimization can amplify those traits faster than actual answer quality. The visible symptom is rising proxy reward without a matching improvement in held-out human preference.

Reward Hack	Symptom	Detection
Length gaming	Responses get progressively longer	Track avg response length during training
Sycophancy	Model agrees with user even when wrong	Test with incorrect user statements
Format exploitation	Model wraps everything in markdown/lists for higher scores	Compare formatted vs plain-text scores
Confidence mimicry	Model sounds confident even when uncertain	Test calibration on known-hard questions

The verbosity bias

Aligned models can become wordy when human annotators equate length with quality. A long, rambling access-policy explanation might score higher than a concise, accurate one because it looks more thorough. If your reward data has this bias, the policy learns to pad responses.

Symptom: Average response length grows steadily during training while helpfulness scores plateau.
Fix: Add length-matched preference evaluations and annotate concise versus padded answers explicitly. Length penalties or normalization are interventions to validate, because they can also punish necessary detail.
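
A length-matched check can be as small as comparing the overall win rate with the win rate on pairs of similar length. The comparison rows and the 20-token tolerance below are hypothetical.

```python
# Hypothetical held-out comparisons of the new policy against a baseline.
# length_delta = new-policy tokens minus baseline tokens.
comparisons = [
    {"new_policy_won": True, "length_delta": 120},
    {"new_policy_won": True, "length_delta": 95},
    {"new_policy_won": True, "length_delta": 80},
    {"new_policy_won": False, "length_delta": 60},
    {"new_policy_won": True, "length_delta": 15},
    {"new_policy_won": False, "length_delta": -5},
    {"new_policy_won": False, "length_delta": 10},
    {"new_policy_won": False, "length_delta": 0},
]
max_length_gap = 20  # assumed tolerance, in tokens

def win_rate(rows):
    return sum(row["new_policy_won"] for row in rows) / len(rows)

# Keep only comparisons where both answers are about the same length.
matched = [row for row in comparisons if abs(row["length_delta"]) <= max_length_gap]
overall_win_rate = win_rate(comparisons)
matched_win_rate = win_rate(matched)

print(f"overall_win_rate={overall_win_rate:.2f}")
print(f"length_matched_win_rate={matched_win_rate:.2f} (n={len(matched)})")
```

A large gap between the two rates suggests that wins are riding on length rather than content, though a sample this small would need far more rows before drawing conclusions.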

Ambiguous labels and the `0.693` trap

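If the policy begins equal to its reference, every DPO relative margin starts at zero and the loss starts at -log(0.5) ≈ 0.693. That's expected, not evidence that an individual pair provides no gradient. At zero logit, the gradient of the pair loss with respect to the logit is -0.5, larger in magnitude than for any pair the policy already orders correctly (positive logit); only pairs ordered the wrong way (negative logit) receive larger gradients.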

Ambiguous pairs still hurt: if different annotators or duplicate prompts point in conflicting directions, their updates can cancel in aggregate or teach arbitrary style preferences. Diagnose data quality from agreement, slices, and held-out behavior, not from initial loss alone.

dpo_zero_margin_is_not_zero_gradient.py

from math import exp, log

def sigmoid(value):
    return 1 / (1 + exp(-value))

for logit in [0.0, 2.0]:
    loss = -log(sigmoid(logit))
    gradient = sigmoid(logit) - 1
    print(f"logit={logit:.1f} loss={loss:.3f} gradient={gradient:.3f}")

DPO zero-margin output

logit=0.0 loss=0.693 gradient=-0.500
logit=2.0 loss=0.127 gradient=-0.119

Symptom: Loss remains near 0.693 after meaningful training and held-out preference metrics don't improve.
Fix: Check that the policy is updating, then inspect contradictory labels, unresolved ambiguity, prompt leakage, and missing task coverage before changing beta.
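
Contradictory labels explain how a loss can sit near 0.693 even though every pair has a nonzero gradient. The sketch below assumes one response pair (A, B) labeled several times at zero margin and sums the gradients with respect to the A-minus-B margin; the function name and label sets are illustrative.

```python
from math import exp

def sigmoid(value: float) -> float:
    return 1 / (1 + exp(-value))

def grad_wrt_ab_margin(margin_ab: float, label: str, beta: float = 0.1) -> float:
    """Gradient of one DPO pair loss w.r.t. margin_ab = log-ratio(A) - log-ratio(B)."""
    sign = 1.0 if label == "A" else -1.0  # preferring B flips the margin's sign
    logit = beta * sign * margin_ab
    return -beta * sign * sigmoid(-logit)

split_labels = ["A", "A", "B", "B"]       # annotators evenly divided
consistent_labels = ["A", "A", "A", "B"]  # mostly consistent

split_total = sum(grad_wrt_ab_margin(0.0, label) for label in split_labels)
consistent_total = sum(grad_wrt_ab_margin(0.0, label) for label in consistent_labels)

print(f"split_labels_total_gradient={split_total:.3f}")
print(f"consistent_labels_total_gradient={consistent_total:.3f}")
```

Each individual pair contributes a gradient of magnitude 0.05 here, but the evenly split set sums to zero, so the policy receives no net direction from it.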

Mitigation strategies

Use several defensive techniques to keep proxy improvement tied to real behavior:

Hold out human eval data: Check whether higher proxy reward also improves real human preference wins
Track auxiliary metrics: Monitor length, refusal rate, formatting drift, and calibration rather than trusting one scalar score
KL monitoring and stop rules: Track divergence from the reference model, tune or adapt the KL coefficient, and stop runs that move outside the tested drift budget
Refresh the scorer: Re-label or retrain the reward signal when current policy outputs drift too far from the scorer's training data

alignment_release_gate.py

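# Hypothetical deltas between the candidate model and the current baseline.
metrics = {
    "proxy_reward_delta": 0.31,
    "held_out_human_win_delta": -0.04,
    "mean_length_delta_tokens": 48,
}
# Gate on held-out human preference and bounded length drift, never on proxy reward alone.
MAX_LENGTH_DRIFT_TOKENS = 20
ready = (
    metrics["held_out_human_win_delta"] > 0
    and metrics["mean_length_delta_tokens"] < MAX_LENGTH_DRIFT_TOKENS
)

print(f"proxy_improved={metrics['proxy_reward_delta'] > 0}")
print(f"human_preference_improved={metrics['held_out_human_win_delta'] > 0}")
print(f"release_ready={ready}")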

Alignment release gate output

proxy_improved=True
human_preference_improved=False
release_ready=False

Figure: Alignment diagnostics showing proxy reward rising while held-out human preference falls, plus opposing DPO pair gradients that nearly cancel. Healthy-looking reward can hide two failures: proxy and outcome can diverge, and contradictory labels can cancel pair updates. Release gates need held-out outcomes and pair-level checks.
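The KL stop rule from the mitigation list can be sketched in a few lines. This is a minimal illustration, assuming you already have per-token log-probabilities for responses sampled from the current policy, scored under both the policy and the frozen reference. The function names, the sample numbers, and the `KL_BUDGET_NATS` threshold are all hypothetical.

```python
def sequence_kl_estimate(policy_logprobs, reference_logprobs):
    """Monte Carlo estimate of KL(policy || reference) from one sampled response:
    the sum over tokens of log pi(token) - log pi_ref(token).
    Only meaningful when the response was sampled from the policy itself."""
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))

def check_drift(samples, kl_budget):
    """Return the mean KL estimate and whether it exceeds the tested budget."""
    estimates = [sequence_kl_estimate(p, r) for p, r in samples]
    mean_kl = sum(estimates) / len(estimates)
    return mean_kl, mean_kl > kl_budget

# Hypothetical (policy, reference) per-token log-probs for two sampled responses.
samples = [
    ([-0.2, -0.5, -0.1], [-0.9, -1.4, -0.6]),
    ([-0.3, -0.4], [-1.0, -1.5]),
]
KL_BUDGET_NATS = 1.0  # hypothetical drift budget validated in earlier runs

mean_kl, stop = check_drift(samples, KL_BUDGET_NATS)
print(f"mean_kl={mean_kl:.2f} stop={stop}")
```

A real run would average over many more samples and log the estimate per evaluation step. The point of the sketch is only that the stop rule is a comparison against a budget tested beforehand, not a value tuned after the fact.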

Online DPO and iterative alignment

Vanilla DPO is an offline algorithm: it trains on a fixed dataset of chosen and rejected responses.^{[2]Reference 2Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290} As the policy improves, those old comparisons become less representative of what the current model produces.

Later work studies an online alternative: sample fresh responses from the current policy, label those pairs with humans or a learned preference model, then apply an IPO-style update, or a closely related pairwise preference loss, on the new comparisons.^{[4]Reference 4Human Alignment of Large Language Models through Online Preference Optimisation.https://arxiv.org/abs/2403.08635}

Figure: Online preference data loop. Prompts dataset, then sample responses from the current policy, then a human or preference model chooses the winner, then build a fresh (chosen, rejected) pair.

Conceptually, an online DPO loop has five steps:

Sample two or more candidate responses from the current policy for each prompt.
Ask humans or a preference model to choose the better candidate.
Convert the result into a fresh (prompt, chosen, rejected) triple.
Run a DPO-style update using the chosen reference-policy strategy.
Repeat with new policy outputs so the data distribution follows the model as it changes.
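The five steps above can be sketched with a toy policy. This is not any published algorithm: it assumes a single prompt, four canned responses, a softmax policy over one logit per response, a frozen initial policy as the reference, and a stand-in judge that reads fixed `QUALITY` scores. All names and constants are hypothetical.

```python
import random
from math import exp

QUALITY = [0.1, 0.4, 0.7, 0.9]  # hypothetical judge scores for four canned responses
BETA, LEARNING_RATE, ROUNDS = 0.5, 1.0, 200

def softmax(logits):
    peak = max(logits)
    weights = [exp(value - peak) for value in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

def sigmoid(value):
    return 1 / (1 + exp(-value))

def sample_pair(rng, logits):
    # Step 1: sample two distinct candidates from the current policy.
    probs = softmax(logits)
    indices = range(len(probs))
    first = rng.choices(indices, weights=probs)[0]
    rest = [i for i in indices if i != first]
    second = rng.choices(rest, weights=[probs[i] for i in rest])[0]
    return first, second

def dpo_step(logits, ref_logits, chosen, rejected):
    # Step 4: one gradient step on -log(sigmoid(beta * margin)).
    # For a softmax over fixed responses the normalizers cancel, so each
    # log-prob difference reduces to a logit difference.
    margin = (logits[chosen] - ref_logits[chosen]) - (logits[rejected] - ref_logits[rejected])
    push = LEARNING_RATE * BETA * (1 - sigmoid(BETA * margin))
    logits[chosen] += push
    logits[rejected] -= push

rng = random.Random(0)
ref_logits = [0.0] * len(QUALITY)  # frozen reference: the initial policy
logits = list(ref_logits)

for _ in range(ROUNDS):  # Step 5: repeat with fresh on-policy samples
    a, b = sample_pair(rng, logits)
    # Steps 2 and 3: the stand-in judge picks a winner, giving a fresh triple.
    chosen, rejected = (a, b) if QUALITY[a] > QUALITY[b] else (b, a)
    dpo_step(logits, ref_logits, chosen, rejected)

probs = softmax(logits)
print("final policy:", [round(p, 3) for p in probs])
```

Because pairs are drawn from the current policy, comparisons shift toward the stronger responses as training proceeds, which is the "harder negatives" effect. A frozen reference is only one of the reference-policy strategies the loop could use.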

There isn't one canonical "online DPO" algorithm, and reference-policy update choices differ across methods. The family resemblance matters: refresh preference data on-policy while keeping a pairwise preference loss instead of a full PPO loop.^{[4]Reference 4Human Alignment of Large Language Models through Online Preference Optimisation.https://arxiv.org/abs/2403.08635}

Online DPO advantages

Reduced distribution mismatch: The model learns from its own current outputs, not stale comparisons alone.
Harder negatives over time: As the policy improves, the rejected samples become stronger and more informative.
Keeps the DPO loss: You retain a simpler pairwise objective instead of moving all the way to PPO.

Beyond vanilla DPO

While DPO is the standard offline baseline, it still requires high-quality pairwise preference data $(y_w, y_l)$. That data is expensive to collect and often noisy, especially when annotators disagree or the difference between two good answers is subtle. Newer methods attempt to relax this requirement, improve the optimization objective, or combine training stages to reduce pipeline complexity.

From that broader design space, several practical variants and adjacent post-training methods have become common discussion points:

Method	Key Difference
IPO (Identity Preference Optimization)	Replaces DPO's log-sigmoid loss with a squared loss on the preference margin, so it can't drive the margin to infinity. This adds explicit regularization against overfitting when preference labels are nearly deterministic.^{[5]Reference 5A General Theoretical Paradigm to Understand Learning from Human Feedback.https://arxiv.org/abs/2310.12036}
KTO (Kahneman-Tversky Optimization)	Eliminates the need for pairs entirely. It works with binary feedback (thumbs up/down) for individual outputs, treating alignment as maximizing the utility of "good" outputs while minimizing "bad" ones.^{[6]Reference 6KTO: Model Alignment as Prospect Theoretic Optimization.https://arxiv.org/abs/2402.01306}
ORPO (Odds Ratio Preference Optimization)	Combines SFT and preference alignment into a single training stage, removing the need for a separate reference model.^{[7]Reference 7ORPO: Monolithic Preference Optimization without Reference Model.https://arxiv.org/abs/2403.07691}
SimPO (Simple Preference Optimization)	Drops the reference model and uses length-normalized average log-probability as the implicit reward, plus a target margin. Its objective directly addresses sequence-length dependence; the authors tune a larger `beta` range than DPO in their experiments.^{[8]Reference 8SimPO: Simple Preference Optimization with a Reference-Free Rewardhttps://arxiv.org/abs/2405.14734}
GRPO (Group Relative Policy Optimization)	Not an offline preference-optimization loss. It's an online RL method that compares sampled outputs for the same prompt and uses group-relative advantages instead of a learned critic. Reward signals may be programmatic or learned; DeepSeek-R1 is a prominent verifier-heavy reasoning example.^{[9]Reference 9DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}^{[10]Reference 10GRPO Trainer.https://huggingface.co/docs/trl/grpo_trainer}

IPO, KTO, ORPO, and SimPO all still live in the offline preference-optimization family. GRPO sits on a different branch: it's online reinforcement learning, not a DPO variant. Programmatic correctness checks are one high-value reward source, not a requirement of the algorithm.

For teams optimizing subjective assistant behavior from fixed pairs, DPO is a useful baseline candidate. Binary feedback rather than strict pairs points toward KTO. Tight memory budgets make ORPO and SimPO attractive because they avoid a separate frozen reference model. When DPO overfits near-deterministic labels, IPO tests a bounded-margin alternative. If the task has strong programmatic verifiers, online RL methods such as GRPO become worth evaluating.
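To see how the per-pair objectives differ, here is a minimal sketch comparing DPO, IPO, and SimPO losses on hand-picked numbers. It assumes the standard published forms as commonly stated: DPO as the negative log-sigmoid of `beta * margin`, IPO as the squared distance between the margin and `1 / (2 * tau)`, and SimPO as the negative log-sigmoid of the length-normalized log-probability gap minus a target margin `gamma`. All numeric values are hypothetical, and KTO and ORPO are left out for brevity.

```python
from math import exp, log

def neg_log_sigmoid(x):
    # Numerically stable -log(sigmoid(x)).
    return log(1 + exp(-x)) if x >= 0 else -x + log(1 + exp(x))

def dpo_loss(margin, beta):
    # margin = (chosen log-ratio vs reference) - (rejected log-ratio vs reference)
    return neg_log_sigmoid(beta * margin)

def ipo_loss(margin, tau):
    # Squared loss pulls the margin toward the finite target 1 / (2 * tau).
    return (margin - 1 / (2 * tau)) ** 2

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected, beta, gamma):
    # Reference-free: average per-token log-prob is the implicit reward.
    avg_gap = logp_chosen / len_chosen - logp_rejected / len_rejected
    return neg_log_sigmoid(beta * avg_gap - gamma)

for margin in [0.0, 5.0, 50.0]:
    print(
        f"margin={margin:5.1f} "
        f"dpo={dpo_loss(margin, beta=0.1):.4f} "
        f"ipo={ipo_loss(margin, tau=0.1):.1f}"
    )

# A short chosen answer against a long rejected one: the summed log-prob gap
# favors chosen, but the per-token average favors rejected.
sum_gap = -12.0 - (-30.0)
simpo = simpo_loss(-12.0, 10, -30.0, 40, beta=2.0, gamma=0.5)
print(f"sum_gap={sum_gap:.1f} simpo_loss={simpo:.3f}")
```

The DPO loss keeps shrinking as the margin grows, so near-deterministic labels can keep pushing it outward. The IPO loss is minimized at a finite margin and penalizes overshoot. The SimPO example shows why length normalization changes which response the objective treats as preferred.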

grpo_group_relative_advantage.py

from statistics import mean, pstdev

def group_advantages(rewards):
    center = mean(rewards)
    scale = pstdev(rewards)
    if scale == 0:
        return center, scale, [0.0 for _ in rewards]
    return center, scale, [(reward - center) / scale for reward in rewards]

reward_groups = {
    "mixed": [1.0, 0.0, 0.5, 1.0],
    "all_equal": [1.0, 1.0, 1.0, 1.0],
}

for name, rewards in reward_groups.items():
    center, scale, advantages = group_advantages(rewards)
    print(
        f"{name}: mean={center:.3f} std={scale:.3f} "
        f"advantages={[round(value, 3) for value in advantages]}"
    )

Group-relative advantage output

mixed: mean=0.625 std=0.415 advantages=[0.905, -1.508, -0.302, 0.905]
all_equal: mean=1.000 std=0.000 advantages=[0.0, 0.0, 0.0, 0.0]

This simplified calculation needs the zero-variance guard. If every sampled answer receives the same reward, the group has no relative learning signal and its variance is zero. Current TRL exposes `frac_reward_zero_std`, the fraction of generation-batch samples whose reward standard deviation is zero. A high value means the rewards aren't distinguishing among sampled answers for many prompts, so inspect verifier coverage or sampling before assuming the optimizer is learning from useful comparisons.^{[10]Reference 10GRPO Trainer.https://huggingface.co/docs/trl/grpo_trainer}
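A plain-Python analogue of that diagnostic is easy to compute on logged rewards. The sketch below is not TRL's implementation: it counts groups rather than samples (the two agree when every group has the same size), and the function name and `tolerance` parameter are hypothetical.

```python
from statistics import pstdev

def frac_zero_std(reward_groups, tolerance=1e-8):
    """Fraction of prompt groups whose sampled rewards are all effectively equal."""
    groups = list(reward_groups)
    zero_groups = sum(1 for rewards in groups if pstdev(rewards) <= tolerance)
    return zero_groups / len(groups)

# Hypothetical rewards: one inner list per prompt, one entry per sampled answer.
groups = [
    [1.0, 0.0, 0.5, 1.0],  # informative
    [1.0, 1.0, 1.0, 1.0],  # every answer passes: no relative signal
    [0.0, 0.0, 0.0, 0.0],  # every answer fails: no relative signal
    [0.0, 1.0, 0.0, 1.0],  # informative
]

print(f"frac_zero_std={frac_zero_std(groups):.2f}")
```

Groups where every answer passes and groups where every answer fail look identical to this metric, so it helps to log the group mean alongside it to tell "too easy" from "too hard".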

Modern post-training family map

These names get mixed together too easily. Some are optimizers, some are reward-model families, and some are full pipeline patterns.

Family	Type	Training signal	Best fit
DPO	offline preference optimization	chosen vs rejected pairs	baseline candidate for offline subjective preference pairs
IPO	offline preference optimization	chosen vs rejected pairs plus squared-loss margin	when preference labels are near-deterministic and DPO overfits^{[5]Reference 5A General Theoretical Paradigm to Understand Learning from Human Feedback.https://arxiv.org/abs/2310.12036}
ORPO	offline preference optimization	chosen vs rejected pairs plus odds-ratio objective	when you want SFT and preference learning in one stage^{[7]Reference 7ORPO: Monolithic Preference Optimization without Reference Model.https://arxiv.org/abs/2403.07691}
KTO	offline preference optimization	binary good/bad labels	when telemetry or moderation labels exist but strict pairs don't^{[6]Reference 6KTO: Model Alignment as Prospect Theoretic Optimization.https://arxiv.org/abs/2402.01306}
SimPO	offline preference optimization	chosen vs rejected pairs, reference-free, length-normalized	when reference-model memory is tight and length bias is a concern^{[8]Reference 8SimPO: Simple Preference Optimization with a Reference-Free Rewardhttps://arxiv.org/abs/2405.14734}
GRPO	online RL	grouped sampled outputs plus reward functions or models	tasks where online sampled comparison pays off, often with checkable outcomes^{[9]Reference 9DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}^{[10]Reference 10GRPO Trainer.https://huggingface.co/docs/trl/grpo_trainer}
ORM	reward-model family	one learned scalar score on final answer	when outcome-level reward labels are the appropriate signal
PRM	reward-model family	scalar scores on intermediate reasoning steps	search or long-chain reasoning where early bad steps should be pruned^{[11]Reference 11Let's Verify Step by Step.https://arxiv.org/abs/2305.20050}

Two clarifications keep the taxonomy clean:

PRM and ORM aren't alternatives to DPO in the same sense as ORPO or KTO. They are scorers that can feed RLHF or reasoning-time search loops.
GRPO isn't a DPO variant. It's online reinforcement learning that uses group-relative advantages instead of a learned critic.^{[9]Reference 9DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}
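The scorer distinction can be illustrated with a small sketch. It assumes hypothetical scores for two candidate solutions and aggregates PRM step scores by taking their product, which is one common choice rather than the only one. The candidate data, function names, and pruning threshold are made up for illustration.

```python
from math import prod

# Hypothetical scorer outputs for two candidate solutions to the same problem.
candidates = {
    "solution_a": {"final_score": 0.80, "step_scores": [0.95, 0.30, 0.90]},
    "solution_b": {"final_score": 0.70, "step_scores": [0.90, 0.85, 0.88]},
}

def orm_score(candidate):
    # ORM: one scalar for the finished answer.
    return candidate["final_score"]

def prm_score(candidate):
    # PRM: per-step scores, aggregated here as a product so that a single
    # weak step drags the whole solution down.
    return prod(candidate["step_scores"])

def first_weak_step(candidate, threshold=0.5):
    # A search loop could prune at the first step scoring below the threshold.
    for index, score in enumerate(candidate["step_scores"]):
        if score < threshold:
            return index
    return None

best_by_orm = max(candidates, key=lambda name: orm_score(candidates[name]))
best_by_prm = max(candidates, key=lambda name: prm_score(candidates[name]))

print(f"ORM picks {best_by_orm}; PRM picks {best_by_prm}")
for name, candidate in candidates.items():
    print(f"{name}: prm={prm_score(candidate):.4f} prune_at={first_weak_step(candidate)}")
```

Both scorers produce numbers that a policy optimizer or a search procedure consumes. Neither is itself a training objective the way DPO or KTO is.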

Use this shortcut:

offline subjective alignment: evaluate DPO as a baseline
offline binary feedback: consider KTO
single-stage preference plus SFT: consider ORPO
reference-free, length-bias-sensitive: consider SimPO
near-deterministic labels, DPO overfits: consider IPO
online verifiable reasoning: consider GRPO with programmatic verifier rewards

post_training_method_router.py

scenarios = {
    "paired_offline_feedback": "DPO baseline",
    "binary_offline_feedback": "KTO candidate",
    "online_checkable_reward": "GRPO candidate",
}

for signal, candidate in scenarios.items():
    print(f"{signal} -> {candidate}")

Method routing output

paired_offline_feedback -> DPO baseline
binary_offline_feedback -> KTO candidate
online_checkable_reward -> GRPO candidate

That map is also why the curriculum splits reward modeling, RLHF/DPO, and RLVR into separate lessons. The names overlap in conversation, but the actual training loops differ.

Constitutional AI and RLAIF

A parallel evolution in alignment, reinforcement learning from AI feedback (RLAIF), reduces the amount of human labeling by replacing most pairwise annotations with AI feedback. Constitutional AI (CAI), introduced by Bai et al. (2022)^{[12]Reference 12Constitutional AI: Harmlessness from AI Feedback.https://arxiv.org/abs/2212.08073}, defines a written set of principles (a "constitution") and uses the model itself to:

Critique its own outputs against each principle
Revise the output to comply with the violated principle
Rank candidate outputs under those principles to generate preference data

In the original Constitutional AI paper, those AI critiques and revisions are used in two stages.^{[12]Reference 12Constitutional AI: Harmlessness from AI Feedback.https://arxiv.org/abs/2212.08073} First, revised responses from self-critique supervise an SFT-style phase. Then the model samples alternative answers, an AI judge ranks them under the constitution, a preference model is trained on those rankings, and an RL stage optimizes against that model.^{[12]Reference 12Constitutional AI: Harmlessness from AI Feedback.https://arxiv.org/abs/2212.08073} The core advantage is scale: you can synthesize far more feedback from a constitution than by hiring humans to label every harmful example. Bai et al. show that this can train a more harmless, less evasive assistant with far fewer human labels, though the result still depends heavily on the quality of the constitution and the judge model.^{[12]Reference 12Constitutional AI: Harmlessness from AI Feedback.https://arxiv.org/abs/2212.08073}

Constitutional AI works especially well when alignment requirements can be written as explicit rules, such as "cite retrieved sources when making factual claims" or "escalate high-risk access and safety issues instead of sounding certain." A written constitution makes the feedback policy easier to audit and update than ad hoc human-labeling instructions.
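The critique, revise, and rank steps can be sketched as a data pipeline. In a real system each step is a model call. Here they are replaced by hypothetical string-matching stand-ins so the example runs on its own. The principle names, the revision text, and the cited handbook are invented for illustration.

```python
# Each principle maps to a stand-in check that returns True when satisfied.
PRINCIPLES = {
    "cite_sources": lambda text: "[source:" in text,
    "escalate_high_risk": lambda text: "admin access" not in text or "escalate" in text,
}

def critique(text):
    """Stand-in for a model critique: names of the violated principles."""
    return [name for name, satisfied in PRINCIPLES.items() if not satisfied(text)]

def revise(text, violations):
    """Stand-in for a model revision that addresses each violation."""
    if "escalate_high_risk" in violations:
        text += " Please escalate this request to the security team."
    if "cite_sources" in violations:
        text += " [source: access-policy handbook]"
    return text

def rank(first, second):
    """Stand-in AI judge: fewer violated principles wins; None on a tie."""
    first_count, second_count = len(critique(first)), len(critique(second))
    if first_count == second_count:
        return None
    return (first, second) if first_count < second_count else (second, first)

prompt = "Can I give a contractor temporary admin access?"
draft = "You can grant temporary admin access yourself."

violations = critique(draft)
revised = revise(draft, violations)  # revised drafts can supervise an SFT-style phase
ranking = rank(draft, revised)       # AI rankings become preference data

if ranking is not None:
    chosen, rejected = ranking
    pair = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    print(f"violations={violations}")
    print(f"chosen={pair['chosen']!r}")
```

Everything downstream inherits the quality of the checks. A vague principle or a weak judge yields confidently labeled but wrong pairs, which is the judge-quality dependence noted above.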

When alignment breaks

Reward rises while human preference gets worse. Symptom: offline reward looks better, but held-out human wins flatten or drop. Fix: keep human eval held out, watch length and formatting drift, cap KL movement, and refresh the reward signal when policy outputs shift.
DPO loss stays near 0.693 after updates. Symptom: held-out pair quality also fails to improve. Fix: first verify gradients and optimizer steps, then audit contradictory labels, ties, leakage, and coverage rather than treating initialization loss as a diagnosis.
DPO improves style but misses safety boundaries. Symptom: outputs sound polished yet still cross policy lines. Fix: add targeted safety pairs, refusal-quality checks, and policy-specific evals instead of assuming generic preference data covers safety.
PPO updates make the model incoherent. Symptom: reward rises, but generations get unstable or repetitive. Fix: stop promotion, examine reward scale and drift, then tune KL/update size and refresh evaluation together rather than trusting proxy reward.

Practice checkpoints

What strong answers show

Foundational: Outline the RLHF pipeline from SFT to reward modeling to PPO.
Foundational: Explain why PPO needs a reference policy, reward model, value model, and KL constraint.
Intermediate: Compute a Bradley-Terry preference probability and a DPO loss from small log-probability numbers.
Intermediate: Compare RLHF's online rollout loop with DPO's offline pairwise objective.
Advanced: Diagnose reward hacking, length bias, weak preference pairs, and overtraining symptoms.
Advanced: Explain when KTO, ORPO, GRPO, ORM, and PRM belong in the post-training map.
Advanced: Explain how Constitutional AI and RLAIF reduce human labeling pressure while introducing judge-quality dependence.

What to remember

Why alignment: Instruction following doesn't equate to helpful, safe, or honest behavior.
RLHF pipeline: SFT to reward model to PPO. A common PPO setup coordinates four model roles during training.
Reward model: Trained on pairwise preference comparisons to predict human judgments.
PPO: Maximizes reward while keeping the model close to the reference using a KL constraint.
DPO: Under its KL-regularized preference-model assumptions, optimizes an implicit reward without a separate reward model or PPO loop.
Tradeoffs: RLHF supports online exploration but has more moving parts; DPO has a simpler offline loop but still depends on data and objective choices.
Modern variants: KTO and ORPO change offline preference tuning requirements, while GRPO is often used for online training with verifiable rewards.
Constitutional AI: RLAIF scales alignment by replacing much of the human labeling with principled AI feedback.
Production reality: Evaluate DPO as an offline baseline when pair data fits the task. Evaluate online methods when fresh sampling or verifier feedback is central.

Follow-up questions

Try this quick exercise before moving on. Suppose you have a DPO training batch with the following log probabilities (all base-e logs):

Response	Policy log-prob	Reference log-prob
Chosen (A)	-6.2	-6.5
Rejected (B)	-7.8	-7.1

With $\beta = 0.1$, compute the DPO loss by hand. Then answer: is the margin positive (the policy already prefers chosen over rejected relative to the reference) or negative?
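Once you have a hand answer, a short script can check it. The sketch below uses the table values and the standard DPO loss form, the negative log-sigmoid of beta times the margin. The variable names are illustrative.

```python
from math import exp, log

beta = 0.1
policy = {"chosen": -6.2, "rejected": -7.8}
reference = {"chosen": -6.5, "rejected": -7.1}

# Log-ratio of policy to reference for each response.
chosen_log_ratio = policy["chosen"] - reference["chosen"]
rejected_log_ratio = policy["rejected"] - reference["rejected"]

# Margin: how much more the policy favors chosen over rejected than the reference does.
margin = chosen_log_ratio - rejected_log_ratio

# DPO loss: -log(sigmoid(beta * margin)), written as log(1 + exp(-logit)).
loss = log(1 + exp(-beta * margin))

print(f"chosen_log_ratio={chosen_log_ratio:.1f}")
print(f"rejected_log_ratio={rejected_log_ratio:.1f}")
print(f"margin={margin:.1f} loss={loss:.3f}")
```

The margin comes out to 1.0, which is positive, and the loss is about 0.644, slightly below the 0.693 starting point because the policy has already moved in the preferred direction.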

Next Step

Continue to Constitutional AI & Red Teaming

There, you'll understand how Constitutional AI reduces reliance on repeated human preference labeling through AI critique and ranking, and how automated red teaming stress-tests those safeguards.


References

Training Language Models to Follow Instructions with Human Feedback (InstructGPT).

Ouyang, L., et al. · 2022 · NeurIPS 2022

Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Rafailov, R., et al. · 2023

Proximal Policy Optimization Algorithms.

Schulman, J., et al. · 2017

Human Alignment of Large Language Models through Online Preference Optimisation.

Calandriello, D., et al. · 2024

A General Theoretical Paradigm to Understand Learning from Human Feedback.

Azar, M. G., Rowland, M., et al. · 2023

KTO: Model Alignment as Prospect Theoretic Optimization.

Ethayarajh, K., et al. · 2024 · ICML 2024

ORPO: Monolithic Preference Optimization without Reference Model.

Hong, J., Lee, N., & Thorne, J. · 2024

SimPO: Simple Preference Optimization with a Reference-Free Reward.

Meng, Y., et al. · 2024

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

DeepSeek-AI · 2025

GRPO Trainer.

Hugging Face · 2026

Let's Verify Step by Step.

Lightman, H., et al. · 2023 · ICLR

Constitutional AI: Harmlessness from AI Feedback.

Bai, Y., et al. · 2022 · arXiv preprint
