Understand the RLHF pipeline and DPO, including reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.
Reward modeling showed how to turn chosen/rejected responses into a scalar preference signal. RLHF and DPO ask what to do with that signal: train a separate reward-driven policy, or optimize the policy directly from preference pairs.
Imagine asking a customer-support AI whether a delayed order should be refunded, and it returns something fluent, confident, and wrong. Or asking about a return-policy exception and getting advice that sounds authoritative but violates company rules. InstructGPT frames this gap clearly: pretrained language models can be untruthful, toxic, or unhelpful even when their text is polished.[1]
That is the alignment problem. Pretraining optimizes next-token prediction, not direct human preference. A language model optimized to predict plausible text can still produce responses that sound useful whether or not they are helpful, truthful, or safe. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are post-training methods that push a model toward responses preferred under a labeling policy. They don't, by themselves, prove truthfulness or safety outside that policy's coverage.
After instruction tuning, a model can follow instructions. But it may still generate harmful content, be dishonest, refuse reasonable requests, or produce verbose and unhelpful answers. This misalignment happens because next-token prediction doesn't inherently enforce a product's safety rules or a labeler's preference rubric. Predicting what comes next isn't the same as satisfying the behavior being evaluated.
Alignment optimizes for human preferences beyond task completion. It bridges the gap between next-token prediction (what's probable?) and helpful interaction (what's desirable?).
To see why SFT alone isn't enough, consider a customer-support bot given this context: "Policy: offer a refund after seven calendar days without delivery. Customer: My package hasn't arrived after 10 days. Can I get a refund?" The SFT model might generate two grammatically correct responses:
Response A follows the supplied policy. Response B is unhelpful and violates that stated policy. Both can look fluent. SFT teaches the model to imitate demonstrations; preference training supplies comparative signal when those distinctions are represented in the labeled data.[1]
That preference signal is what alignment adds.
RLHF (Reinforcement Learning from Human Feedback) is the classic large-scale preference-alignment pipeline popularized by InstructGPT: start from an SFT model, train a reward model from human comparisons, then optimize the policy with reinforcement learning.[1]
The key distinction is separation of roles. SFT gives the model baseline instruction-following behavior, the reward model approximates labeled preference, and PPO updates the policy while a KL term discourages large drift from the SFT reference.
RLHF runs in three distinct steps. First, collect demonstrations and perform Supervised Fine-Tuning (SFT) to establish baseline instruction-following. Then gather comparison data to train a proxy judge (the reward model). Finally, use reinforcement learning to optimize the language model against that judge.
Think of the reward model like a support-policy auditor comparing two draft replies. The auditor doesn't need to generate the reply; it needs to score the policy-correct, helpful answer above the vague or unsafe one.
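The mathematical foundation is the Bradley-Terry model, which assumes the probability that a human prefers response $y_1$ over $y_2$ depends on their underlying "reward" scores:[1][2]

$$P(y_1 \succ y_2 \mid x) = \frac{\exp(r(x, y_1))}{\exp(r(x, y_1)) + \exp(r(x, y_2))} = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)$$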
This model captures the intuition that the larger the gap in quality between two responses, the more confident we are about which one a human would prefer.
Let's walk through a concrete example. Suppose the reward model scores Response A at 2.0 and Response B at 0.5 for the same refund prompt. The probability that a human prefers A over B is:

$$P(A \succ B) = \sigma(2.0 - 0.5) = \sigma(1.5) = \frac{1}{1 + e^{-1.5}} \approx 0.818$$

That means the model predicts an 81.8% chance the human picks A. If the human picked A, the log-likelihood of that observation is $\log(0.818) \approx -0.201$. If the model had instead ranked B above A by the same gap, the log-likelihood of the same observation would be $\log(0.182) \approx -1.701$. The loss pushes the reward model toward the higher probability assignment.
```python
import math

def sigmoid(value: float) -> float:
    return 1 / (1 + math.exp(-value))

reward_a = 2.0
reward_b = 0.5
# Bradley-Terry: preference probability depends only on the reward gap.
preference_probability = sigmoid(reward_a - reward_b)
negative_log_likelihood = -math.log(preference_probability)

print(f"P(A preferred over B) = {preference_probability:.1%}")
print(f"NLL = {negative_log_likelihood:.3f}")
```

```text
P(A preferred over B) = 81.8%
NLL = 0.201
```

The reward model is trained to predict which of two responses a human would prefer. Minimizing the loss maximizes the log-likelihood of the observed preferences:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\bigr]$$

where $r_\phi$ is the learned reward model, $y_w$ is the preferred (winning) response, and $y_l$ is the rejected (losing) response for the same prompt $x$.
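To make the batch form of the reward-model loss concrete, the sketch below averages the pairwise negative log-likelihood over a few comparisons. The reward scores are made-up illustration values, not outputs of a trained reward model, and `pairwise_rm_loss` is a hypothetical helper name.

```python
import math


def log_sigmoid(value: float) -> float:
    # Numerically stable log(sigmoid(value)).
    if value >= 0:
        return -math.log1p(math.exp(-value))
    return value - math.log1p(math.exp(value))


def pairwise_rm_loss(pairs: list[tuple[float, float]]) -> float:
    # Each pair is (reward of chosen response, reward of rejected response).
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# Illustrative scores only: the last pair is mis-ranked by the reward model.
pairs = [(2.0, 0.5), (1.0, 0.8), (-0.3, 0.9)]
per_pair = [-log_sigmoid(r_w - r_l) for r_w, r_l in pairs]
batch_loss = pairwise_rm_loss(pairs)

for (r_w, r_l), loss in zip(pairs, per_pair):
    print(f"gap={r_w - r_l:+.1f} loss={loss:.3f}")
print(f"batch_loss={batch_loss:.3f}")
```

The mis-ranked pair contributes the largest loss, so it dominates the update in this toy batch.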
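Once we have a reward model, we treat the language model as a policy $\pi_\theta$. We use Proximal Policy Optimization (PPO), a reinforcement learning algorithm designed to make clipped, conservative policy updates[3], to maximize the expected reward. The objective maximizes reward while penalizing deviation from the reference policy:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\bigl[r_\phi(x, y)\bigr] - \beta \, \mathbb{D}_{\text{KL}}\bigl[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\bigr]$$

Here $r_\phi$ is the learned reward model, $\pi_{\text{ref}}$ is the reference policy, and $\beta$ sets the strength of the KL penalty.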
Think of the Kullback-Leibler (KL) term as a drift budget. The policy may seek higher learned reward, while deviations from the reference cost reward. This can reduce exposure to regions where the reward model is unreliable; it can't prove the reward is correct or prevent every exploitable shortcut. InstructGPT applies a per-token KL penalty to mitigate reward-model overoptimization.[1]
To see why drift control matters, imagine the policy starts generating refund responses that begin with "I am an AI assistant designed to help with refunds." If the reward model accidentally scores that opening higher because it correlates with politeness in training data, the policy can amplify it. A KL term charges departures from the reference, while held-out human checks still determine whether behavior improved.
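In the common conceptual PPO-style RLHF setup, one update coordinates four model roles. Implementations can shard, offload, or share parts of these roles, so this is a compute-and-state picture rather than a universal memory layout:

| Role | Updated during PPO? | Job in one update |
|---|---|---|
| Policy (actor) | Yes | Generates responses and receives the PPO update |
| Reference | No | Supplies log-probabilities for the KL term |
| Reward model | No | Scores sampled responses |
| Value model (critic) | Yes | Estimates expected return for advantage estimation |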
The calculation below isolates the shaped reward for sampled responses. It isn't a PPO trainer: a full implementation also estimates advantages, clips updates, and applies token-level masks.
```python
samples = [
    {"name": "near_reference", "reward": 1.20, "policy_logp": -2.1, "ref_logp": -2.2},
    {"name": "high_drift", "reward": 1.35, "policy_logp": -1.2, "ref_logp": -2.8},
]
beta = 0.20

for sample in samples:
    # Sampled log-ratio: how much more likely the policy makes this response.
    sampled_log_ratio = sample["policy_logp"] - sample["ref_logp"]
    shaped_reward = sample["reward"] - beta * sampled_log_ratio
    print(f"{sample['name']}: raw={sample['reward']:.2f} shaped={shaped_reward:.2f}")
```

```text
near_reference: raw=1.20 shaped=1.18
high_drift: raw=1.35 shaped=1.03
```

The higher raw score isn't automatically the preferred optimization target once policy drift is charged. In production, PPO implementations compute token-level KL penalties, estimate returns with a value model and often GAE, and separate collection from policy updates. The point here is the system shape: PPO-style RLHF generates fresh responses and coordinates policy, reference, reward, and value roles.[1][3]
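The token-level KL penalty mentioned above can be sketched as follows. This assumes one common convention, in which every response token is charged `beta` times its sampled log-ratio and the scalar reward-model score is added at the final token; real trainers differ in details, `shaped_token_rewards` is a hypothetical helper, and all numbers are illustrative.

```python
def shaped_token_rewards(
    policy_logps: list[float],
    ref_logps: list[float],
    sequence_reward: float,
    beta: float,
) -> list[float]:
    # Charge each token for its sampled log-ratio against the reference.
    rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Assumption: the reward-model score arrives once, at the last token.
    rewards[-1] += sequence_reward
    return rewards


policy_logps = [-0.9, -1.1, -0.4]
ref_logps = [-1.0, -1.5, -1.2]
token_rewards = shaped_token_rewards(
    policy_logps, ref_logps, sequence_reward=1.35, beta=0.20
)

print([round(value, 2) for value in token_rewards])
print(f"total={sum(token_rewards):.2f}")
```

Summing the token rewards recovers the sequence-level shaped reward, because the sequence log-ratio is the sum of the token log-ratios.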
Despite its power, the RLHF pipeline introduces practical challenges that make it expensive to run and hard to debug at scale.
The first problem is systems overhead. Standard PPO-style RLHF needs actor, reference, reward, and critic/value roles during a training cycle. Sharding and offload change resident memory, but not the computation and coordination burden. Each rollout batch also begins with current-policy generation before updates, so throughput becomes partly an inference problem.
Reinforcement learning also adds more knobs than supervised fine-tuning. PPO is sensitive to reward scale, the KL coefficient (beta), clipping threshold (epsilon), value-loss weighting, advantage estimation, and rollout quality. Weak drift control increases exposure to reward-model exploitation; overly strong control can suppress useful updates.[1][3]
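To make the clipping threshold concrete, the sketch below evaluates the PPO clipped surrogate[3] for a single sampled token. The probability ratios and advantages are invented for illustration, and `clipped_surrogate` is a hypothetical helper, not a library function.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    # ratio = pi_new(token | context) / pi_old(token | context).
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # PPO maximizes the pessimistic (minimum) of the two terms.
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    ("small step, positive advantage", 1.05, 1.0),
    ("large step, positive advantage", 1.50, 1.0),
    ("large step, negative advantage", 1.50, -1.0),
    ("shrinking step, negative advantage", 0.50, -1.0),
]

for name, ratio, advantage in cases:
    value = clipped_surrogate(ratio, advantage)
    print(f"{name}: ratio={ratio:.2f} objective={value:.2f}")
```

With a positive advantage, pushing the ratio past `1 + epsilon` earns nothing extra, which is what keeps updates conservative. With a negative advantage and a ratio that moved the wrong way, the unclipped term is kept so the mistake is still penalized in full.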
The reward model itself is a silent failure point. It was trained on a finite collection of human comparisons. When the improving policy starts generating responses outside that distribution, the proxy scores become unreliable. That is when overoptimization appears: the numeric reward keeps climbing while real human preference scores plateau or drop. In a customer-support setting, you might see the model start adding long disclaimers or repetitive apologies because those patterns happened to score high in the limited preference data.
| Issue | Description |
|---|---|
| Complexity | Four logical model roles (policy, reference, reward, value), with memory affected by sharding and offload |
| Tuning surface | PPO depends on clip ratio, learning rate, value-loss coefficients, reward scale, and sample quality |
| Reward hacking | Policy exploits reward model weaknesses when it generates out-of-distribution outputs |
| Coverage gaps | Policy can regress on behaviors that are weakly represented in the reward data |
| Cost | Human preference annotation at scale is expensive; PPO training is compute-intensive |
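One way to watch for the reward-hacking row above is to compare proxy reward with a held-out human win rate across checkpoints. The sketch uses invented checkpoint numbers and a hypothetical `first_divergence` helper; the rule is an assumption for illustration, not a standard metric.

```python
from typing import Optional


def first_divergence(checkpoints: list[dict]) -> Optional[int]:
    # Return the first step where proxy reward rose but human win rate fell.
    for previous, current in zip(checkpoints, checkpoints[1:]):
        proxy_up = current["proxy_reward"] > previous["proxy_reward"]
        human_down = current["human_win_rate"] < previous["human_win_rate"]
        if proxy_up and human_down:
            return current["step"]
    return None


# Invented numbers for illustration only.
checkpoints = [
    {"step": 0, "proxy_reward": 0.10, "human_win_rate": 0.50},
    {"step": 100, "proxy_reward": 0.45, "human_win_rate": 0.58},
    {"step": 200, "proxy_reward": 0.80, "human_win_rate": 0.61},
    {"step": 300, "proxy_reward": 1.20, "human_win_rate": 0.57},
]

print(f"first_divergence_step={first_divergence(checkpoints)}")
```

A flagged step is a prompt to inspect samples from that checkpoint, not proof of reward hacking on its own.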
PPO-style RLHF has a multi-role online loop. DPO instead targets fixed preference pairs with a simpler objective. It removes engineering components; it doesn't supply online exploration or guarantee the same outcome as every RLHF run.
At first glance, DPO looks like a completely different alignment recipe. The key mathematical result in the DPO paper is narrower: under a KL-regularized reward-maximization objective and a Bradley-Terry-style preference model, an implicit reward can be parameterized through policy-to-reference log-probability ratios.[2] That relationship yields an offline pairwise loss without training a standalone reward model or running PPO during DPO training.
Here's the "aha" moment in plain English. Under those modeling assumptions, an implicit reward is encoded by how policy log-probabilities change relative to a reference model. DPO can therefore train directly on labeled pairs. It learns only from the coverage and quality of those comparisons unless you build a separate data-refresh loop.
The implicit reward is:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

where $\pi^*$ is the optimal policy for the stated objective, $\pi_{\text{ref}}$ is the reference model, $\beta$ is tied to the KL-regularization trade-off and scales the DPO logit, and $Z(x)$ is a normalization term that depends only on prompt $x$. DPO then trains a policy from pairwise preferences under this parameterization.[2]
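By substituting this relationship back into the preference loss, DPO derives a loss function that depends only on the policy and reference model:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ is the preferred response and $y_l$ is the rejected response for the same prompt $x$, and $\pi_{\text{ref}}$ is the reference model that provides the baseline distribution. The reference model defines the relative log-probability baseline and shapes the divergence trade-off; it isn't proof that outputs stay fluent, calibrated, or safe.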
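The substitution step can be written out explicitly. Under the Bradley-Terry model only the reward difference matters, so the prompt-only term $\beta \log Z(x)$ cancels. The notation follows the DPO paper[2], with $\pi^*$ the optimal policy and $\pi_{\text{ref}}$ the reference model:

```latex
\begin{aligned}
P(y_w \succ y_l \mid x)
  &= \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr) \\
  &= \sigma\Bigl(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x)
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)\Bigr) \\
  &= \sigma\Bigl(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Bigr)
\end{aligned}
```

Replacing $\pi^*$ with the trainable policy $\pi_\theta$ and taking the negative log-likelihood over the preference dataset gives the DPO loss.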
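Let's trace the loss with actual numbers. Suppose for our refund prompt we have:

- Chosen response: policy log-prob $-8.2$, reference log-prob $-8.5$
- Rejected response: policy log-prob $-9.5$, reference log-prob $-9.1$
- $\beta = 0.1$

Step by step:

1. Chosen log-ratio: $-8.2 - (-8.5) = 0.3$
2. Rejected log-ratio: $-9.5 - (-9.1) = -0.4$
3. Relative margin: $0.3 - (-0.4) = 0.7$
4. Scaled logit: $0.1 \times 0.7 = 0.07$
5. Loss: $-\log \sigma(0.07) \approx 0.659$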
The loss pushes the policy to increase that 0.7 margin. If the policy becomes more confident on the chosen response (say, its log-prob rises to -7.5 while the rejected stays at -9.5), the margin grows to 1.4 and the loss falls to about 0.626. As the margin keeps growing, the sigmoid probability approaches 1.0 and the loss drops toward 0. In gradient terms, the update increases the likelihood of the chosen answer and decreases the likelihood of the rejected answer, with a weight $\sigma(-\beta \cdot \text{margin})$ that is largest when the implicit reward ranks the pair incorrectly.[2]
```python
from math import exp, log

chosen_policy, chosen_reference = -8.2, -8.5
rejected_policy, rejected_reference = -9.5, -9.1
beta = 0.1
# Relative margin: chosen log-ratio minus rejected log-ratio.
margin = (chosen_policy - chosen_reference) - (rejected_policy - rejected_reference)
logit = beta * margin
loss = -log(1 / (1 + exp(-logit)))

print(f"relative_margin={margin:.1f}")
print(f"scaled_logit={logit:.2f}")
print(f"dpo_loss={loss:.3f}")
```

```text
relative_margin=0.7
scaled_logit=0.07
dpo_loss=0.659
```

The implementation of the DPO loss is straightforward. The runnable example below uses a tiny causal language model so the tensor shapes are real without downloading an external checkpoint. A production implementation would add padding masks, attention masks, distributed training, and tokenizer-specific details.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

class TinyCausalLM(nn.Module):
    def __init__(self, vocab_size: int = 16, hidden_size: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> SimpleNamespace:
        hidden = self.embedding(input_ids)
        return SimpleNamespace(logits=self.projection(hidden))

def compute_log_probs(model, prompt_ids, response_ids) -> torch.Tensor:
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits
    # Logits at position t predict token t + 1, so shift back by one.
    response_logits = logits[:, prompt_ids.shape[-1] - 1:-1, :]
    log_probs = F.log_softmax(response_logits, dim=-1)
    token_log_probs = torch.gather(
        log_probs, dim=-1, index=response_ids.unsqueeze(-1)
    ).squeeze(-1)
    # Sum over response tokens to get one log-probability per sequence.
    return token_log_probs.sum(dim=-1)

def dpo_loss(
    policy_model: torch.nn.Module,
    ref_model: torch.nn.Module,
    batch: dict[str, torch.Tensor],
    beta: float = 0.1
) -> torch.Tensor:
    pi_w = compute_log_probs(policy_model, batch["prompt"], batch["chosen"])
    pi_l = compute_log_probs(policy_model, batch["prompt"], batch["rejected"])

    # The reference only supplies baseline log-probs; it gets no gradient.
    with torch.no_grad():
        ref_w = compute_log_probs(ref_model, batch["prompt"], batch["chosen"])
        ref_l = compute_log_probs(ref_model, batch["prompt"], batch["rejected"])

    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

torch.manual_seed(0)
policy = TinyCausalLM()
reference = TinyCausalLM()
reference.load_state_dict(policy.state_dict())
for param in reference.parameters():
    param.requires_grad_(False)

batch = {
    "prompt": torch.tensor([[1, 2], [1, 3]]),
    "chosen": torch.tensor([[4, 5], [4, 6]]),
    "rejected": torch.tensor([[7, 8], [7, 9]]),
}

# Nudge the policy toward the chosen tokens so it differs from the reference.
with torch.no_grad():
    policy.projection.bias[5] += 1.0
    policy.projection.bias[6] += 1.0

loss = dpo_loss(policy, reference, batch)
loss_is_scalar = loss.ndim == 0
loss_is_finite = bool(torch.isfinite(loss))

loss.backward()
# Sum of absolute gradient entries across all policy parameters.
grad_norm = sum(
    param.grad.abs().sum().item()
    for param in policy.parameters()
    if param.grad is not None
)

print(f"DPO loss: {loss.item():.3f}")
print(f"policy_grad_norm: {grad_norm:.3f}")
print(f"loss_is_scalar={loss_is_scalar}")
print(f"loss_is_finite={loss_is_finite}")
print(f"grad_norm_positive={grad_norm > 0}")
```

```text
DPO loss: 0.648
policy_grad_norm: 1.784
loss_is_scalar=True
loss_is_finite=True
grad_norm_positive=True
```

Visible feedback: At initialization, a policy copied from its reference naturally produces logits near zero and loss near 0.693. After training begins, monitor margins together with held-out preference quality and output regressions. Near-zero logits alone don't diagnose weak pairs or a bad beta.
In production, DPO implementations also mask padding tokens and explicitly track the chosen-vs-rejected log-probability margin during training. Teams usually monitor response length too, because whole-sequence log-probability sums can create length bias when chosen and rejected responses have systematically different lengths.
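A minimal sketch of the padding mask and the length effect described above, using plain Python lists instead of tensors. The per-token log-probabilities are invented, the padding positions hold arbitrary values that must be ignored, and `sequence_logp` is a hypothetical helper.

```python
def sequence_logp(token_logps: list[float], mask: list[int], average: bool = False) -> float:
    # Keep only real response tokens; mask is 1 for real tokens, 0 for padding.
    kept = [logp for logp, keep in zip(token_logps, mask) if keep]
    if not kept:
        raise ValueError("response has no unmasked tokens")
    total = sum(kept)
    return total / len(kept) if average else total


# Invented per-token log-probs; the short response is padded to length 6.
short_response = [-0.8, -0.7, -0.9, -5.0, -5.0, -5.0]
short_mask = [1, 1, 1, 0, 0, 0]
long_response = [-0.6, -0.6, -0.6, -0.6, -0.6, -0.6]
long_mask = [1, 1, 1, 1, 1, 1]

for name, logps, mask in [
    ("short", short_response, short_mask),
    ("long", long_response, long_mask),
]:
    summed = sequence_logp(logps, mask)
    averaged = sequence_logp(logps, mask, average=True)
    print(f"{name}: sum={summed:.2f} mean_per_token={averaged:.2f}")
```

In this toy case the longer response is more likely per token but has the lower summed log-probability, which is the kind of length dependence worth tracking alongside the margin.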
Vanilla DPO training is operationally simpler than PPO-style RLHF. It applies an offline pairwise loss instead of coordinating rollouts, a learned reward model, and a value model. A reference model provides baseline log-probabilities while the active policy is updated from preference data.
DPO turns preference alignment into a supervised pairwise classification loss. You still need a frozen reference model and high-quality chosen/rejected pairs, but you no longer train a separate reward model or value function.
To better understand the trade-offs between the traditional RLHF pipeline and DPO, compare what each method optimizes and what it has to keep alive during training.[1][2]
| Feature | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Train a separate reward model | No separate reward model; reward is implicit in the objective |
| Training loop | On-policy RL with rollouts | Offline pairwise loss on fixed comparisons |
| Logical model roles | Policy + ref + reward + critic | Policy + reference |
| Most expensive step | Sampling and scoring fresh responses | Forward/backward passes on fixed preference pairs |
| Stability | Sensitive to reward scale, KL weight, and PPO settings | Fewer interacting loops, but still sensitive to data and beta |
| Online exploration | Yes | No, not in vanilla DPO |
| Data dependence | Can keep collecting fresh comparisons during training | Limited by current preference coverage unless data is refreshed |
| Monitoring need | Reward drift, KL drift, value loss, rollout quality | Preference loss, margin growth, length bias |
DPO lowers system complexity and is a practical candidate baseline for offline pairwise preferences. That doesn't mean PPO is obsolete. If a team needs online exploration, fresh model-generated negatives, or a reward signal that changes as the model improves, RL-style alignment has capabilities fixed-dataset DPO doesn't.
Whether a team uses RLHF or DPO, preference-signal quality sets a major performance ceiling. In practice, collecting and cleaning that signal is often the slowest and most expensive part of the pipeline. Both methods start from pairwise preference data, but they use it differently: RLHF trains a reward model on those comparisons, while DPO optimizes the policy directly on the same comparisons.[1][2]
Regardless of the specific algorithm chosen, the foundation of preference optimization relies on high-quality preference data. A usable row needs more than chosen and rejected: it needs the same rendered prompt context, a clear non-tie label, and split provenance that prevents related comparisons leaking into evaluation.
```python
preference_example = {
    "prompt_id": "damaged-order-17",
    "prompt": "A customer says their order arrived damaged. What should I do?",
    "chosen": "I'm sorry to hear that. I can start a replacement or refund right away. "
              "Which would you prefer?",
    "rejected": "Things sometimes get damaged in shipping. You can file a claim "
                "with the carrier if you want.",
    "label": "chosen"
}

# Minimal row-level checks before a pair is allowed into training data.
required_fields = {"prompt_id", "prompt", "chosen", "rejected", "label"}
fields_present = required_fields <= preference_example.keys()
distinct_responses = preference_example["chosen"] != preference_example["rejected"]
binary_label = preference_example["label"] in {"chosen", "rejected"}

print(f"fields_present={fields_present}")
print(f"distinct_responses={distinct_responses}")
print(f"binary_label={binary_label}")
print("next_gate=group prompt_id before train/eval split")
```

```text
fields_present=True
distinct_responses=True
binary_label=True
next_gate=group prompt_id before train/eval split
```

Poorly annotated data can lead to undesirable behavior or failure to generalize the intended policy. That is why preference annotation needs a clear rubric, disagreement handling, and held-out checks.
| Factor | Good Practice | Bad Practice |
|---|---|---|
| Annotator agreement | Clear rubric, adjudication on disagreements | Ambiguous rubric with unresolved disagreement |
| Preference clarity | Keep labeler-consistent judgments, including hard pairs | Treat unresolved ties or disagreement as clean binary labels |
| Diversity | Cover edge cases, refusals, creativity | Only easy instructions |
| Source and split provenance | Record generators; group related pairs by prompt before splitting | Mix related comparisons across train and evaluation |
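The agreement and tie-handling rows above can be sketched as a small filter. The votes are invented, `summarize_votes` is a hypothetical helper, and the `min_agreement` threshold is an assumption to tune against your own rubric.

```python
from collections import Counter


def summarize_votes(votes: list[str]) -> tuple[str, float]:
    # Return the most common label and the share of annotators who chose it.
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    return label, top_count / len(votes)


# Invented annotator votes per pair; "tie" is a legitimate vote.
votes_by_pair = {
    "a-b": ["chosen", "chosen", "chosen"],
    "a-c": ["chosen", "rejected", "chosen"],
    "d-e": ["chosen", "rejected", "tie"],
}
min_agreement = 0.66  # Hypothetical threshold.

for pair_id, votes in votes_by_pair.items():
    label, agreement = summarize_votes(votes)
    keep = label != "tie" and agreement >= min_agreement
    action = "keep" if keep else "adjudicate"
    print(f"{pair_id}: majority={label} agreement={agreement:.2f} -> {action}")
```

Routing low-agreement pairs to adjudication keeps hard cases in play without pretending they are clean binary labels.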
Preference data from another generator isn't automatically invalid. The risk is coverage mismatch: labels may compare styles or capabilities the target policy rarely produces, while ignoring its actual failure modes. Record each generator, evaluate on fresh target-policy outputs, and include current-policy samples when distribution shift matters.
```python
pairs = [
    {"prompt_id": "p1", "pair_id": "a-b"},
    {"prompt_id": "p1", "pair_id": "a-c"},
    {"prompt_id": "p2", "pair_id": "d-e"},
]
# Split by prompt, not by pair, so related comparisons stay on one side.
eval_prompts = {"p1"}
train_prompts = {row["prompt_id"] for row in pairs if row["prompt_id"] not in eval_prompts}
overlap = train_prompts & eval_prompts

assert not overlap
print(f"eval_pairs={sum(row['prompt_id'] in eval_prompts for row in pairs)}")
print(f"train_pairs={sum(row['prompt_id'] in train_prompts for row in pairs)}")
print(f"prompt_overlap={sorted(overlap)}")
```

```text
eval_pairs=2
train_pairs=1
prompt_overlap=[]
```

While data quality matters more than raw count, dataset size still changes what the model can learn.
The most subtle failure mode in alignment is reward hacking: the model learns to exploit reward model weaknesses rather than genuinely improving. This happens because the reward model is merely a proxy for human preference, not a perfect representation of it.
When the language model is optimized against this proxy, it can discover edge cases where the proxy assigns high scores to degenerate outputs. The proxy becomes the target, and the target can stop representing true preference.
For example, if the reward model overweights length or surface formatting, policy optimization can amplify those traits faster than actual answer quality. The visible symptom is rising proxy reward without a matching improvement in held-out human preference.
| Reward Hack | Symptom | Detection |
|---|---|---|
| Length gaming | Responses get progressively longer | Track avg response length during training |
| Sycophancy | Model agrees with user even when wrong | Test with incorrect user statements |
| Format exploitation | Model wraps everything in markdown/lists for higher scores | Compare formatted vs plain-text scores |
| Confidence mimicry | Model sounds confident even when uncertain | Test calibration on known-hard questions |
Aligned models can become wordy when human annotators equate length with quality. A long, rambling refund explanation might score higher than a concise, accurate one because it looks more thorough. If your reward data has this bias, the policy learns to pad responses.
Symptom: Average response length grows steadily during training while helpfulness scores plateau. Fix: Add length-matched preference evaluations and annotate concise versus padded answers explicitly. Length penalties or normalization are interventions to validate, because they can also punish necessary detail.
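Before training, a quick audit can show whether the preference data itself leans toward longer answers. This sketch uses invented pairs, whitespace token counts as a rough length proxy, and a hypothetical `length_bias_report` helper.

```python
def length_bias_report(pairs: list[dict]) -> dict:
    # Compare whitespace-token lengths of chosen and rejected responses.
    longer_chosen = 0
    total_gap = 0
    for pair in pairs:
        chosen_len = len(pair["chosen"].split())
        rejected_len = len(pair["rejected"].split())
        longer_chosen += chosen_len > rejected_len
        total_gap += chosen_len - rejected_len
    return {
        "chosen_longer_rate": longer_chosen / len(pairs),
        "mean_length_gap": total_gap / len(pairs),
    }


# Invented pairs for illustration.
pairs = [
    {"chosen": "You qualify for a refund. I have started it now.", "rejected": "No."},
    {"chosen": "Refund approved.",
     "rejected": "We are sorry, but please wait a few more days before asking."},
    {"chosen": "I can replace the item or refund it. Which do you prefer?",
     "rejected": "File a claim."},
]

report = length_bias_report(pairs)
print(f"chosen_longer_rate={report['chosen_longer_rate']:.2f}")
print(f"mean_length_gap={report['mean_length_gap']:.1f}")
```

A high rate doesn't prove bias, since better answers are sometimes longer; it is a signal to run the length-matched evaluation described above.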
0.693 trap

If the policy begins equal to its reference, every DPO relative margin starts at zero and the loss starts near -log(0.5) = 0.693. That is expected, not evidence that an individual pair provides no gradient. At zero logit, the gradient of the loss with respect to the logit has magnitude 0.5, which is larger than at any positive logit.
Ambiguous pairs still hurt: if different annotators or duplicate prompts point in conflicting directions, their updates can cancel in aggregate or teach arbitrary style preferences. Diagnose data quality from agreement, slices, and held-out behavior, not from initial loss alone.
```python
from math import exp, log

def sigmoid(value):
    return 1 / (1 + exp(-value))

for logit in [0.0, 2.0]:
    loss = -log(sigmoid(logit))
    # Derivative of -log(sigmoid(logit)) with respect to the logit.
    gradient = sigmoid(logit) - 1
    print(f"logit={logit:.1f} loss={loss:.3f} gradient={gradient:.3f}")
```

```text
logit=0.0 loss=0.693 gradient=-0.500
logit=2.0 loss=0.127 gradient=-0.119
```

Symptom: Loss remains near 0.693 after meaningful training and held-out preference metrics don't improve.
Fix: Check that the policy is updating, then inspect contradictory labels, unresolved ambiguity, prompt leakage, and missing task coverage before changing beta.
To combat reward hacking and ensure the model genuinely improves, alignment engineers employ several defensive techniques:
```python
metrics = {
    "proxy_reward_delta": 0.31,
    "held_out_human_win_delta": -0.04,
    "mean_length_delta_tokens": 48,
}
# Gate on held-out human preference and length growth, not on proxy reward.
ready = metrics["held_out_human_win_delta"] > 0 and metrics["mean_length_delta_tokens"] < 20

print(f"proxy_improved={metrics['proxy_reward_delta'] > 0}")
print(f"human_preference_improved={metrics['held_out_human_win_delta'] > 0}")
print(f"release_ready={ready}")
```

```text
proxy_improved=True
human_preference_improved=False
release_ready=False
```
Vanilla DPO is an offline algorithm: it trains on a fixed dataset of chosen and rejected responses.[2] As the policy improves, those old comparisons become less representative of what the current model produces.
Several later papers study this online setting: sample fresh responses from the current policy, label those pairs with humans or a learned preference model, then apply an IPO-style update, or a closely related pairwise preference loss, on the new comparisons.[4]
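Conceptually, an online DPO loop has five steps:

1. Draw a batch of prompts.
2. Sample two or more responses per prompt from the current policy.
3. Label each comparison with humans or a learned preference model.
4. Store each labeled comparison as a (prompt, chosen, rejected) triple.
5. Apply a pairwise preference update on the new triples, then repeat.

There isn't one canonical "online DPO" algorithm, and reference-policy update choices differ across methods. The family resemblance is the important part: refresh preference data on-policy while keeping a pairwise preference loss instead of a full PPO loop.[4]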
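The data-refresh half of such a loop can be sketched with stand-ins. Everything here is hypothetical: `sample_responses` fakes sampling from the current policy, `judge` fakes a human or learned preference model with a toy rule, and the update step is left as a comment.

```python
import random


def sample_responses(prompt: str, rng: random.Random, count: int = 2) -> list[str]:
    # Hypothetical stand-in for sampling from the current policy.
    return [f"{prompt} :: draft {rng.randint(0, 999)}" for _ in range(count)]


def judge(prompt: str, first: str, second: str) -> tuple[str, str]:
    # Hypothetical stand-in for a human or learned preference model.
    # Toy rule: prefer the lexicographically smaller draft.
    return (first, second) if first <= second else (second, first)


def collect_round(prompts: list[str], rng: random.Random) -> list[dict]:
    triples = []
    for prompt in prompts:
        first, second = sample_responses(prompt, rng)
        if first == second:
            continue  # Identical drafts carry no preference signal.
        chosen, rejected = judge(prompt, first, second)
        triples.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return triples


rng = random.Random(0)
prompts = ["refund after 10 days?", "damaged order?"]
for round_index in range(2):
    triples = collect_round(prompts, rng)
    # A real loop would apply a pairwise preference update on `triples` here.
    print(f"round={round_index} new_triples={len(triples)}")
```

The design point is that each round's triples come from the policy as it currently behaves, so the comparisons track the model's present failure modes.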
While DPO is the standard offline baseline, it still requires high-quality pairwise preference data $(x, y_w, y_l)$. That data is expensive to collect and often noisy, especially when annotators disagree or the difference between two good answers is subtle. Newer methods attempt to relax this requirement, improve the optimization objective, or combine training stages to reduce pipeline complexity.
From that broader design space, several practical variants and adjacent post-training methods have become common discussion points:
| Method | Key Difference |
|---|---|
| IPO (Identity Preference Optimization) | Replaces DPO's log-sigmoid loss with a squared loss on the preference margin, so it can't drive the margin to infinity. This adds explicit regularization against overfitting when preference labels are nearly deterministic.[5] |
| KTO (Kahneman-Tversky Optimization) | Eliminates the need for pairs entirely. It works with binary feedback (thumbs up/down) for individual outputs, treating alignment as maximizing the utility of "good" outputs while minimizing "bad" ones.[6] |
| ORPO (Odds Ratio Preference Optimization) | Combines SFT and preference alignment into a single training stage, removing the need for a separate reference model.[7] |
| SimPO (Simple Preference Optimization) | Drops the reference model and uses length-normalized average log-probability as the implicit reward, plus a target margin. Its objective directly addresses sequence-length dependence; the authors tune a larger beta range than DPO in their experiments.[8] |
| GRPO (Group Relative Policy Optimization) | Not an offline preference-optimization loss. It is an online RL method that compares sampled outputs for the same prompt and uses group-relative advantages instead of a learned critic. Reward signals may be programmatic or learned; DeepSeek-R1 is a prominent verifier-heavy reasoning example.[9][10] |
IPO, KTO, ORPO, and SimPO all still live in the offline preference-optimization family. GRPO sits on a different branch: it is online reinforcement learning, not a DPO variant. Programmatic correctness checks are one high-value reward source, not a requirement of the algorithm.
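To see how two of these objectives differ from DPO on the same kind of input, the sketch below evaluates per-pair DPO, IPO[5], and SimPO[8] losses. The helper names, margins, and hyperparameter values are illustrative assumptions; `tau`, `beta`, and `gamma` follow the notation of the respective papers.

```python
from math import exp, log


def log_sigmoid(value: float) -> float:
    return -log(1 + exp(-value))


def dpo_loss(margin: float, beta: float) -> float:
    # margin = chosen (policy - reference) log-ratio minus the rejected one.
    return -log_sigmoid(beta * margin)


def ipo_loss(margin: float, tau: float) -> float:
    # IPO regresses the same margin toward the finite target 1 / (2 * tau).
    return (margin - 1 / (2 * tau)) ** 2


def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len, beta, gamma):
    # SimPO: reference-free, length-normalized log-probs with target margin gamma.
    logit = beta * (chosen_logp / chosen_len - rejected_logp / rejected_len) - gamma
    return -log_sigmoid(logit)


for margin in [0.0, 5.0, 50.0]:
    dpo = dpo_loss(margin, beta=0.1)
    ipo = ipo_loss(margin, tau=0.1)
    print(f"margin={margin:>4.1f} dpo={dpo:.3f} ipo={ipo:.1f}")

print(f"simpo={simpo_loss(-8.2, 10, -9.5, 8, beta=2.0, gamma=0.5):.3f}")
```

The DPO loss keeps shrinking as the margin grows, while the IPO loss is smallest at its target margin and rises again beyond it, which is the bounded-margin behavior described in the table.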
For teams optimizing subjective assistant behavior from fixed pairs, DPO is a useful baseline candidate. If you only have binary feedback rather than strict pairs, KTO is designed for that shape. If memory is tight, ORPO and SimPO avoid a separate frozen reference model. If DPO overfits near-deterministic labels, IPO tests a bounded-margin alternative. If the task has strong programmatic verifiers, online RL methods such as GRPO become worth evaluating.
```python
from statistics import mean, pstdev

rewards = [1.0, 0.0, 0.5, 1.0]
# Group-relative advantage: standardize rewards within one prompt's samples.
center = mean(rewards)
scale = pstdev(rewards)
advantages = [(reward - center) / scale for reward in rewards]

print(f"group_mean={center:.3f}")
print(f"advantages={[round(value, 3) for value in advantages]}")
print(f"advantage_mean={mean(advantages):.3f}")
```

```text
group_mean=0.625
advantages=[0.905, -1.508, -0.302, 0.905]
advantage_mean=0.000
```

These names get mixed together too easily. Some are optimizers, some are reward-model families, and some are full pipeline patterns.
| Family | Type | Training signal | Best fit |
|---|---|---|---|
| DPO | offline preference optimization | chosen vs rejected pairs | baseline candidate for offline subjective preference pairs |
| IPO | offline preference optimization | chosen vs rejected pairs plus squared-loss margin | when preference labels are near-deterministic and DPO overfits[5] |
| ORPO | offline preference optimization | chosen vs rejected pairs plus odds-ratio objective | when you want SFT and preference learning in one stage[7] |
| KTO | offline preference optimization | binary good/bad labels | when telemetry or moderation labels exist but strict pairs don't[6] |
| SimPO | offline preference optimization | chosen vs rejected pairs, reference-free, length-normalized | when reference-model memory is tight and length bias is a concern[8] |
| GRPO | online RL | grouped sampled outputs plus reward functions or models | tasks where online sampled comparison is valuable, often with checkable outcomes[9][10] |
| ORM | reward-model family | one learned scalar score on final answer | when outcome-level reward labels are the appropriate signal |
| PRM | reward-model family | scalar scores on intermediate reasoning steps | search or long-chain reasoning where early bad steps should be pruned[11] |
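The ORM and PRM rows differ in where the score attaches. The sketch below contrasts them on one invented solution; the product aggregation and the `prune_threshold` cutoff are illustrative assumptions, not the only way to use step scores.

```python
from math import prod

# Invented scores for one candidate solution with four reasoning steps.
step_correct_probs = [0.98, 0.95, 0.30, 0.90]  # PRM: one score per step
final_answer_score = 0.72  # ORM: one score for the final answer

# One aggregation choice: treat the solution score as the probability that
# every step is correct, i.e. the product of per-step probabilities.
prm_solution_score = prod(step_correct_probs)

prune_threshold = 0.5  # Hypothetical cutoff for search-time pruning.
first_bad_step = next(
    (index for index, prob in enumerate(step_correct_probs) if prob < prune_threshold),
    None,
)

print(f"orm_score={final_answer_score:.2f}")
print(f"prm_solution_score={prm_solution_score:.3f}")
print(f"first_step_below_threshold={first_bad_step}")
```

The step-level scores locate the weak step (index 2 here), which is what lets a search procedure prune early instead of waiting for a final-answer score.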
Two clarifications keep taxonomy clean:
Use this shortcut:
```python
scenarios = {
    "paired_offline_feedback": "DPO baseline",
    "binary_offline_feedback": "KTO candidate",
    "online_checkable_reward": "GRPO candidate",
}

for signal, candidate in scenarios.items():
    print(f"{signal} -> {candidate}")
```

```text
paired_offline_feedback -> DPO baseline
binary_offline_feedback -> KTO candidate
online_checkable_reward -> GRPO candidate
```

That map is also why the curriculum splits reward modeling, RLHF/DPO, and RLVR into separate lessons. Names overlap in conversation, but actual training loops differ.
A parallel evolution in alignment reduces the amount of human labeling by replacing most pairwise annotations with AI feedback. Constitutional AI (CAI), introduced by Bai et al. (2022)[12], defines a written set of principles (a "constitution") and uses the model itself to:

- critique and revise its own responses against those principles
- rank alternative responses by how well they follow those principles
In the original Constitutional AI paper, those AI critiques and revisions are used in two stages.[12] First, revised responses from self-critique supervise an SFT-style phase. Then the model samples alternative answers, an AI judge ranks them under the constitution, a preference model is trained on those rankings, and an RL stage optimizes against that model.[12] The core advantage is scale: you can synthesize far more feedback from a constitution than by hiring humans to label every harmful example. Bai et al. show that this can train a more harmless, less evasive assistant with far fewer human labels, though the result still depends heavily on the quality of the constitution and the judge model.[12]
Constitutional AI is especially valuable when alignment requirements can be written as explicit rules, such as "cite retrieved sources when making factual claims" or "escalate high-risk refund and safety issues instead of sounding certain." A written constitution makes the feedback policy easier to audit and update than ad hoc human-labeling instructions.
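As a toy illustration of turning written rules into preference pairs, the sketch below replaces the AI judge with keyword checks. This is a deliberate simplification: the principle names, the checks, and `ai_preference_pair` are all hypothetical, and a real system would prompt a model to critique and rank rather than match strings.

```python
# Hypothetical constitution: each principle pairs a name with a toy keyword check.
constitution = {
    "cites_policy": lambda text: "policy" in text.lower(),
    "escalates_high_risk": lambda text: "escalate" in text.lower(),
}


def toy_judge_score(response: str) -> int:
    # Count how many principles the response satisfies.
    return sum(check(response) for check in constitution.values())


def ai_preference_pair(prompt: str, first: str, second: str):
    # Build a (prompt, chosen, rejected) row from the toy judge, or None on a tie.
    first_score, second_score = toy_judge_score(first), toy_judge_score(second)
    if first_score == second_score:
        return None
    if first_score > second_score:
        chosen, rejected = first, second
    else:
        chosen, rejected = second, first
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = ai_preference_pair(
    "Customer reports a safety issue with a heater and wants a refund.",
    "Refunds are always fine, no need to check anything.",
    "Per policy this is high risk, so I will escalate it to a specialist now.",
)
print(pair["chosen"])
```

Because the rules live in one place, changing the feedback policy means editing the constitution and regenerating pairs, which is the auditability benefit described above.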
Loss stays near 0.693 after updates. Symptom: held-out pair quality also fails to improve. Fix: first verify gradients and optimizer steps, then audit contradictory labels, ties, leakage, and coverage rather than treating initialization loss as a diagnosis.

Try this quick exercise before moving on. Suppose you have a DPO training batch with the following log probabilities (all base-e logs):
| Response | Policy log-prob | Reference log-prob |
|---|---|---|
| Chosen (A) | -6.2 | -6.5 |
| Rejected (B) | -7.8 | -7.1 |
With , compute the DPO loss by hand. Then answer: is the margin positive (the policy already prefers chosen over rejected relative to the reference) or negative?
[1] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
[2] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
[3] Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms.
[4] Calandriello, D., et al. (2024). Human Alignment of Large Language Models through Online Preference Optimisation.
[5] Azar, M. G., Rowland, M., et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback.
[6] Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024.
[7] Hong, J., Lee, N., & Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model.
[8] Meng, Y., et al. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward.
[9] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
[10] Hugging Face (2026). GRPO Trainer.
[11] Lightman, H., et al. (2023). Let's Verify Step by Step. ICLR.
[12] Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint.