Train reward models as a first-class post-training stage: validate chosen/rejected pairs and splits, fit a scalar reward head with Bradley-Terry loss, audit generalization, and decide when explicit rewards are worth the extra complexity.
LoRA adapts a model's behavior cheaply. Preference alignment starts from the next training problem: once a model can answer, how do we teach it which answer people prefer?
Reinforcement Learning from Human Feedback (RLHF) diagrams often make reward modeling look trivial: collect preferences, train reward model, run Proximal Policy Optimization (PPO). In practice, the reward model is its own training project. If it learns the wrong shortcuts, policy optimization will happily amplify them.[1]
Start by isolating that stage. Before you think about PPO, Group Relative Policy Optimization (GRPO), or online exploration, explain what a reward model sees, what loss it optimizes, what metrics it logs, and how it fails.
A scalar reward model doesn't generate text. It scores text.
Given a prompt x and two candidate answers:
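- y+: chosen by the labeler
- y-: rejected by the labeler

the reward model should assign:

```text
r(x, y+) > r(x, y-)
```

That scalar score is later useful in two different ways: as the optimization signal for PPO-style policy training, and as a judge for scoring or reranking candidate responses.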
Explicit reward models remain relevant even though Direct Preference Optimization (DPO) can skip them for offline preference optimization.[2]
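A scalar scorer can rerank candidates: score every candidate for the same prompt and keep the highest. The sketch below is a minimal illustration, assuming a hypothetical `toy_reward` function that stands in for a trained reward model; its keyword cues and weights are invented for the example and are not a real scoring rule.

```python
def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a trained reward model's scalar score."""
    text = response.lower()
    score = 0.0
    if "ticket" in text:
        score += 1.0  # invented cue: follows the review process
    if "grant admin" in text:
        score -= 1.0  # invented cue: bypasses approval
    return score


def best_of_n(prompt: str, candidates: list[str]) -> tuple[str, float]:
    """Return the highest-scoring candidate and its score."""
    scored = [(toy_reward(prompt, candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda item: item[0])
    return best, best_score


prompt = "User requests temporary admin access for a migration."
candidates = [
    "Grant admin for tonight and ask the user to clean it up tomorrow.",
    "Open an access-review ticket and cite policy P-7 before approval.",
    "Ask the user why they need it.",
]
best, score = best_of_n(prompt, candidates)
print(f"best={best!r} score={score:+.1f}")
```

The reranker is only as good as the scorer: any shortcut the reward model has learned is selected for directly, which is why the audits later in this lesson matter.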
The core supervision format is a preference pair.
```json
{
  "prompt": "User requests temporary admin access for a migration. What should the assistant do?",
  "chosen": "Open an access-review ticket and cite policy P-7 before approval.",
  "rejected": "Grant admin for tonight and ask the user to clean it up tomorrow."
}
```

```json
{
  "chosen": [
    {"role": "user", "content": "User requests temporary admin access for a migration. What should the assistant do?"},
    {"role": "assistant", "content": "Open an access-review ticket and cite policy P-7 before approval."}
  ],
  "rejected": [
    {"role": "user", "content": "User requests temporary admin access for a migration. What should the assistant do?"},
    {"role": "assistant", "content": "Grant admin for tonight and ask the user to clean it up tomorrow."}
  ]
}
```

Hugging Face TRL supports both standard and conversational preference formats and can apply the model's chat template automatically during reward-model training.[3]
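The two layouts carry the same information, so one can be derived from the other. The sketch below converts a standard row into the conversational layout used in the second example. `to_conversational` is a hypothetical helper written for this lesson, not a TRL function, and it assumes a single-turn prompt.

```python
def to_conversational(row: dict) -> dict:
    """Wrap a standard {prompt, chosen, rejected} row as two chat transcripts."""
    def transcript(answer: str) -> list[dict]:
        # Each transcript repeats the same user turn, then one assistant turn.
        return [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": answer},
        ]

    return {
        "chosen": transcript(row["chosen"]),
        "rejected": transcript(row["rejected"]),
    }


standard_row = {
    "prompt": "User requests temporary admin access for a migration. What should the assistant do?",
    "chosen": "Open an access-review ticket and cite policy P-7 before approval.",
    "rejected": "Grant admin for tonight and ask the user to clean it up tomorrow.",
}
conversational_row = to_conversational(standard_row)

print([turn["role"] for turn in conversational_row["chosen"]])
print(conversational_row["chosen"][0] == conversational_row["rejected"][0])
```

Both transcripts share an identical user turn, which is the property the validation step checks before a pair is allowed into training.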
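A row isn't ready merely because it has chosen and rejected columns. For every binary preference pair, enforce:

- both candidates respond to the same prompt and context;
- the chosen and rejected texts differ;
- the label is a definite preference, not a tie or abstention;
- the row records a prompt or candidate-generation group ID.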
That last field is important for splitting. If one prompt generated several candidates, its comparisons are near-duplicates. Put the entire prompt or candidate-generation group in either train or evaluation, never both.
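One way to keep a group on a single side of the split is to derive the split from a stable hash of the group ID, so the assignment never depends on row order or on a random seed. This is a minimal sketch under assumptions: `split_for` and the 25% evaluation fraction are illustrative choices, not part of any library.

```python
import hashlib


def split_for(prompt_id: str, eval_fraction: float = 0.25) -> str:
    """Assign a whole prompt group to one split using a stable hash of its ID."""
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


pairs = [
    {"pair_id": "a-b", "prompt_id": "support-17"},
    {"pair_id": "a-c", "prompt_id": "support-17"},
    {"pair_id": "d-e", "prompt_id": "safety-04"},
    {"pair_id": "f-g", "prompt_id": "access-09"},
]

assignments = {pair["pair_id"]: split_for(pair["prompt_id"]) for pair in pairs}
train_prompts = {p["prompt_id"] for p in pairs if split_for(p["prompt_id"]) == "train"}
eval_prompts = {p["prompt_id"] for p in pairs if split_for(p["prompt_id"]) == "eval"}

# Pairs that share a prompt always land together, so the sets cannot overlap.
print(f"same_group_same_split={assignments['a-b'] == assignments['a-c']}")
print(f"prompt_overlap={sorted(train_prompts & eval_prompts)}")
```

With only a handful of groups the realized split sizes can be far from the target fraction, so check the counts before training.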
```python
from collections import Counter

pairs = [
    {"id": "a", "prompt_left": "admin?", "prompt_right": "admin?", "chosen": "Escalate.", "rejected": "Approve.", "label": "chosen"},
    {"id": "b", "prompt_left": "admin?", "prompt_right": "admin?", "chosen": "Escalate.", "rejected": "Escalate.", "label": "chosen"},
    {"id": "c", "prompt_left": "admin?", "prompt_right": "source?", "chosen": "Escalate.", "rejected": "Cite source.", "label": "chosen"},
    {"id": "d", "prompt_left": "source?", "prompt_right": "source?", "chosen": "Cite.", "rejected": "Refuse.", "label": "tie"},
]

def rejection_reason(pair):
    if pair["label"] != "chosen":
        return "tie_or_abstention"
    if pair["prompt_left"] != pair["prompt_right"]:
        return "context_mismatch"
    if pair["chosen"] == pair["rejected"]:
        return "identical_candidates"
    return None

reasons = Counter(reason for pair in pairs if (reason := rejection_reason(pair)))
kept = [pair["id"] for pair in pairs if rejection_reason(pair) is None]

print(f"kept={kept}")
print(f"rejected={dict(sorted(reasons.items()))}")
```

```text
kept=['a']
rejected={'context_mismatch': 1, 'identical_candidates': 1, 'tie_or_abstention': 1}
```

```python
pairs = [
    {"pair_id": "a-b", "prompt_id": "support-17"},
    {"pair_id": "a-c", "prompt_id": "support-17"},
    {"pair_id": "d-e", "prompt_id": "safety-04"},
    {"pair_id": "f-g", "prompt_id": "access-09"},
]
eval_prompt_ids = {"support-17"}

train = [pair for pair in pairs if pair["prompt_id"] not in eval_prompt_ids]
evaluation = [pair for pair in pairs if pair["prompt_id"] in eval_prompt_ids]
overlap = {pair["prompt_id"] for pair in train} & {pair["prompt_id"] for pair in evaluation}

assert not overlap
print(f"train_pairs={len(train)} eval_pairs={len(evaluation)}")
print(f"prompt_overlap={sorted(overlap)}")
```

```text
train_pairs=2 eval_pairs=2
prompt_overlap=[]
```

Most practical reward models aren't built from scratch. You start from a pretrained or SFT checkpoint and attach a scalar sequence-level head. Conceptually:
```text
tokens -> transformer hidden states -> sequence representation -> one scalar reward
```

For a decoder-only LM, that representation is often taken from the final non-padding position, then passed through a one-unit score head. Reward modeling feels like sequence classification with pairwise labels rather than generation. The output is one number per candidate response, not a next-token distribution.
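The pooling step can be sketched without any framework. The example below is a toy illustration, assuming right-padded sequences and an attention mask where 1 marks a real token; the hidden states and head weights are made-up numbers, and `score` is a hypothetical name rather than a library function.

```python
def last_non_padding_index(attention_mask: list[int]) -> int:
    """Index of the final real token, assuming right padding (1 = token, 0 = pad)."""
    return sum(attention_mask) - 1


def score(hidden_states, attention_mask, head_weights, head_bias=0.0) -> float:
    """Apply a one-unit linear head to the final non-padding hidden state."""
    representation = hidden_states[last_non_padding_index(attention_mask)]
    return sum(w * h for w, h in zip(head_weights, representation)) + head_bias


# Toy values: 4 positions with hidden size 3; the last position is padding.
hidden_states = [
    [0.1, 0.0, 0.2],
    [0.3, -0.1, 0.0],
    [0.5, 0.2, -0.4],
    [9.0, 9.0, 9.0],  # padding position, must not influence the score
]
attention_mask = [1, 1, 1, 0]
head_weights = [1.0, 2.0, 0.5]

reward = score(hidden_states, attention_mask, head_weights)
print(f"pooled_position={last_non_padding_index(attention_mask)} reward={reward:.2f}")
# prints: pooled_position=2 reward=0.70
```

Pooling from the wrong position, for example a padding slot, would silently change every score, which is one reason padding and end-of-sequence handling belong in the data contract.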
The rendered sequence is part of the contract. Keep chat-template and end-of-sequence conventions consistent with the policy being evaluated. Also don't silently train on answers whose decisive ending was truncated: current TRL RewardConfig.max_length filters a pair when either candidate exceeds the configured maximum after tokenization.[3]
```python
max_length = 1024
pairs = [
    {"id": "fits", "chosen_tokens": 412, "rejected_tokens": 390},
    {"id": "chosen_too_long", "chosen_tokens": 1088, "rejected_tokens": 401},
    {"id": "rejected_too_long", "chosen_tokens": 288, "rejected_tokens": 1030},
]

kept = [p["id"] for p in pairs if max(p["chosen_tokens"], p["rejected_tokens"]) <= max_length]
dropped = [p["id"] for p in pairs if p["id"] not in kept]

print(f"kept={kept}")
print(f"dropped_instead_of_truncated={dropped}")
```

```text
kept=['fits']
dropped_instead_of_truncated=['chosen_too_long', 'rejected_too_long']
```

The classic formulation says the probability that the chosen response wins depends on the reward difference:

$$P(y^{+} \succ y^{-} \mid x) = \sigma\big(r(x, y^{+}) - r(x, y^{-})\big)$$
Start with a tiny comparison. Suppose the policy-correct access answer gets reward 1.8, while the unsupported approval gets 0.7. The margin is 1.8 - 0.7 = 1.1. A positive margin means the reward model prefers the helpful answer. The Bradley-Terry model turns that margin into a preference probability: $\sigma(1.1) \approx 0.75$, where σ is the sigmoid function.[1]

The loss is the negative log-probability of the observed preference, so minimizing it maximizes that log-probability:

$$\mathcal{L} = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}\left[\log \sigma\big(r(x, y^{+}) - r(x, y^{-})\big)\right]$$

If the chosen reward is much higher than the rejected reward, loss becomes small. If the model ranks them backwards, loss becomes large.
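The worked comparison above (rewards 1.8 and 0.7) can be checked numerically. This is a small stdlib-only sketch; the helper names `preference_probability` and `pair_loss` are chosen for this example.

```python
from math import exp, log1p


def preference_probability(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry probability that the chosen response wins."""
    margin = chosen_reward - rejected_reward
    return 1.0 / (1.0 + exp(-margin))


def pair_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Negative log-probability of the observed preference, -log(sigmoid(margin))."""
    return log1p(exp(-(chosen_reward - rejected_reward)))


p = preference_probability(1.8, 0.7)
loss = pair_loss(1.8, 0.7)
swapped_loss = pair_loss(0.7, 1.8)  # same pair, ranked backwards

print(f"p_chosen_wins={p:.2f} loss={loss:.2f} swapped_loss={swapped_loss:.2f}")
# prints: p_chosen_wins=0.75 loss=0.29 swapped_loss=1.39
```

The swapped case differs from the correct one by exactly the margin (1.39 - 0.29 = 1.10), which shows how quickly the loss grows once a pair is ranked backwards.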
```python
import torch
import torch.nn.functional as F

chosen_rewards = torch.tensor([2.1, 0.8, 1.9])
rejected_rewards = torch.tensor([0.4, 1.0, 1.2])

margins = chosen_rewards - rejected_rewards
loss = -F.logsigmoid(margins).mean()
accuracy = (margins > 0).float().mean()

print("margins:", [round(float(x), 3) for x in margins])
print("reward_loss:", round(float(loss), 4))
print("pair_accuracy:", round(float(accuracy), 4))
```

```text
margins: [1.7, -0.2, 0.7]
reward_loss: 0.4564
pair_accuracy: 0.6667
```

```python
from math import exp, log1p

for margin in [-2.0, 0.0, 2.0]:
    loss = log1p(exp(-margin))
    print(f"margin={margin:+.1f} loss={loss:.4f}")
```

```text
margin=-2.0 loss=2.1269
margin=+0.0 loss=0.6931
margin=+2.0 loss=0.1269
```

That's the ranking core. Everything else in reward modeling is about making sure the data and evaluation around that loss are strong enough to be trusted.
TRL's reward-model training logs more than loss for a reason.[3]

Useful metrics include pair accuracy and the mean margin, r(chosen) - r(rejected).

Loss alone isn't enough. A reward model can lower training loss by overfitting to easy stylistic cues that don't hold up under real policy outputs.
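Accuracy and margin statistics can be computed directly from a batch of chosen and rejected scores. The sketch below is illustrative: the metric names are chosen for this example and are not claimed to match any trainer's exact log keys.

```python
from statistics import mean


def reward_metrics(chosen_rewards, rejected_rewards) -> dict:
    """Summarize a batch of paired reward scores."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    all_rewards = list(chosen_rewards) + list(rejected_rewards)
    return {
        "pair_accuracy": mean(1.0 if m > 0 else 0.0 for m in margins),
        "mean_margin": mean(margins),
        "min_reward": min(all_rewards),
        "mean_reward": mean(all_rewards),
        "max_reward": max(all_rewards),
    }


metrics = reward_metrics([2.1, 0.8, 1.9], [0.4, 1.0, 1.2])
for name, value in metrics.items():
    print(f"{name}={value:.3f}")
```

Watching the reward range alongside accuracy helps catch drift in score scale or offset that a ranking metric alone would hide.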
Reward models are underdetermined up to an additive constant: adding the same number to every score leaves every margin and the Bradley-Terry loss unchanged. Scaling scores is different. It preserves a ranking but changes the loss confidence and the strength of a reward signal consumed by an optimizer.
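Operationally, that matters because raw scores from different reward models or checkpoints are not directly comparable, and because the reward scale sets how hard a downstream optimizer is pushed.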
TRL exposes center_rewards_coefficient to encourage mean-zero rewards. It's a centering aid, not proof that reward magnitude is calibrated for policy optimization.[3]
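To see why a centering term breaks the additive-constant symmetry, the sketch below adds a penalty to the pairwise loss. It assumes the penalty is the coefficient times the mean squared sum of chosen and rejected rewards, which is how the TRL documentation describes the auxiliary loss. `centered_pair_loss` and the coefficient 0.01 are illustrative assumptions, not TRL code.

```python
from math import exp, log1p
from statistics import mean


def centered_pair_loss(chosen, rejected, coefficient=0.01):
    """Pairwise ranking loss plus an assumed squared-sum centering penalty."""
    ranking = mean(log1p(exp(-(c - r))) for c, r in zip(chosen, rejected))
    centering = mean((c + r) ** 2 for c, r in zip(chosen, rejected))
    return ranking + coefficient * centering


chosen = [1.2, 0.7]
rejected = [0.2, 0.4]
shifted_chosen = [score + 10 for score in chosen]
shifted_rejected = [score + 10 for score in rejected]

base = centered_pair_loss(chosen, rejected)
shifted = centered_pair_loss(shifted_chosen, shifted_rejected)

# The ranking term is identical in both cases; only the penalty grows.
print(f"base={base:.4f} shifted={shifted:.4f} extra_penalty={shifted - base:.4f}")
```

The shift leaves every margin untouched, so the whole difference comes from the penalty. That is the sense in which the term encourages mean-zero rewards without saying anything about whether the scale is right.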
```python
from math import exp, log1p
from statistics import mean

chosen = [1.2, 0.7]
rejected = [0.2, 0.4]

def pair_loss(left, right):
    return mean(log1p(exp(-(a - b))) for a, b in zip(left, right))

shifted = ([score + 10 for score in chosen], [score + 10 for score in rejected])
scaled = ([score * 3 for score in chosen], [score * 3 for score in rejected])

print(f"base_loss={pair_loss(chosen, rejected):.4f}")
print(f"shifted_loss={pair_loss(*shifted):.4f}")
print(f"scaled_loss={pair_loss(*scaled):.4f}")
```

```text
base_loss=0.4338
shifted_loss=0.4338
scaled_loss=0.1949
```

Pair accuracy can look healthy while the reward model still learns a bad shortcut. This tiny audit separates pair accuracy from length bias. The model ranks all three preference pairs correctly, but its reward is suspiciously correlated with response length.
```python
from statistics import mean

pairs = [
    {"chosen_reward": 1.8, "rejected_reward": 0.7, "chosen_tokens": 36, "rejected_tokens": 19},
    {"chosen_reward": 2.4, "rejected_reward": 1.1, "chosen_tokens": 58, "rejected_tokens": 22},
    {"chosen_reward": 1.6, "rejected_reward": 0.3, "chosen_tokens": 33, "rejected_tokens": 14},
]

margins = [row["chosen_reward"] - row["rejected_reward"] for row in pairs]
accuracy = mean(margin > 0 for margin in margins)
length_gaps = [row["chosen_tokens"] - row["rejected_tokens"] for row in pairs]

print(f"pair_accuracy={accuracy:.2f}")
print(f"mean_margin={mean(margins):.2f}")
print(f"chosen_answers_longer={all(gap > 0 for gap in length_gaps)}")
print("next_check=build length-matched eval pairs")
```

```text
pair_accuracy=1.00
mean_margin=1.23
chosen_answers_longer=True
next_check=build length-matched eval pairs
```

Annotator disagreement is another failure signal. A pair can be formatted correctly and still be weak supervision if raters don't agree about which completion is better. The toy gate below routes any disputed label to review; a real pipeline may adjudicate, weight, or retain disagreements for a dedicated evaluation slice.
```python
votes = {
    "clear_safety": ["chosen", "chosen", "chosen"],
    "style_only": ["chosen", "rejected", "chosen"],
    "ambiguous_refusal": ["chosen", "rejected", "tie"],
}

for pair_id, labels in votes.items():
    chosen_share = labels.count("chosen") / len(labels)
    status = "train" if chosen_share == 1.0 else "review_or_hold_out"
    print(f"{pair_id}: chosen_share={chosen_share:.2f} status={status}")
```

```text
clear_safety: chosen_share=1.00 status=train
style_only: chosen_share=0.67 status=review_or_hold_out
ambiguous_refusal: chosen_share=0.33 status=review_or_hold_out
```

The core question isn't whether the reward model fits the training pairs. It's whether, when the current policy produces fresh responses, the reward model still ranks them the way humans would. That's the distribution-shift problem.
As the policy improves, it starts producing answers unlike the ones in the original preference dataset. The reward model may then score confidently for the wrong reasons. This is one path to reward hacking, a concrete instance of Goodhart's law: once a proxy metric (the learned reward) becomes the optimization target, it can stop tracking the thing you cared about (real human preference).[4]
If you don't monitor that gap, policy optimization can amplify the shortcut. The threshold below is illustrative; set release gates from your evaluation design and risk tolerance.
```python
evaluation = {
    "static_held_out_pairs": {"accuracy": 0.92, "human_reviewed": False},
    "fresh_policy_pairs": {"accuracy": 0.64, "human_reviewed": True},
}
minimum_fresh_accuracy = 0.80
ppo_ready = evaluation["fresh_policy_pairs"]["accuracy"] >= minimum_fresh_accuracy

print(f"static_accuracy={evaluation['static_held_out_pairs']['accuracy']:.2f}")
print(f"fresh_accuracy={evaluation['fresh_policy_pairs']['accuracy']:.2f}")
print(f"ppo_ready={ppo_ready}")
```

```text
static_accuracy=0.92
fresh_accuracy=0.64
ppo_ready=False
```

A common control, developed in full by the next lesson, penalizes the policy for drifting too far from its reference checkpoint. Optimization maximizes reward minus a KL-divergence term measuring policy drift.[1] This discourages large departures from the reference policy, but doesn't certify that the reward model is valid on new outputs. Keep the claim narrow: a reward model is a local approximation of human preference, not a global truth, which is exactly why it can be overoptimized.
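The "reward minus a KL term" idea can be sketched for a single sampled response. This is a toy illustration under assumptions: the per-token log-probabilities are invented, `beta=0.1` is an arbitrary penalty weight, and the KL term is approximated by summing log-probability differences over the sampled tokens, which is one common estimate rather than the only formulation.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Return (reward - beta * kl_estimate, kl_estimate) for one sampled response.

    kl_estimate sums log pi(token) - log pi_ref(token) over the sampled tokens.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate, kl_estimate


# Hypothetical per-token log-probabilities for the same three-token response.
reference_logprobs = [-1.2, -0.8, -2.0]
near_policy_logprobs = [-1.1, -0.8, -1.9]     # policy close to the reference
drifted_policy_logprobs = [-0.2, -0.1, -0.3]  # policy far from the reference

near, near_kl = kl_penalized_reward(2.0, near_policy_logprobs, reference_logprobs)
far, far_kl = kl_penalized_reward(2.0, drifted_policy_logprobs, reference_logprobs)

print(f"near: kl={near_kl:.2f} penalized_reward={near:.2f}")
print(f"far:  kl={far_kl:.2f} penalized_reward={far:.2f}")
# prints: near: kl=0.20 penalized_reward=1.98
#         far:  kl=3.40 penalized_reward=1.66
```

Both responses receive the same raw reward of 2.0; only the drifted policy pays a noticeable penalty. The penalty limits drift, but it cannot tell whether the raw reward was deserved.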
Held-out pairs you wrote yourself can share your blind spots. RewardBench is an Ai2 benchmark that scores a reward model by how often it ranks a known-better completion above a worse one. Its sections cover chat, harder instruction-following comparisons, safety, reasoning, and prior preference test sets.[5] RewardBench 2 reports a harder multi-skill, best-of-four evaluation using mostly previously unused human prompts and verifies no overlap with the downstream evaluations it compares against. In its experiments, benchmark scores correlate with best-of-N performance and provide a useful signal for PPO, rather than only measuring static pair accuracy.[6]
One practical caveat from that work: the highest-scoring reward model on the leaderboard isn't automatically the best choice for your run. In the paper's PPO experiments, reward models based on the same model lineage as the policy transferred better than mismatched choices. Treat absolute benchmark rank as a filter, then validate with the policy and optimization setup you'll use.[6]
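Best-of-four scoring is stricter than pairwise accuracy, and a short sketch makes the difference concrete. It assumes each item has one known-better completion and three worse ones, and counts an item as correct only when the known-better completion gets the top score. The scores are invented, and this is a sketch of the idea rather than the benchmark's own code.

```python
def best_of_n_accuracy(items) -> float:
    """Fraction of items where the known-better completion outscores all others."""
    hits = sum(item["correct_score"] > max(item["rejected_scores"]) for item in items)
    return hits / len(items)


def pairwise_accuracy(items) -> float:
    """Fraction of (better, worse) comparisons ranked correctly."""
    comparisons = [
        item["correct_score"] > rejected
        for item in items
        for rejected in item["rejected_scores"]
    ]
    return sum(comparisons) / len(comparisons)


items = [
    {"correct_score": 2.3, "rejected_scores": [1.1, 0.4, 1.9]},
    {"correct_score": 0.9, "rejected_scores": [1.4, 0.2, 0.3]},
    {"correct_score": 1.7, "rejected_scores": [0.5, 1.6, -0.2]},
    {"correct_score": 0.8, "rejected_scores": [0.1, 0.0, 0.7]},
]

print(f"best_of_four_accuracy={best_of_n_accuracy(items):.2f}")
print(f"pairwise_accuracy={pairwise_accuracy(items):.2f}")
# prints: best_of_four_accuracy=0.75
#         pairwise_accuracy=0.92
```

A single high-scoring bad completion fails the whole item under best-of-four, which is closer to what happens when a reward model is used to pick one response out of several.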
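The scalar Bradley-Terry head is a common baseline, but the space has widened. Generative reward models express the judgment as generated text rather than a single score from a head.[7] Process reward models score intermediate reasoning steps instead of only the final answer.[8] Verifiable rewards replace the learned judge with a programmatic check of correctness, as in Tülu 3.[9]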
These don't retire the scalar reward model, and they don't all share its Bradley-Terry objective. The reusable lesson is the evaluation discipline: inspect the signal's coverage, test it on outputs produced by the system being optimized, and watch for optimization exploiting its blind spots.
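Use an explicit reward model when:

- you need PPO-style online optimization;
- you want to rerank or score candidate responses;
- you need an inspectable scalar preference signal inside a larger training loop.

Start with DPO when:

- you have a clean offline preference dataset;
- you don't need a separate learned judge.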
That trade-off is why DPO is a strong offline-preference baseline: it removes the separate reward-model training stage.[2] But it doesn't supply a reusable scalar judge for PPO or candidate scoring. This final toy gate combines several checks, with project-specific thresholds, to block optimization when only one static slice passes.
```python
checks = {
    "grouped_split_has_no_prompt_overlap": True,
    "length_matched_accuracy": 0.84,
    "fresh_policy_human_accuracy": 0.78,
    "minimum_required_accuracy": 0.80,
}
ready = (
    checks["grouped_split_has_no_prompt_overlap"]
    and checks["length_matched_accuracy"] >= checks["minimum_required_accuracy"]
    and checks["fresh_policy_human_accuracy"] >= checks["minimum_required_accuracy"]
)

print(f"static_slice_passes={checks['length_matched_accuracy'] >= checks['minimum_required_accuracy']}")
print(f"fresh_policy_passes={checks['fresh_policy_human_accuracy'] >= checks['minimum_required_accuracy']}")
print(f"optimize_against_reward={ready}")
```

```text
static_slice_passes=True
fresh_policy_passes=False
optimize_against_reward=False
```

Check that you can:
Held-out pair accuracy alone is not enough, because held-out pairs can still look like the training distribution. PPO changes the policy, so the reward model must rank fresh policy outputs well, not original preference pairs alone. You still need shortcut audits, fresh-sample checks, and an external benchmark or human comparison before trusting the reward signal.
Use an explicit reward model when you need PPO-style online optimization, candidate reranking, or an inspectable scalar preference signal inside a larger training loop. Evaluate DPO as the simpler offline baseline when you have a clean preference dataset and don't need a separate learned judge.
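DPO needs no separate judge because it applies the same pairwise loss to an implicit reward built from log-probability ratios between the policy and a frozen reference.[2] The sketch below evaluates that loss for one pair. The sequence log-probabilities and `beta=0.1` are hypothetical values chosen for illustration.

```python
from math import exp, log1p


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, implicit reward margin) for one preference pair."""
    chosen_implicit = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_implicit = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_implicit - rejected_implicit
    return log1p(exp(-margin)), margin


# Policy identical to the reference: the implicit margin is zero.
loss_at_reference, margin_at_reference = dpo_pair_loss(-12.0, -11.0, -12.0, -11.0)

# Policy has raised the chosen log-probability and lowered the rejected one.
loss_after_update, margin_after_update = dpo_pair_loss(-9.0, -15.0, -12.0, -11.0)

print(f"at_reference: margin={margin_at_reference:.2f} loss={loss_at_reference:.4f}")
print(f"after_update: margin={margin_after_update:.2f} loss={loss_after_update:.4f}")
# prints: at_reference: margin=0.00 loss=0.6931
#         after_update: margin=0.70 loss=0.4032
```

The implicit reward lives inside the policy's own log-probabilities, so there is no standalone scorer to reuse for reranking or for scoring another model's outputs.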
When reward looks correlated with response length, build length-matched evaluation pairs and rescore fresh policy outputs. That reveals whether the model learned a real quality signal or a "longer is better" shortcut. If the accuracy gap between the full set and the length-matched set remains, add human checks and adversarial probes before more policy optimization.
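A length-matched slice can be built by keeping only pairs whose token counts are close, then comparing accuracy on that slice with accuracy on the full set. This is a minimal sketch: the 10% tolerance, the helper names, and the toy scores are assumptions for illustration.

```python
def length_matched(pairs, max_relative_gap=0.10):
    """Keep pairs whose token counts differ by at most max_relative_gap."""
    kept = []
    for pair in pairs:
        longer = max(pair["chosen_tokens"], pair["rejected_tokens"])
        gap = abs(pair["chosen_tokens"] - pair["rejected_tokens"]) / longer
        if gap <= max_relative_gap:
            kept.append(pair)
    return kept


def pair_accuracy(pairs) -> float:
    """Share of pairs where the chosen response scores higher; NaN if empty."""
    if not pairs:
        return float("nan")
    return sum(p["chosen_reward"] > p["rejected_reward"] for p in pairs) / len(pairs)


pairs = [
    {"chosen_reward": 1.8, "rejected_reward": 0.7, "chosen_tokens": 36, "rejected_tokens": 19},
    {"chosen_reward": 2.4, "rejected_reward": 1.1, "chosen_tokens": 58, "rejected_tokens": 22},
    {"chosen_reward": 0.9, "rejected_reward": 1.3, "chosen_tokens": 30, "rejected_tokens": 31},
    {"chosen_reward": 1.2, "rejected_reward": 0.8, "chosen_tokens": 40, "rejected_tokens": 42},
]
matched = length_matched(pairs)

print(f"all_pairs_accuracy={pair_accuracy(pairs):.2f} n={len(pairs)}")
print(f"length_matched_accuracy={pair_accuracy(matched):.2f} n={len(matched)}")
# prints: all_pairs_accuracy=0.75 n=4
#         length_matched_accuracy=0.50 n=2
```

In this toy data the model only looks strong where the chosen answer is much longer. A slice this small proves nothing by itself, so a real audit needs enough matched pairs for the accuracy estimate to be stable.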
Start with RewardBench as a broad filter, but validate transfer on your own policy and training setup when PPO is the real target. RewardBench 2's same-lineage result is evidence from its tested setups, not a universal selection rule.
[1] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
[2] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
[3] Hugging Face (2026). Reward Modeling.
[4] Gao, L., Schulman, J., & Hilton, J. (2023). Scaling Laws for Reward Model Overoptimization.
[5] Lambert, N., et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling.
[6] Malik, S., et al. (2025). RewardBench 2: Advancing Reward Model Evaluation.
[7] Mahan, D., et al. (2024). Generative Reward Models.
[8] Lightman, H., et al. (2023). Let's Verify Step by Step. ICLR.
[9] Lambert, N., et al. (2024). Tülu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv preprint.