Reward Modeling from Preference Data

Train reward models as a first-class post-training stage: validate chosen/rejected pairs and splits, fit a scalar reward head with Bradley-Terry loss, audit generalization, and decide when explicit rewards are worth the extra complexity.

LoRA adapts a model's behavior cheaply. Preference alignment starts from the next training problem: once a model can answer, how do we teach it which answer people prefer?

Reinforcement Learning from Human Feedback (RLHF) diagrams often make reward modeling look trivial: collect preferences, train reward model, run Proximal Policy Optimization (PPO). In practice, the reward model is its own training project. If it learns the wrong shortcuts, policy optimization will happily amplify them [1].

Start by isolating that stage. Before you think about PPO, Group Relative Policy Optimization (GRPO), or online exploration, explain what a reward model sees, what loss it optimizes, what metrics it logs, and how it fails.

Figure: Reward-modeling flow from validated chosen versus rejected preference pairs to scalar rewards, margin comparison, and fresh-output trust checks before optimization. Reward modeling is a standalone stage: preference pairs become scalar scores, and fresh-output checks determine whether downstream optimization should trust that signal.

What reward modeling is trying to learn

A scalar reward model doesn't generate text. It scores text.

Given a prompt x and two candidate answers:

y+ chosen by the labeler
y- rejected by the labeler

the reward model should assign:

r(x, y+) > r(x, y-)

That scalar score is later useful in two different ways:

as a scalar reward signal for PPO-style RLHF [1]
as an inspectable ranking signal when you want to compare policy outputs
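The second use can be sketched in a few lines. This is a minimal illustration, assuming the scalar scores have already been produced by a trained reward model; the candidate texts, the scores, and the helper name `rank_by_reward` are hypothetical.

```python
# Hypothetical candidates for one prompt. In practice each "reward" value
# would come from a trained reward model, not from hand-written numbers.
candidates = [
    {"text": "Grant admin for tonight.", "reward": 0.7},
    {"text": "Open an access-review ticket and cite policy P-7.", "reward": 1.8},
    {"text": "Ask the user why they need access.", "reward": 1.1},
]


def rank_by_reward(rows):
    """Return candidates sorted from highest to lowest scalar reward."""
    return sorted(rows, key=lambda row: row["reward"], reverse=True)


ranked = rank_by_reward(candidates)
best = ranked[0]
print(f"best={best['text']!r} reward={best['reward']:.1f}")
```

The ranking is only as trustworthy as the scores behind it, which is why the rest of this lesson concentrates on data and evaluation rather than on the sort.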

Explicit reward models remain relevant even though Direct Preference Optimization (DPO) can skip them for offline preference optimization [2].

What the dataset looks like

The core supervision format is a preference pair.

Standard format

preference_pair.json

{
  "prompt": "User requests temporary admin access for a migration. What should the assistant do?",
  "chosen": "Open an access-review ticket and cite policy P-7 before approval.",
  "rejected": "Grant admin for tonight and ask the user to clean it up tomorrow."
}

Conversational format

chat_preference_pair.json

{
  "chosen": [
    {"role": "user", "content": "User requests temporary admin access for a migration. What should the assistant do?"},
    {"role": "assistant", "content": "Open an access-review ticket and cite policy P-7 before approval."}
  ],
  "rejected": [
    {"role": "user", "content": "User requests temporary admin access for a migration. What should the assistant do?"},
    {"role": "assistant", "content": "Grant admin for tonight and ask the user to clean it up tomorrow."}
  ]
}

Hugging Face TRL supports both standard and conversational preference formats and can apply the model's chat template automatically during reward-model training [3].
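The two formats carry the same information, so one can be derived from the other. The sketch below is a hand-rolled conversion for illustration only: `to_conversational` is a hypothetical helper, not a TRL function, and it assumes a single-turn prompt with no system message.

```python
def to_conversational(pair):
    """Expand a standard prompt/chosen/rejected row into two chat transcripts.

    Assumes a single user turn. Multi-turn or system-message prompts would
    need the full shared prefix copied into both transcripts.
    """
    user_turn = {"role": "user", "content": pair["prompt"]}
    return {
        "chosen": [dict(user_turn), {"role": "assistant", "content": pair["chosen"]}],
        "rejected": [dict(user_turn), {"role": "assistant", "content": pair["rejected"]}],
    }


standard = {
    "prompt": "User requests temporary admin access for a migration. What should the assistant do?",
    "chosen": "Open an access-review ticket and cite policy P-7 before approval.",
    "rejected": "Grant admin for tonight and ask the user to clean it up tomorrow.",
}

chat = to_conversational(standard)
# Both transcripts must share the same context and differ only in the answer.
same_context = chat["chosen"][:-1] == chat["rejected"][:-1]
print(f"same_context={same_context}")
```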

A pair contract before training

A row isn't ready merely because it has chosen and rejected columns. For every binary preference pair, enforce:

the same prompt, system message, tool context, and rendering template on both candidates
two different candidate answers and a definite preference label
separate handling for ties, abstentions, and ambiguous or low-agreement labels
provenance such as source prompt, generator checkpoint, sampling settings, and labeling batch

That last field is important for splitting. If one prompt generated several candidates, its comparisons are near-duplicates. Put the entire prompt or candidate-generation group in either train or evaluation, never both.

preference_pair_contract.py

from collections import Counter

pairs = [
    {"id": "a", "prompt_left": "admin?", "prompt_right": "admin?", "chosen": "Escalate.", "rejected": "Approve.", "label": "chosen"},
    {"id": "b", "prompt_left": "admin?", "prompt_right": "admin?", "chosen": "Escalate.", "rejected": "Escalate.", "label": "chosen"},
    {"id": "c", "prompt_left": "admin?", "prompt_right": "source?", "chosen": "Escalate.", "rejected": "Cite source.", "label": "chosen"},
    {"id": "d", "prompt_left": "source?", "prompt_right": "source?", "chosen": "Cite.", "rejected": "Refuse.", "label": "tie"},
]

def rejection_reason(pair):
    if pair["label"] != "chosen":
        return "tie_or_abstention"
    if pair["prompt_left"] != pair["prompt_right"]:
        return "context_mismatch"
    if pair["chosen"] == pair["rejected"]:
        return "identical_candidates"
    return None

reasons = Counter(reason for pair in pairs if (reason := rejection_reason(pair)))
kept = [pair["id"] for pair in pairs if rejection_reason(pair) is None]

print(f"kept={kept}")
print(f"rejected={dict(sorted(reasons.items()))}")

Pair-contract audit output

kept=['a']
rejected={'context_mismatch': 1, 'identical_candidates': 1, 'tie_or_abstention': 1}

grouped_preference_split.py

pairs = [
    {"pair_id": "a-b", "prompt_id": "support-17"},
    {"pair_id": "a-c", "prompt_id": "support-17"},
    {"pair_id": "d-e", "prompt_id": "safety-04"},
    {"pair_id": "f-g", "prompt_id": "access-09"},
]
eval_prompt_ids = {"support-17"}

train = [pair for pair in pairs if pair["prompt_id"] not in eval_prompt_ids]
evaluation = [pair for pair in pairs if pair["prompt_id"] in eval_prompt_ids]
overlap = {pair["prompt_id"] for pair in train} & {pair["prompt_id"] for pair in evaluation}

assert not overlap
print(f"train_pairs={len(train)} eval_pairs={len(evaluation)}")
print(f"prompt_overlap={sorted(overlap)}")

Grouped split output

train_pairs=2 eval_pairs=2
prompt_overlap=[]
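The split above hard-codes the evaluation prompt IDs. One way to make the assignment reproducible as the dataset grows is to hash the group key. This is a sketch under assumptions: `split_for` and the 20 percent evaluation fraction are illustrative choices, and `prompt_id` is assumed to identify the whole candidate-generation group.

```python
import hashlib


def split_for(prompt_id, eval_fraction=0.2):
    """Map a prompt ID to 'eval' or 'train' using a stable hash bucket.

    Every pair sharing a prompt_id lands in the same split, and the
    assignment does not change when new rows are added later.
    """
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


pairs = [
    {"pair_id": "a-b", "prompt_id": "support-17"},
    {"pair_id": "a-c", "prompt_id": "support-17"},
    {"pair_id": "d-e", "prompt_id": "safety-04"},
    {"pair_id": "f-g", "prompt_id": "access-09"},
]

assignments = {pair["pair_id"]: split_for(pair["prompt_id"]) for pair in pairs}
for pair_id, split in assignments.items():
    print(f"{pair_id}: {split}")
```

With only a handful of prompts the realized split can be far from the target fraction, so check the resulting counts before training.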

The usual architecture

Most practical reward models aren't built from scratch. You start from a pretrained or SFT checkpoint and attach a scalar sequence-level head. Conceptually:

tokens -> transformer hidden states -> sequence representation -> one scalar reward

For a decoder-only LM, that representation is often taken from the final non-padding position, then passed through a one-unit score head. Reward modeling feels like sequence classification with pairwise labels rather than generation. The output is one number per candidate response, not the next token distribution.
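The pooling step can be shown without a deep-learning framework. The sketch below uses plain Python lists in place of tensors; the hidden states, the head weights, and both helper names are made up for illustration, and the attention mask is assumed to use 1 for real tokens and 0 for padding.

```python
def last_non_padding_index(attention_mask):
    """Index of the final real token, assuming 1 = token and 0 = padding."""
    positions = [i for i, flag in enumerate(attention_mask) if flag == 1]
    if not positions:
        raise ValueError("sequence has no non-padding tokens")
    return positions[-1]


def scalar_reward(hidden_states, attention_mask, weights, bias=0.0):
    """Apply a one-unit linear head to the final non-padding hidden state."""
    pooled = hidden_states[last_non_padding_index(attention_mask)]
    return sum(w * h for w, h in zip(weights, pooled)) + bias


# One sequence of four positions with hidden size 2; the last slot is padding.
hidden_states = [[0.1, 0.2], [0.4, -0.3], [0.9, 0.5], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]
weights = [2.0, -1.0]  # hypothetical score-head weights

reward = scalar_reward(hidden_states, attention_mask, weights)
print(f"reward={reward:.2f}")
```

Reading the mask rather than taking the last array position matters: with right padding, the final slot holds a padding vector, and scoring it would silently give every padded sequence the same reward.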

The rendered sequence is part of the contract. Keep chat-template and end-of-sequence conventions consistent with the policy being evaluated. Also don't silently train on answers whose decisive ending was truncated: current TRL RewardConfig.max_length filters a pair when either candidate exceeds the configured maximum after tokenization [3].

sequence_length_gate.py

max_length = 1024
pairs = [
    {"id": "fits", "chosen_tokens": 412, "rejected_tokens": 390},
    {"id": "chosen_too_long", "chosen_tokens": 1088, "rejected_tokens": 401},
    {"id": "rejected_too_long", "chosen_tokens": 288, "rejected_tokens": 1030},
]

kept = [p["id"] for p in pairs if max(p["chosen_tokens"], p["rejected_tokens"]) <= max_length]
dropped = [p["id"] for p in pairs if p["id"] not in kept]

print(f"kept={kept}")
print(f"dropped_instead_of_truncated={dropped}")

Sequence-length gate output

kept=['fits']
dropped_instead_of_truncated=['chosen_too_long', 'rejected_too_long']

Bradley-Terry loss in one page

Start with a tiny comparison. Suppose the policy-correct access answer gets reward 1.8, while the unsupported approval gets 0.7. The margin is 1.8 - 0.7 = 1.1. A positive margin means the reward model prefers the policy-correct answer. The Bradley-Terry model turns that margin into a preference probability: $\sigma(1.1) \approx 0.75$.

In general, the classic formulation says the probability that the chosen response wins depends only on the reward difference:

$$P(y^+ \succ y^- \mid x) = \sigma\big(r(x, y^+) - r(x, y^-)\big)$$

where σ is the sigmoid function [1].

Training maximizes the log-probability of the observed preference, which is the same as minimizing its negative:

$$\mathcal{L} = -\log \sigma\big(r(x, y^+) - r(x, y^-)\big)$$

If the chosen reward is much higher than the rejected reward, loss becomes small. If the model ranks them backwards, loss becomes large.
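The same behavior shows up in the gradient. The short derivation below is a worked step added for clarity; the shorthand $\Delta$ for the chosen-minus-rejected margin is introduced here and is not notation from the cited papers.

```latex
\Delta = r(x, y^+) - r(x, y^-), \qquad \mathcal{L}(\Delta) = -\log \sigma(\Delta)

\frac{\partial \mathcal{L}}{\partial \Delta}
  = -\frac{\sigma'(\Delta)}{\sigma(\Delta)}
  = -\big(1 - \sigma(\Delta)\big)
  = -\sigma(-\Delta)

\frac{\partial \mathcal{L}}{\partial r(x, y^+)} = -\sigma(-\Delta),
\qquad
\frac{\partial \mathcal{L}}{\partial r(x, y^-)} = +\sigma(-\Delta)
```

Gradient descent therefore pushes the chosen reward up and the rejected reward down by the same amount, scaled by how wrong the model currently is. At a margin of 1.1 the scale is $\sigma(-1.1) \approx 0.25$; at a backwards margin of -2 it is $\sigma(2) \approx 0.88$; for a large positive margin it approaches zero, so confidently correct pairs contribute almost nothing.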

reward_model_loss.py

import torch
import torch.nn.functional as F

chosen_rewards = torch.tensor([2.1, 0.8, 1.9])
rejected_rewards = torch.tensor([0.4, 1.0, 1.2])

margins = chosen_rewards - rejected_rewards
loss = -F.logsigmoid(margins).mean()
accuracy = (margins > 0).float().mean()

print("margins:", [round(float(x), 3) for x in margins])
print("reward_loss:", round(float(loss), 4))
print("pair_accuracy:", round(float(accuracy), 4))

Reward loss output

margins: [1.7, -0.2, 0.7]
reward_loss: 0.4564
pair_accuracy: 0.6667

reward_margin_curve.py

from math import exp, log1p

for margin in [-2.0, 0.0, 2.0]:
    loss = log1p(exp(-margin))
    print(f"margin={margin:+.1f} loss={loss:.4f}")

Margin curve output

margin=-2.0 loss=2.1269
margin=+0.0 loss=0.6931
margin=+2.0 loss=0.1269

That's the ranking core. Everything else in reward modeling is about making sure the data and evaluation around that loss are strong enough to be trusted.

What you should monitor during training

TRL's reward trainer logs more than loss for a reason [3].

Useful metrics include:

pair accuracy: how often chosen beats rejected
margin: average r(chosen) - r(rejected)
mean/min/max reward: catch drift or exploding scale
gradient norm: catch unstable updates
held-out preference quality: does the ranking still match human judgment off the train split?

Loss alone isn't enough. A reward model can lower training loss by overfitting to easy stylistic cues that don't hold up under real policy outputs.
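Most of these metrics fall out of the same two score vectors. The sketch below computes them for one batch; `reward_batch_metrics` and its key names are illustrative and are not the metric names any particular trainer logs.

```python
from statistics import mean


def reward_batch_metrics(chosen, rejected):
    """Summarize one batch of chosen/rejected reward scores."""
    margins = [c - r for c, r in zip(chosen, rejected)]
    all_rewards = list(chosen) + list(rejected)
    return {
        "pair_accuracy": mean(1.0 if m > 0 else 0.0 for m in margins),
        "mean_margin": mean(margins),
        "mean_reward": mean(all_rewards),  # drifting mean suggests an offset problem
        "min_reward": min(all_rewards),
        "max_reward": max(all_rewards),    # a widening range suggests exploding scale
    }


metrics = reward_batch_metrics([2.1, 0.8, 1.9], [0.4, 1.0, 1.2])
for name, value in metrics.items():
    print(f"{name}={value:.4f}")
```

Gradient norm and held-out preference quality are not included because they need the model and a separate evaluation set, not just a batch of scores.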

Centering and calibration

Reward models are underdetermined up to an additive constant: adding the same number to every score leaves every margin and the Bradley-Terry loss unchanged. Scaling scores is different. Multiplying every score by a positive constant preserves the ranking but changes the margins, so it changes how confident the implied preference probabilities are, the value of the loss, and the strength of the reward signal consumed by an optimizer.

Operationally, that matters because:

absolute reward level and score scale can drift over training
PPO-style optimization becomes sensitive to reward scale
long verbose answers can look better than they are if the reward model learned a shallow heuristic

TRL exposes center_rewards_coefficient to encourage mean-zero rewards. It's a centering aid, not proof that reward magnitude is calibrated for policy optimization [3].

reward_centering_invariance.py

from math import exp, log1p
from statistics import mean

chosen = [1.2, 0.7]
rejected = [0.2, 0.4]

def pair_loss(left, right):
    return mean(log1p(exp(-(a - b))) for a, b in zip(left, right))

shifted = ([score + 10 for score in chosen], [score + 10 for score in rejected])
scaled = ([score * 3 for score in chosen], [score * 3 for score in rejected])

print(f"base_loss={pair_loss(chosen, rejected):.4f}")
print(f"shifted_loss={pair_loss(*shifted):.4f}")
print(f"scaled_loss={pair_loss(*scaled):.4f}")

Centering invariance output

base_loss=0.4338
shifted_loss=0.4338
scaled_loss=0.1949
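Because the pairwise loss ignores additive shifts, a centering term has to look at something other than the margin. The sketch below assumes the auxiliary penalty takes the form of a coefficient times the mean squared sum of the two rewards in each pair, which is my reading of how TRL applies center_rewards_coefficient; treat the exact form and the 0.01 coefficient as assumptions and check the library source before relying on them.

```python
from statistics import mean


def centering_penalty(chosen, rejected, coefficient=0.01):
    """Assumed auxiliary term: coefficient * mean((r_chosen + r_rejected) ** 2).

    The penalty is zero when each pair's rewards sum to zero, so it pulls
    scores toward a mean of zero without touching the margins directly.
    """
    return coefficient * mean((c + r) ** 2 for c, r in zip(chosen, rejected))


chosen = [1.2, 0.7]
rejected = [0.2, 0.4]

base_penalty = centering_penalty(chosen, rejected)
shifted_penalty = centering_penalty(
    [score + 10 for score in chosen],
    [score + 10 for score in rejected],
)

print(f"base_penalty={base_penalty:.4f}")
print(f"shifted_penalty={shifted_penalty:.4f}")
```

The shifted scores give the same Bradley-Terry loss as the base scores but a much larger penalty, which is the point: the auxiliary term sees the offset that the ranking loss cannot.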

A tiny reward audit

Pair accuracy can look healthy while the reward model still learns a bad shortcut. This tiny audit separates pair accuracy from length bias. The model ranks all three preference pairs correctly, but its reward is suspiciously correlated with response length.

reward_audit.py

from statistics import mean

pairs = [
    {"chosen_reward": 1.8, "rejected_reward": 0.7, "chosen_tokens": 36, "rejected_tokens": 19},
    {"chosen_reward": 2.4, "rejected_reward": 1.1, "chosen_tokens": 58, "rejected_tokens": 22},
    {"chosen_reward": 1.6, "rejected_reward": 0.3, "chosen_tokens": 33, "rejected_tokens": 14},
]

margins = [row["chosen_reward"] - row["rejected_reward"] for row in pairs]
accuracy = mean(margin > 0 for margin in margins)
length_gaps = [row["chosen_tokens"] - row["rejected_tokens"] for row in pairs]

print(f"pair_accuracy={accuracy:.2f}")
print(f"mean_margin={mean(margins):.2f}")
print(f"chosen_answers_longer={all(gap > 0 for gap in length_gaps)}")
print("next_check=build length-matched eval pairs")

Reward audit output

pair_accuracy=1.00
mean_margin=1.23
chosen_answers_longer=True
next_check=build length-matched eval pairs
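The suggested next check can be sketched as a filter plus the same accuracy calculation. The evaluation rows, the 10 percent relative-gap tolerance, and both helper names are hypothetical; a real audit would pick the tolerance from the length distribution of its own data.

```python
def length_matched(pairs, max_relative_gap=0.10):
    """Keep pairs whose token counts differ by at most max_relative_gap."""
    kept = []
    for row in pairs:
        longer = max(row["chosen_tokens"], row["rejected_tokens"])
        gap = abs(row["chosen_tokens"] - row["rejected_tokens"]) / longer
        if gap <= max_relative_gap:
            kept.append(row)
    return kept


def pair_accuracy(pairs):
    """Share of pairs where the chosen answer scores strictly higher."""
    if not pairs:
        return None
    wins = sum(row["chosen_reward"] > row["rejected_reward"] for row in pairs)
    return wins / len(pairs)


eval_pairs = [
    {"chosen_reward": 1.8, "rejected_reward": 0.7, "chosen_tokens": 36, "rejected_tokens": 19},
    {"chosen_reward": 0.9, "rejected_reward": 1.4, "chosen_tokens": 30, "rejected_tokens": 31},
    {"chosen_reward": 1.5, "rejected_reward": 1.2, "chosen_tokens": 40, "rejected_tokens": 42},
    {"chosen_reward": 2.4, "rejected_reward": 1.1, "chosen_tokens": 58, "rejected_tokens": 22},
]

matched = length_matched(eval_pairs)
print(f"all_pairs_accuracy={pair_accuracy(eval_pairs):.2f}")
print(f"length_matched_accuracy={pair_accuracy(matched):.2f} n={len(matched)}")
```

A large drop from the full set to the length-matched slice is the warning sign: it suggests the model wins mostly on pairs where the chosen answer is simply longer.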

Annotator disagreement is another failure signal. A pair can be formatted correctly and still be weak supervision if raters don't agree about which completion is better. The toy gate below routes any disputed label to review; a real pipeline may adjudicate, weight, or retain disagreements for a dedicated evaluation slice.

agreement_audit.py

votes = {
    "clear_safety": ["chosen", "chosen", "chosen"],
    "style_only": ["chosen", "rejected", "chosen"],
    "ambiguous_refusal": ["chosen", "rejected", "tie"],
}

for pair_id, labels in votes.items():
    chosen_share = labels.count("chosen") / len(labels)
    status = "train" if chosen_share == 1.0 else "review_or_hold_out"
    print(f"{pair_id}: chosen_share={chosen_share:.2f} status={status}")

Agreement audit output

clear_safety: chosen_share=1.00 status=train
style_only: chosen_share=0.67 status=review_or_hold_out
ambiguous_refusal: chosen_share=0.33 status=review_or_hold_out

The real evaluation question

The core question isn't whether the reward model fits the training pairs. It's whether, when the current policy produces fresh responses, the reward model still ranks them the way humans would. That's the distribution-shift problem.

As the policy improves, it starts producing answers unlike the ones in the original preference dataset. The reward model may then score confidently for the wrong reasons. This is one path to reward hacking, a concrete instance of Goodhart's law: once a proxy metric (the learned reward) becomes the optimization target, it can stop tracking the thing you cared about (real human preference) [4]. The typical symptoms appear together:

the reward rises
held-out human preference stops rising
human raters see longer, repetitive, or otherwise worse answers

If you don't monitor that gap, policy optimization can amplify the shortcut. The threshold below is illustrative; set release gates from your evaluation design and risk tolerance.

fresh_policy_gate.py

evaluation = {
    "static_held_out_pairs": {"accuracy": 0.92, "human_reviewed": False},
    "fresh_policy_pairs": {"accuracy": 0.64, "human_reviewed": True},
}
minimum_fresh_accuracy = 0.80
ppo_ready = evaluation["fresh_policy_pairs"]["accuracy"] >= minimum_fresh_accuracy

print(f"static_accuracy={evaluation['static_held_out_pairs']['accuracy']:.2f}")
print(f"fresh_accuracy={evaluation['fresh_policy_pairs']['accuracy']:.2f}")
print(f"ppo_ready={ppo_ready}")

Fresh-policy gate output

static_accuracy=0.92
fresh_accuracy=0.64
ppo_ready=False

KL control intuition (preview)

A common control, developed in full by the next lesson, penalizes the policy for drifting too far from its reference checkpoint. Optimization maximizes reward minus a KL-divergence term measuring policy drift [1]. This discourages large departures from the reference policy, but doesn't certify that the reward model is valid on new outputs. Keep the claim narrow: a reward model is a local approximation of human preference, not a global truth, which is exactly why it can be overoptimized.
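As a preview, the "reward minus KL term" idea can be written for a single sampled response. This is a simplified sketch: the log-probabilities are made up, `beta=0.1` is an arbitrary illustrative coefficient, and the KL term is approximated by the summed per-token log-probability difference on the sampled tokens, which is one common estimator rather than the only one.

```python
def kl_shaped_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    kl_estimate sums log pi(token) - log pi_ref(token) over the sampled
    tokens; a larger value means the policy has drifted further.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate, kl_estimate


# Hypothetical per-token log-probabilities for a three-token response.
policy_logprobs = [-0.2, -0.1, -0.3]
reference_logprobs = [-0.9, -1.4, -0.8]

shaped, kl = kl_shaped_reward(2.0, policy_logprobs, reference_logprobs)
print(f"kl_estimate={kl:.2f} shaped_reward={shaped:.2f}")
```

The penalty lowers the effective reward for responses the reference model finds unlikely, but it says nothing about whether the raw reward of 2.0 was deserved.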

Figure: Reward-model generalization gap, from training pairs to fresh policy outputs, showing that rising learned reward can diverge from human preference. Training-pair accuracy is only local fit. Trust comes from fresh policy outputs and human checks that catch when learned reward stops tracking real preference.

Standardized evaluation: RewardBench

Held-out pairs you wrote yourself can share your blind spots. RewardBench is an Ai2 benchmark that scores a reward model by how often it ranks a known-better completion above a worse one. Its sections cover chat, harder instruction-following comparisons, safety, reasoning, and prior preference test sets [5]. RewardBench 2 reports a harder multi-skill, best-of-four evaluation using mostly previously unused human prompts and verifies no overlap with the downstream evaluations it compares against. In its experiments, benchmark scores correlate with best-of-N performance and provide a useful signal for PPO, rather than only measuring static pair accuracy [6].

One practical caveat from that work: the highest-scoring reward model on the leaderboard isn't automatically the best choice for your run. In the paper's PPO experiments, reward models based on the same model lineage as the policy transferred better than mismatched choices. Treat absolute benchmark rank as a filter, then validate with the policy and optimization setup you'll use [6].
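To make the difference between pairwise and best-of-four scoring concrete, here is a simplified sketch of the latter. It is not the official RewardBench 2 scoring code: the scores are invented, and counting a tie with the top rejected score as a miss is an assumption made here.

```python
def best_of_n_accuracy(examples):
    """Fraction of prompts where the known-better completion gets the top score.

    Each example has one chosen score and several rejected scores. A tie
    with the best rejected score is counted as a miss (an assumption).
    """
    hits = 0
    for example in examples:
        hits += example["chosen_score"] > max(example["rejected_scores"])
    return hits / len(examples)


examples = [
    {"chosen_score": 2.3, "rejected_scores": [1.1, 0.4, 1.9]},  # hit
    {"chosen_score": 0.8, "rejected_scores": [0.2, 1.5, 0.1]},  # miss
    {"chosen_score": 1.7, "rejected_scores": [1.6, 0.9, 1.2]},  # hit
    {"chosen_score": 0.5, "rejected_scores": [0.5, 0.1, 0.3]},  # tie, counted as miss
]

accuracy = best_of_n_accuracy(examples)
print(f"best_of_four_accuracy={accuracy:.2f}")
```

With four candidates per prompt, random guessing lands near 25 percent instead of the 50 percent baseline of pairwise accuracy, so the same numeric score means something different across the two setups.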

Beyond scalar reward heads

The scalar Bradley-Terry head is a common baseline, but the space has widened.

Generative reward models (LLM-as-judge). Instead of a scalar head, an LM can read candidates and emit a verdict, optionally with a rationale. Mahan et al. report improvements from rationale generation and vote aggregation in their studied setup; these are design choices to evaluate, not universal guarantees [7].
Process reward models (PRMs). For multi-step reasoning, scoring only the final answer is a weak signal. PRMs score each step of a chain of thought, giving denser supervision. "Let's Verify Step by Step" showed step-level supervision beating outcome-only reward models on hard math [8].
Verifiers and RLVR. When correctness is checkable, such as a final math answer or passing unit tests, verifiable rewards can reduce dependence on a learned preference proxy. They don't eliminate misspecified tests, partial specifications, or gaming of the verifier [9]. The next chapters cover this family.

These don't retire the scalar reward model, and they don't all share its Bradley-Terry objective. The reusable lesson is the evaluation discipline: inspect the signal's coverage, test it on outputs produced by the system being optimized, and watch for optimization exploiting its blind spots.
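A verifiable reward can be as small as a string comparison, which also makes its blind spots easy to see. The sketch below is a toy exact-match checker; `exact_match_reward` and its normalization rules are illustrative assumptions, not a published verifier.

```python
def exact_match_reward(model_answer, reference_answer):
    """Return 1.0 when the normalized answers match exactly, else 0.0.

    Normalization here only lowercases and collapses whitespace, so
    equivalent answers written differently still score zero.
    """
    def normalize(text):
        return " ".join(text.strip().lower().split())

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


reference = "42"
for answer in ["42", "  42 ", "forty-two", "42.0"]:
    print(f"answer={answer!r} reward={exact_match_reward(answer, reference)}")
```

The last two answers are arguably correct yet receive zero reward. That is the partial-specification problem in miniature: the checker is exact about what it tests and silent about everything else.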

When an explicit reward model is worth the cost

Use one when:

you want PPO-style online optimization
you want to score many candidate outputs with one scalar model
you want to inspect and audit the preference signal directly

Start with DPO when:

you have a clean offline preference dataset
you want the simpler baseline first
you don't need an explicit learned judge in the loop

That trade-off is why DPO is a strong offline-preference baseline: it removes the separate reward-model training stage [2]. But it doesn't supply a reusable scalar judge for PPO or candidate scoring. This final toy gate combines several checks, with project-specific thresholds, to block optimization when only one static slice passes.

reward_readiness_gate.py

checks = {
    "grouped_split_has_no_prompt_overlap": True,
    "length_matched_accuracy": 0.84,
    "fresh_policy_human_accuracy": 0.78,
    "minimum_required_accuracy": 0.80,
}
ready = (
    checks["grouped_split_has_no_prompt_overlap"]
    and checks["length_matched_accuracy"] >= checks["minimum_required_accuracy"]
    and checks["fresh_policy_human_accuracy"] >= checks["minimum_required_accuracy"]
)

print(f"static_slice_passes={checks['length_matched_accuracy'] >= checks['minimum_required_accuracy']}")
print(f"fresh_policy_passes={checks['fresh_policy_human_accuracy'] >= checks['minimum_required_accuracy']}")
print(f"optimize_against_reward={ready}")

Readiness-gate output

static_slice_passes=True
fresh_policy_passes=False
optimize_against_reward=False

Common pitfalls

Symptom: loss falls but held-out rankings barely improve

Cause: chosen and rejected responses are nearly equivalent, so the pair gives little ranking signal.
Fix: audit pair strength before training. Keep pairs where the preference is clear, policy-relevant, and tied to the same prompt.

Symptom: one labeler style dominates the reward model

Cause: inconsistent or narrow annotator preferences become inconsistent rewards.
Fix: measure agreement, review disagreements, and separate policy rules from personal style before training.

Symptom: PPO reward rises while human preference gets worse

Cause: the policy has found a shortcut in the reward model under distribution shift.
Fix: add fresh policy-output evaluations, length-matched checks, adversarial probes, and human preference gates before trusting the scalar reward.

Symptom: product dashboards treat reward as truth

Cause: reward is being mistaken for the business or human objective itself.
Fix: report reward beside held-out preference, refusal quality, helpfulness, safety, and downstream product metrics.

Practice checkpoints

Mastery check

Check that you can:

Explain reward modeling as its own training stage between SFT and PPO-style RLHF.
Describe both standard and conversational chosen/rejected preference formats.
Derive the Bradley-Terry loss from the chosen-minus-rejected reward margin.
Calculate pair accuracy and margin from a small batch of reward scores.
Validate binary preference pairs and split candidate groups without prompt leakage.
Explain why reward centering matters even though pairwise preferences are invariant to additive shifts, while score scale isn't.
Diagnose reward hacking as a case of Goodhart's law, and explain why a KL penalty limits drift without certifying reward validity.
Evaluate a reward model against an external benchmark like RewardBench, and explain why leaderboard rank alone doesn't pick the best model for your PPO run.
Place scalar Bradley-Terry heads next to generative judges, process reward models, and verifiable rewards without claiming they use the same loss.
Decide when DPO is the simpler baseline and when an explicit reward model is worth the extra complexity.

Evaluation rubric

Strong: You can compute pair margins, explain Bradley-Terry loss, diagnose shortcut learning under distribution shift, and choose between explicit reward modeling and DPO from requirements.
Partial: You can describe chosen/rejected data and the scalar head, but you still treat pair accuracy as enough proof the reward model is safe to optimize against.
Weak: You talk about reward as if it were ground truth, or you can't explain why PPO needs off-train checks and KL control.

Follow-up questions

Pair accuracy is 92 percent on held-out pairs. Why is that still weak evidence for PPO readiness?

Because held-out pairs can still look like the training distribution. PPO changes the policy, so the reward model must rank fresh policy outputs well, not original preference pairs alone. You still need shortcut audits, fresh-sample checks, and an external benchmark or human comparison before trusting the reward signal.

When would you pay the extra complexity cost of an explicit reward model instead of starting with DPO?

Use an explicit reward model when you need PPO-style online optimization, candidate reranking, or an inspectable scalar preference signal inside a larger training loop. Evaluate DPO as the simpler offline baseline when you have a clean preference dataset and don't need a separate learned judge.

A reward model keeps preferring longer answers even when raters say they are bloated. What exactly should you change in evaluation first?

Build length-matched evaluation pairs and rescore fresh policy outputs. That reveals whether the model learned a real quality signal or a "longer is better" shortcut. If the gap remains, add human checks and adversarial probes before more policy optimization.

RewardBench rank and same-family transfer disagree. Which one should drive your PPO choice?

Start with RewardBench as a broad filter, but validate transfer on your own policy and training setup when PPO is the real target. RewardBench 2's same-lineage result is evidence from its tested setups, not a universal selection rule.

Next Step

Continue to RLHF & DPO Alignment

You isolated the reward model and made its dataset, loss, and evaluation concrete. Next you'll plug that model back into the larger alignment picture and compare the full RLHF stack against DPO and newer preference-optimization variants.


References

[1] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022. https://arxiv.org/abs/2203.02155

[2] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. https://arxiv.org/abs/2305.18290

[3] Hugging Face (2026). Reward Modeling. https://huggingface.co/docs/trl/reward_trainer

[4] Gao, L., Schulman, J., & Hilton, J. (2023). Scaling Laws for Reward Model Overoptimization. https://arxiv.org/abs/2210.10760

[5] Lambert, N., et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling. https://arxiv.org/abs/2403.13787

[6] Malik, S., et al. (2025). RewardBench 2: Advancing Reward Model Evaluation. https://arxiv.org/abs/2506.01937

[7] Mahan, D., et al. (2024). Generative Reward Models. https://arxiv.org/abs/2410.12832

[8] Lightman, H., et al. (2023). Let's Verify Step by Step. ICLR. https://arxiv.org/abs/2305.20050

[9] Lambert, N., et al. (2024). Tülu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv preprint. https://arxiv.org/abs/2411.15124

Back to Topics

LearnAdvanced Training & AdaptationReward Modeling from Preference Data

🛡️HardAlignment & Safety

Reward Modeling from Preference Data

20 min read

Learning path

Step 104 of 158 in the full curriculum

LoRA & Parameter-Efficient Tuning RLHF & DPO Alignment

LoRA adapts a model's behavior cheaply. Preference alignment starts from the next training problem: once a model can answer, how do we teach it which answer people prefer?

What reward modeling is trying to learn

A scalar reward model doesn't generate text. It scores text.

Given a prompt x and two candidate answers:

y+ chosen by the labeler
y- rejected by the labeler

the reward model should assign:

text

r(x, y+) > r(x, y-)

That scalar score is later useful in two different ways:

as a scalar reward signal for PPO-style RLHF^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}
as an inspectable ranking signal when you want to compare policy outputs

What the dataset looks like

The core supervision format is a preference pair.

Standard format

preference_pair.json

{
  "prompt": "User requests temporary admin access for a migration. What should the assistant do?",
  "chosen": "Open an access-review ticket and cite policy P-7 before approval.",
  "rejected": "Grant admin for tonight and ask the user to clean it up tomorrow."
}

Conversational format

chat_preference_pair.json

{
  "chosen": [
    {"role": "user", "content": "User requests temporary admin access for a migration. What should the assistant do?"},
    {"role": "assistant", "content": "Open an access-review ticket and cite policy P-7 before approval."}
  ],
  "rejected": [
    {"role": "user", "content": "User requests temporary admin access for a migration. What should the assistant do?"},
    {"role": "assistant", "content": "Grant admin for tonight and ask the user to clean it up tomorrow."}
  ]
}

A pair contract before training

A row isn't ready merely because it has chosen and rejected columns. For every binary preference pair, enforce:

the same prompt, system message, tool context, and rendering template on both candidates
two different candidate answers and a definite preference label
separate handling for ties, abstentions, and ambiguous or low-agreement labels
provenance such as source prompt, generator checkpoint, sampling settings, and labeling batch

preference_pair_contract.py

from collections import Counter

pairs = [
    {"id": "a", "prompt_left": "admin?", "prompt_right": "admin?", "chosen": "Escalate.", "rejected": "Approve.", "label": "chosen"},
    {"id": "b", "prompt_left": "admin?", "prompt_right": "admin?", "chosen": "Escalate.", "rejected": "Escalate.", "label": "chosen"},
    {"id": "c", "prompt_left": "admin?", "prompt_right": "source?", "chosen": "Escalate.", "rejected": "Cite source.", "label": "chosen"},
    {"id": "d", "prompt_left": "source?", "prompt_right": "source?", "chosen": "Cite.", "rejected": "Refuse.", "label": "tie"},
]

def rejection_reason(pair):
    if pair["label"] != "chosen":
        return "tie_or_abstention"
    if pair["prompt_left"] != pair["prompt_right"]:
        return "context_mismatch"
    if pair["chosen"] == pair["rejected"]:
        return "identical_candidates"
    return None

reasons = Counter(reason for pair in pairs if (reason := rejection_reason(pair)))
kept = [pair["id"] for pair in pairs if rejection_reason(pair) is None]

print(f"kept={kept}")
print(f"rejected={dict(sorted(reasons.items()))}")

Pair-contract audit output

kept=['a']
rejected={'context_mismatch': 1, 'identical_candidates': 1, 'tie_or_abstention': 1}

grouped_preference_split.py

pairs = [
    {"pair_id": "a-b", "prompt_id": "support-17"},
    {"pair_id": "a-c", "prompt_id": "support-17"},
    {"pair_id": "d-e", "prompt_id": "safety-04"},
    {"pair_id": "f-g", "prompt_id": "access-09"},
]
eval_prompt_ids = {"support-17"}

train = [pair for pair in pairs if pair["prompt_id"] not in eval_prompt_ids]
evaluation = [pair for pair in pairs if pair["prompt_id"] in eval_prompt_ids]
overlap = {pair["prompt_id"] for pair in train} & {pair["prompt_id"] for pair in evaluation}

assert not overlap
print(f"train_pairs={len(train)} eval_pairs={len(evaluation)}")
print(f"prompt_overlap={sorted(overlap)}")

Grouped split output

train_pairs=2 eval_pairs=2
prompt_overlap=[]

The usual architecture

Most practical reward models aren't built from scratch. You start from a pretrained or SFT checkpoint and attach a scalar sequence-level head. Conceptually:

text

tokens | transformer hidden states | sequence representation | one scalar reward

sequence_length_gate.py

max_length = 1024
pairs = [
    {"id": "fits", "chosen_tokens": 412, "rejected_tokens": 390},
    {"id": "chosen_too_long", "chosen_tokens": 1088, "rejected_tokens": 401},
    {"id": "rejected_too_long", "chosen_tokens": 288, "rejected_tokens": 1030},
]

kept = [p["id"] for p in pairs if max(p["chosen_tokens"], p["rejected_tokens"]) <= max_length]
dropped = [p["id"] for p in pairs if p["id"] not in kept]

print(f"kept={kept}")
print(f"dropped_instead_of_truncated={dropped}")

Sequence-length gate output

kept=['fits']
dropped_instead_of_truncated=['chosen_too_long', 'rejected_too_long']

Bradley-Terry loss in one page

The classic formulation says the probability that the chosen response wins depends on the reward difference:

P(y^+ \succ y^- \mid x) = \sigma\big(r(x, y^+) - r(x, y^-)\big)

where σ is the sigmoid function.^{[1]Reference 1Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155}

The loss maximizes the log-probability of the observed preference:

\mathcal{L} = -\log \sigma\big(r(x, y^+) - r(x, y^-)\big)

If the chosen reward is much higher than the rejected reward, loss becomes small. If the model ranks them backwards, loss becomes large.
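To see why backwards rankings dominate the update, it can help to differentiate the loss with respect to the margin. The shorthand Δ = r(x, y+) - r(x, y-) is introduced here for convenience and isn't part of the notation above.

```latex
\mathcal{L}(\Delta) = -\log \sigma(\Delta) = \log\big(1 + e^{-\Delta}\big)

\frac{\partial \mathcal{L}}{\partial \Delta}
  = -\frac{e^{-\Delta}}{1 + e^{-\Delta}}
  = -\big(1 - \sigma(\Delta)\big)
  = -\sigma(-\Delta)
```

The gradient magnitude σ(-Δ) approaches 1 for badly misranked pairs, equals 1/2 at Δ = 0, and approaches 0 once a pair is ranked correctly with a large margin. Confidently correct pairs therefore contribute almost nothing to the update.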

reward_model_loss.py

import torch
import torch.nn.functional as F

chosen_rewards = torch.tensor([2.1, 0.8, 1.9])
rejected_rewards = torch.tensor([0.4, 1.0, 1.2])

margins = chosen_rewards - rejected_rewards
loss = -F.logsigmoid(margins).mean()
accuracy = (margins > 0).float().mean()

print("margins:", [round(float(x), 3) for x in margins])
print("reward_loss:", round(float(loss), 4))
print("pair_accuracy:", round(float(accuracy), 4))

Reward loss output

margins: [1.7, -0.2, 0.7]
reward_loss: 0.4564
pair_accuracy: 0.6667

reward_margin_curve.py

from math import exp, log1p

for margin in [-2.0, 0.0, 2.0]:
    loss = log1p(exp(-margin))
    print(f"margin={margin:+.1f} loss={loss:.4f}")

Margin curve output

margin=-2.0 loss=2.1269
margin=+0.0 loss=0.6931
margin=+2.0 loss=0.1269

That's the ranking core. Everything else in reward modeling is about making sure the data and evaluation around that loss are strong enough to be trusted.

What you should monitor during training

The TRL reward trainer logs more than loss for a reason.^{[3]Reference 3Reward Modeling.https://huggingface.co/docs/trl/reward_trainer}

Useful metrics include:

pair accuracy: how often chosen beats rejected
margin: average r(chosen) - r(rejected)
mean/min/max reward: catch drift or exploding scale
gradient norm: catch unstable updates
held-out preference quality: does the ranking still match human judgment off the train split?

Loss alone isn't enough. A reward model can lower training loss by overfitting to easy stylistic cues that don't hold up under real policy outputs.
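Most of the metrics listed above can be computed directly from a batch of reward scores. The sketch below uses hypothetical scores and a hypothetical helper name, `reward_batch_metrics`; gradient norm is left out because it needs the training loop itself.

```python
from statistics import mean

def reward_batch_metrics(chosen_rewards, rejected_rewards):
    """Summarize one batch of reward scores: accuracy, margin, and reward range."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    all_rewards = list(chosen_rewards) + list(rejected_rewards)
    return {
        "pair_accuracy": mean(1.0 if m > 0 else 0.0 for m in margins),
        "mean_margin": mean(margins),
        "reward_mean": mean(all_rewards),
        "reward_min": min(all_rewards),
        "reward_max": max(all_rewards),
    }

# Hypothetical scores for a train batch and a held-out batch.
train = reward_batch_metrics([2.5, 1.9, 3.1, 2.2], [0.3, 0.4, 0.8, 0.1])
held_out = reward_batch_metrics([1.1, 0.2, 0.9, 0.6], [0.7, 0.5, 0.4, 0.8])

for name, metrics in [("train", train), ("held_out", held_out)]:
    summary = " ".join(f"{key}={value:.2f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

In this toy data the train batch looks perfect while the held-out batch sits at chance, which is the kind of gap that loss alone would hide.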

Centering and calibration

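The Bradley-Terry loss depends only on reward differences, so adding a constant to every score leaves the loss unchanged, while rescaling the scores does change it. Operationally, that matters because: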

absolute reward level and score scale can drift over training
PPO-style optimization becomes sensitive to reward scale
long verbose answers can look better than they are if the reward model learned a shallow heuristic

reward_centering_invariance.py

from math import exp, log1p
from statistics import mean

chosen = [1.2, 0.7]
rejected = [0.2, 0.4]

def pair_loss(left, right):
    return mean(log1p(exp(-(a - b))) for a, b in zip(left, right))

shifted = ([score + 10 for score in chosen], [score + 10 for score in rejected])
scaled = ([score * 3 for score in chosen], [score * 3 for score in rejected])

print(f"base_loss={pair_loss(chosen, rejected):.4f}")
print(f"shifted_loss={pair_loss(*shifted):.4f}")
print(f"scaled_loss={pair_loss(*scaled):.4f}")

Centering invariance output

base_loss=0.4338
shifted_loss=0.4338
scaled_loss=0.1949
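Because the loss ignores additive shifts, trained scores can settle at any offset. One common mitigation, sketched here as an assumption rather than a prescribed recipe, is to standardize rewards against a fixed reference set before handing them to policy optimization. The names `reference_rewards` and `normalize` and the scores themselves are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical raw rewards scored on a fixed reference set of responses.
reference_rewards = [11.2, 10.7, 10.2, 10.4, 12.0]

center = mean(reference_rewards)
scale = pstdev(reference_rewards)

def normalize(raw_reward):
    """Shift to zero mean and rescale to unit variance using the reference statistics."""
    return (raw_reward - center) / scale

normalized = [normalize(score) for score in reference_rewards]

# Centering and positive rescaling keep the ordering, so pair accuracy is unchanged.
indices = range(len(reference_rewards))
same_ranking = sorted(indices, key=lambda i: reference_rewards[i]) == sorted(
    indices, key=lambda i: normalized[i]
)

print(f"center={center:.2f} scale={scale:.2f}")
print(f"normalized_mean={mean(normalized):.2f} normalized_std={pstdev(normalized):.2f}")
print(f"ranking_preserved={same_ranking}")
```

Freezing `center` and `scale` from a reference set, instead of recomputing them per batch, keeps the reward comparable across training steps.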

A tiny reward audit

reward_audit.py

from statistics import mean

pairs = [
    {"chosen_reward": 1.8, "rejected_reward": 0.7, "chosen_tokens": 36, "rejected_tokens": 19},
    {"chosen_reward": 2.4, "rejected_reward": 1.1, "chosen_tokens": 58, "rejected_tokens": 22},
    {"chosen_reward": 1.6, "rejected_reward": 0.3, "chosen_tokens": 33, "rejected_tokens": 14},
]

margins = [row["chosen_reward"] - row["rejected_reward"] for row in pairs]
accuracy = mean(margin > 0 for margin in margins)
length_gaps = [row["chosen_tokens"] - row["rejected_tokens"] for row in pairs]

print(f"pair_accuracy={accuracy:.2f}")
print(f"mean_margin={mean(margins):.2f}")
print(f"chosen_answers_longer={all(gap > 0 for gap in length_gaps)}")
print("next_check=build length-matched eval pairs")

Reward audit output

pair_accuracy=1.00
mean_margin=1.23
chosen_answers_longer=True
next_check=build length-matched eval pairs
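The follow-up check named in that audit can be sketched as a filter: keep only held-out pairs whose two responses have similar token counts, then recompute accuracy on that slice. The pairs, the 1.2 ratio, and the helper names are hypothetical and only illustrate the shape of the check.

```python
from statistics import mean

# Hypothetical held-out pairs with reward scores and token counts.
eval_pairs = [
    {"id": "p1", "chosen_reward": 1.8, "rejected_reward": 0.7, "chosen_tokens": 36, "rejected_tokens": 19},
    {"id": "p2", "chosen_reward": 0.9, "rejected_reward": 1.4, "chosen_tokens": 40, "rejected_tokens": 44},
    {"id": "p3", "chosen_reward": 1.2, "rejected_reward": 0.8, "chosen_tokens": 52, "rejected_tokens": 50},
    {"id": "p4", "chosen_reward": 2.0, "rejected_reward": 0.5, "chosen_tokens": 90, "rejected_tokens": 30},
]
max_length_ratio = 1.2  # illustrative tolerance, not a recommended value

def length_matched(pair):
    """True when the longer response is at most max_length_ratio times the shorter one."""
    longer = max(pair["chosen_tokens"], pair["rejected_tokens"])
    shorter = min(pair["chosen_tokens"], pair["rejected_tokens"])
    return longer / shorter <= max_length_ratio

def pair_accuracy(pairs):
    return mean(1.0 if p["chosen_reward"] > p["rejected_reward"] else 0.0 for p in pairs)

matched = [p for p in eval_pairs if length_matched(p)]
matched_ids = [p["id"] for p in matched]

print(f"all_pairs_accuracy={pair_accuracy(eval_pairs):.2f}")
print(f"length_matched_ids={matched_ids}")
print(f"length_matched_accuracy={pair_accuracy(matched):.2f}")
```

If accuracy drops sharply on the length-matched slice, as it does in this toy data, the model may be leaning on length instead of quality.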

agreement_audit.py

votes = {
    "clear_safety": ["chosen", "chosen", "chosen"],
    "style_only": ["chosen", "rejected", "chosen"],
    "ambiguous_refusal": ["chosen", "rejected", "tie"],
}

for pair_id, labels in votes.items():
    chosen_share = labels.count("chosen") / len(labels)
    status = "train" if chosen_share == 1.0 else "review_or_hold_out"
    print(f"{pair_id}: chosen_share={chosen_share:.2f} status={status}")

Agreement audit output

clear_safety: chosen_share=1.00 status=train
style_only: chosen_share=0.67 status=review_or_hold_out
ambiguous_refusal: chosen_share=0.33 status=review_or_hold_out

The real evaluation question

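Static pair accuracy is not the goal. What matters is whether the reward still tracks human preference on fresh policy outputs. The warning pattern during optimization looks like this:

the reward rises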
held-out human preference stops rising
human raters see longer, repetitive, or otherwise worse answers

If you don't monitor that gap, policy optimization can amplify the shortcut. The threshold below is illustrative; set release gates from your evaluation design and risk tolerance.

fresh_policy_gate.py

evaluation = {
    "static_held_out_pairs": {"accuracy": 0.92, "human_reviewed": False},
    "fresh_policy_pairs": {"accuracy": 0.64, "human_reviewed": True},
}
minimum_fresh_accuracy = 0.80
ppo_ready = evaluation["fresh_policy_pairs"]["accuracy"] >= minimum_fresh_accuracy

print(f"static_accuracy={evaluation['static_held_out_pairs']['accuracy']:.2f}")
print(f"fresh_accuracy={evaluation['fresh_policy_pairs']['accuracy']:.2f}")
print(f"ppo_ready={ppo_ready}")

Fresh-policy gate output

static_accuracy=0.92
fresh_accuracy=0.64
ppo_ready=False
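The gap between reward and human preference can also be tracked across policy checkpoints. The sketch below flags any checkpoint where mean reward rose but the human win rate failed to improve by a minimum amount. All numbers, field names, and the `min_human_gain` tolerance are hypothetical.

```python
# Hypothetical per-checkpoint measurements during policy optimization.
checkpoints = [
    {"step": 0, "mean_reward": 0.40, "human_win_rate": 0.50},
    {"step": 200, "mean_reward": 0.90, "human_win_rate": 0.58},
    {"step": 400, "mean_reward": 1.60, "human_win_rate": 0.59},
    {"step": 600, "mean_reward": 2.30, "human_win_rate": 0.54},
]
min_human_gain = 0.02  # illustrative tolerance, not a recommended value

flagged_steps = []
for previous, current in zip(checkpoints, checkpoints[1:]):
    reward_rose = current["mean_reward"] > previous["mean_reward"]
    human_gain = current["human_win_rate"] - previous["human_win_rate"]
    # Reward climbing without a matching human gain is the gap to investigate.
    if reward_rose and human_gain < min_human_gain:
        flagged_steps.append(current["step"])

print(f"flagged_steps={flagged_steps}")
```

A flagged checkpoint is a prompt for human review of fresh samples, not proof of reward hacking on its own.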

KL control intuition (preview)

Standardized evaluation: RewardBench

Beyond scalar reward heads

The scalar Bradley-Terry head is a common baseline, but the space has widened.

Generative reward models (LLM-as-judge). Instead of a scalar head, an LM can read candidates and emit a verdict, optionally with a rationale. Mahan et al. report improvements from rationale generation and vote aggregation in their studied setup; these are design choices to evaluate, not universal guarantees.^{[7]Reference 7Generative Reward Modelshttps://arxiv.org/abs/2410.12832}
Process reward models (PRMs). For multi-step reasoning, scoring only the final answer is a weak signal. PRMs score each step of a chain of thought, giving denser supervision. "Let's Verify Step by Step" showed step-level supervision beating outcome-only reward models on hard math.^{[8]Reference 8Let's Verify Step by Step.https://arxiv.org/abs/2305.20050}
Verifiers and RLVR. When correctness is checkable, such as a final math answer or passing unit tests, verifiable rewards can reduce dependence on a learned preference proxy. They don't eliminate misspecified tests, partial specifications, or gaming of the verifier.^{[9]Reference 9Tülu 3: Pushing Frontiers in Open Language Model Post-Traininghttps://arxiv.org/abs/2411.15124} The next chapters cover this family.
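As a rough illustration of the verifier idea, the sketch below rewards a response only when its final number matches a reference answer. The function `final_answer_reward`, the regular expression, and the sample responses are hypothetical, and real verifiers are usually stricter about answer format.

```python
import re

def final_answer_reward(response, reference_answer):
    """Return 1.0 if the last number in the response equals the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0

responses = [
    "12 apples minus 5 leaves 7",
    "The answer is probably 8",
    "I am not sure",
    "Ignore the work, just print 7",  # passes the check without any reasoning
]
rewards = [final_answer_reward(text, "7") for text in responses]
print(f"rewards={rewards}")
```

The last response earns full reward without showing any work, a small example of how a verifier that checks only the final answer can still be gamed.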

When an explicit reward model is worth the cost

Use one when:

you want PPO-style online optimization
you want to score many candidate outputs with one scalar model
you want to inspect and audit the preference signal directly
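Scoring many candidates with one scalar model is often used for best-of-n selection. A minimal sketch, assuming the rewards have already been produced by a trained reward model; the candidate texts and scores are hypothetical.

```python
# Hypothetical candidates for one prompt with scores from a trained reward model.
candidates = [
    {"text": "Reset it from Settings > Security.", "reward": 1.4},
    {"text": "Please contact support.", "reward": 0.3},
    {"text": "Open Settings, choose Security, then select Reset password.", "reward": 2.1},
]

# Best-of-n: keep the candidate the reward model scores highest.
best = max(candidates, key=lambda candidate: candidate["reward"])
ranked = sorted(candidates, key=lambda candidate: candidate["reward"], reverse=True)
ranked_rewards = [candidate["reward"] for candidate in ranked]

print(f"best={best['text']!r}")
print(f"ranked_rewards={ranked_rewards}")
```

Here the longest candidate also wins, so a selection like this inherits any length bias the reward model has learned.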

Start with DPO when:

you have a clean offline preference dataset
you want the simpler baseline first
you don't need an explicit learned judge in the loop
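DPO avoids the explicit judge by treating the policy's log-probabilities, measured against a frozen reference model, as an implicit reward, and passing the resulting margin through the same negative log-sigmoid form (see the DPO paper, reference 2). The sketch below uses hypothetical summed log-probabilities and an illustrative `beta`.

```python
from math import exp, log1p

beta = 0.1  # illustrative coefficient scaling the implicit reward

# Hypothetical summed log-probabilities of each response under policy and frozen reference.
policy_logp = {"chosen": -42.0, "rejected": -47.0}
reference_logp = {"chosen": -44.0, "rejected": -45.0}

# Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
implicit_reward = {
    key: beta * (policy_logp[key] - reference_logp[key]) for key in policy_logp
}
margin = implicit_reward["chosen"] - implicit_reward["rejected"]
loss = log1p(exp(-margin))  # same -log(sigmoid(margin)) form as the reward-model loss

print(f"implicit_margin={margin:.2f} dpo_loss={loss:.4f}")
```

The pairwise structure is the same as in reward modeling. What changes is where the scores come from, which is why no separate scalar head is needed.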

reward_readiness_gate.py

checks = {
    "grouped_split_has_no_prompt_overlap": True,
    "length_matched_accuracy": 0.84,
    "fresh_policy_human_accuracy": 0.78,
    "minimum_required_accuracy": 0.80,
}
ready = (
    checks["grouped_split_has_no_prompt_overlap"]
    and checks["length_matched_accuracy"] >= checks["minimum_required_accuracy"]
    and checks["fresh_policy_human_accuracy"] >= checks["minimum_required_accuracy"]
)

print(f"static_slice_passes={checks['length_matched_accuracy'] >= checks['minimum_required_accuracy']}")
print(f"fresh_policy_passes={checks['fresh_policy_human_accuracy'] >= checks['minimum_required_accuracy']}")
print(f"optimize_against_reward={ready}")

Readiness-gate output

static_slice_passes=True
fresh_policy_passes=False
optimize_against_reward=False

Common pitfalls

Symptom: loss falls but held-out rankings barely improve

Cause: chosen and rejected responses are nearly equivalent, so the pair gives little ranking signal.
Fix: audit pair strength before training. Keep pairs where the preference is clear, policy-relevant, and tied to the same prompt.
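A crude first pass at auditing pair strength is to flag pairs whose chosen and rejected texts are nearly identical. The sketch below uses lexical similarity from Python's `difflib`. The pairs and the 0.9 threshold are hypothetical, and lexical overlap is only a proxy: two very different wordings can still be equally good answers.

```python
from difflib import SequenceMatcher

# Hypothetical preference pairs; the 0.9 threshold is illustrative, not a recommended value.
pairs = [
    {"id": "strong", "chosen": "Restart the router, then re-run the setup wizard.",
     "rejected": "I cannot help with that."},
    {"id": "near_duplicate", "chosen": "Restart the router and try again.",
     "rejected": "Restart the router and try again!"},
]
max_similarity = 0.9

statuses = {}
for pair in pairs:
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
    similarity = SequenceMatcher(None, pair["chosen"], pair["rejected"]).ratio()
    statuses[pair["id"]] = "review_weak_signal" if similarity > max_similarity else "keep"
    print(f"{pair['id']}: similarity={similarity:.2f} status={statuses[pair['id']]}")
```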

Symptom: one labeler style dominates the reward model

Cause: narrow or inconsistent annotator preferences turn into biased or noisy rewards.
Fix: measure agreement, review disagreements, and separate policy rules from personal style before training.

Symptom: PPO reward rises while human preference gets worse

Cause: the policy has found a shortcut in the reward model under distribution shift.
Fix: add fresh policy-output evaluations, length-matched checks, adversarial probes, and human preference gates before trusting the scalar reward.
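One simple adversarial probe is a padding test: score each response, score it again with uninformative filler appended, and flag cases where the padded version scores clearly higher. The sketch assumes both scores have already been produced by the reward model; the probe data and the `tolerance` value are hypothetical.

```python
# Hypothetical reward scores for original responses and the same responses with filler appended.
probes = [
    {"id": "refund", "original_reward": 1.2, "padded_reward": 1.9},
    {"id": "login", "original_reward": 0.8, "padded_reward": 0.7},
    {"id": "billing", "original_reward": 1.5, "padded_reward": 2.2},
]
tolerance = 0.1  # illustrative slack for scoring noise

# Filler adds no information, so a clearly higher padded score suggests a length shortcut.
exploitable = [
    probe["id"]
    for probe in probes
    if probe["padded_reward"] - probe["original_reward"] > tolerance
]
exploit_rate = len(exploitable) / len(probes)

print(f"exploitable={exploitable}")
print(f"exploit_rate={exploit_rate:.2f}")
```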

Symptom: product dashboards treat reward as truth

Cause: reward is being mistaken for the business or human objective itself.
Fix: report reward beside held-out preference, refusal quality, helpfulness, safety, and downstream product metrics.

Practice checkpoints

Mastery check

Check that you can:

Explain reward modeling as its own training stage between SFT and PPO-style RLHF.
Describe both standard and conversational chosen/rejected preference formats.
Derive the Bradley-Terry loss from the chosen-minus-rejected reward margin.
Calculate pair accuracy and margin from a small batch of reward scores.
Validate binary preference pairs and split candidate groups without prompt leakage.
Explain why reward centering matters even though the pairwise loss is invariant to additive shifts, and why the same invariance does not hold for score scale.
Diagnose reward hacking as a case of Goodhart's law, and explain why a KL penalty limits drift without certifying reward validity.
Evaluate a reward model against an external benchmark like RewardBench, and explain why leaderboard rank alone doesn't pick the best model for your PPO run.
Place scalar Bradley-Terry heads next to generative judges, process reward models, and verifiable rewards without claiming they use the same loss.
Decide when DPO is the simpler baseline and when an explicit reward model is worth the extra complexity.

Evaluation rubric

Strong: You can compute pair margins, explain Bradley-Terry loss, diagnose shortcut learning under distribution shift, and choose between explicit reward modeling and DPO from requirements.
Partial: You can describe chosen/rejected data and the scalar head, but you still treat pair accuracy as enough proof the reward model is safe to optimize against.
Weak: You talk about reward as if it were ground truth, or you can't explain why PPO needs off-train checks and KL control.

Follow-up questions

Pair accuracy is 92 percent on held-out pairs. Why is that still weak evidence for PPO readiness?

When would you pay the extra complexity cost of an explicit reward model instead of starting with DPO?

A reward model keeps preferring longer answers even when raters say they are bloated. What exactly should you change in evaluation first?

RewardBench rank and same-family transfer disagree. Which one should drive your PPO choice?

References

[1] Training Language Models to Follow Instructions with Human Feedback (InstructGPT). Ouyang, L., et al. · 2022 · NeurIPS 2022

[2] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov, R., et al. · 2023

[3] Reward Modeling. Hugging Face · 2026

[4] Scaling Laws for Reward Model Overoptimization. Gao, L., Schulman, J., & Hilton, J. · 2023

[5] RewardBench: Evaluating Reward Models for Language Modeling. Lambert, N., et al. · 2024

[6] RewardBench 2: Advancing Reward Model Evaluation. Malik, S., et al. · 2025

[7] Generative Reward Models. Mahan, D., et al. · 2024

[8] Let's Verify Step by Step. Lightman, H., et al. · 2023 · ICLR

[9] Tülu 3: Pushing Frontiers in Open Language Model Post-Training. Lambert, N., et al. · 2024 · arXiv preprint
