LeetLLM
© 2026 LeetLLM. All rights reserved.


RLHF & DPO Alignment

Understand the RLHF pipeline and DPO, including reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.


Reward modeling showed how to turn chosen/rejected responses into a scalar preference signal. RLHF and DPO ask what to do with that signal: train a separate reward-driven policy, or optimize the policy directly from preference pairs.

Imagine asking a customer-support AI whether a delayed order should be refunded, and it returns something fluent, confident, and wrong. Or asking about a return-policy exception and getting advice that sounds authoritative but violates company rules. InstructGPT frames this gap clearly: pretrained language models can be untruthful, toxic, or unhelpful even when their text is polished.[1]

That is the alignment problem. Pretraining optimizes next-token prediction, not direct human preference. A language model optimized to predict plausible text can still produce responses that sound useful whether or not they are helpful, truthful, or safe. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) are post-training methods that push a model toward responses preferred under a labeling policy. They don't, by themselves, prove truthfulness or safety outside that policy's coverage.

The gap SFT can't close

After instruction tuning, a model can follow instructions. But it may still generate harmful content, be dishonest, refuse reasonable requests, or produce verbose and unhelpful answers. This misalignment happens because next-token prediction doesn't inherently enforce a product's safety rules or a labeler's preference rubric. Predicting what comes next isn't the same as satisfying the behavior being evaluated.

Alignment optimizes for human preferences beyond task completion. It bridges the gap between next-token prediction (what's probable?) and helpful interaction (what's desirable?).

To see why SFT alone isn't enough, consider a customer-support bot given this context: "Policy: offer a refund after seven calendar days without delivery. Customer: My package hasn't arrived after 10 days. Can I get a refund?" The SFT model might generate two grammatically correct responses:

  • Response A: "I can help with that. Since your package is 10 days late, you're eligible for a refund under our 7-day late-delivery policy. Would you like me to process that now?"
  • Response B: "Packages sometimes take longer than expected. Please wait another week and contact us if it still hasn't arrived."

Response A follows the supplied policy. Response B is unhelpful and violates that stated policy. Both can look fluent. SFT teaches the model to imitate demonstrations; preference training supplies comparative signal when those distinctions are represented in the labeled data.[1]

That preference signal is what alignment adds.

Alignment gap diagram showing a refund prompt, two fluent SFT responses, human preference selecting the policy-correct response, and aligned behavior that combines usefulness, truthfulness, and safety.
SFT can make both answers fluent. Alignment adds the preference signal that separates policy-correct help from polished but wrong text.

RLHF: building a judge and an actor

RLHF (Reinforcement Learning from Human Feedback) is the classic large-scale preference-alignment pipeline popularized by InstructGPT: start from an SFT model, train a reward model from human comparisons, then optimize the policy with reinforcement learning.[1]

The key distinction is separation of roles. SFT gives the model baseline instruction-following behavior, the reward model approximates labeled preference, and PPO updates the policy while a KL term discourages large drift from the SFT reference.

Three-phase pipeline

RLHF runs in three distinct steps. First, collect demonstrations and perform Supervised Fine-Tuning (SFT) to establish baseline instruction-following. Then gather comparison data to train a proxy judge (the reward model). Finally, use reinforcement learning to optimize the language model against that judge.

Diagram showing the base model, the SFT model, generated response pairs (same prompt, two answers), and a human ranking the chosen response over the rejected one.

Reward models and the Bradley-Terry trick

Think of the reward model like a support-policy auditor comparing two draft replies. The auditor doesn't need to generate the reply; it needs to score the policy-correct, helpful answer above the vague or unsafe one.

The mathematical foundation is the Bradley-Terry model, which assumes the probability that a human prefers response $y_w$ over $y_l$ depends on their underlying "reward" scores:[1][2]

$$P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$$

This model captures the intuition that the larger the gap in quality between two responses, the more confident we are about which one a human would prefer.

Let's walk through a concrete example. Suppose the reward model scores Response A at 2.0 and Response B at 0.5 for the same refund prompt. The probability that a human prefers A over B is:

$$P(A \succ B) = \sigma(2.0 - 0.5) = \sigma(1.5) \approx 0.818$$

That means the model predicts an 81.8% chance the human picks A. If the human picked A, the log-likelihood of that observation is $\log(0.818) \approx -0.201$. If the human had instead picked B, the log-likelihood would be $\log(1 - 0.818) = \log(0.182) \approx -1.70$. Training maximizes this log-likelihood, so it pushes the reward model to assign higher probability to the choice the human actually made.

reward-models-and-the-bradley-terry-trick.py

```python
import math

def sigmoid(value: float) -> float:
    return 1 / (1 + math.exp(-value))

reward_a = 2.0
reward_b = 0.5
# Bradley-Terry: preference probability depends only on the reward gap.
preference_probability = sigmoid(reward_a - reward_b)
negative_log_likelihood = -math.log(preference_probability)

print(f"P(A preferred over B) = {preference_probability:.1%}")
print(f"NLL = {negative_log_likelihood:.3f}")
```

Bradley-Terry output

```text
P(A preferred over B) = 81.8%
NLL = 0.201
```

The reward model is trained to predict which of two responses a human would prefer. The loss function maximizes the log-likelihood of the observed preferences:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(r(x, y_w) - r(x, y_l)\right)\right]$$

where $y_w$ is the preferred (winning) response and $y_l$ is the rejected (losing) response for the same prompt $x$.
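As a minimal sketch of the loss above, the snippet below averages the negative log-sigmoid of reward gaps over a small batch. The reward values and the helper name `bradley_terry_loss` are illustrative assumptions, not outputs of a trained reward model.

```python
import math

def log_sigmoid(value: float) -> float:
    # Numerically stable log(sigmoid(value)).
    if value >= 0:
        return -math.log1p(math.exp(-value))
    return value - math.log1p(math.exp(value))

def bradley_terry_loss(pairs: list[tuple[float, float]]) -> float:
    # Each pair is (reward of chosen, reward of rejected) for one prompt.
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical reward scores; the last pair is ranked the wrong way round.
well_ordered = [(2.0, 0.5), (1.2, -0.3)]
with_mistake = well_ordered + [(-0.4, 1.1)]

loss_well_ordered = bradley_terry_loss(well_ordered)
loss_with_mistake = bradley_terry_loss(with_mistake)
print(f"well_ordered={loss_well_ordered:.3f}")
print(f"with_mistake={loss_with_mistake:.3f}")
```

A single mis-ordered pair dominates the batch average, which is why gradient steps concentrate on comparisons the reward model currently gets wrong.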

PPO and KL drift control

Once we have a reward model, we treat the language model as a policy $\pi_\theta$. We use Proximal Policy Optimization (PPO), a reinforcement learning algorithm designed to make clipped, conservative policy updates[3], to maximize the expected reward. The objective maximizes reward while penalizing deviation from the reference policy:

$$\begin{aligned} \max_\theta\;& \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[r(x, y)\right] \\ &- \beta \cdot \mathbb{E}_{x \sim D} \left[\text{KL}(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x))\right] \end{aligned}$$

Think of the Kullback-Leibler (KL) term as a drift budget. The policy may seek higher learned reward, while deviations from the reference cost reward. This can reduce exposure to regions where the reward model is unreliable; it can't prove the reward is correct or prevent every exploitable shortcut. InstructGPT applies a per-token KL penalty to mitigate reward-model overoptimization.[1]

To see why drift control matters, imagine the policy starts generating refund responses that begin with "I am an AI assistant designed to help with refunds." If the reward model accidentally scores that opening higher because it correlates with politeness in training data, the policy can amplify it. A KL term charges departures from the reference, while held-out human checks still determine whether behavior improved.

In the common conceptual picture of PPO-style RLHF, one update coordinates four model roles. Implementations can shard, offload, or share parts of these roles, so this is a compute-and-state picture rather than a universal memory layout:

  1. Actor/Policy ($\pi_\theta$): The model being trained
  2. Reference ($\pi_{\text{ref}}$): Frozen SFT model for KL penalty
  3. Reward Model ($r_\phi$): Frozen model that scores outputs
  4. Critic/Value ($V_\psi$): Predicts expected returns for advantage estimation

PPO-style RLHF loop showing actor, reference model, reward model, and critic, plus a single PPO update that combines reward, KL control, and value estimates.
PPO-style RLHF is expensive because one update coordinates four roles: actor, reference, reward model, and critic.

The calculation below isolates the shaped reward for sampled responses. It isn't a PPO trainer: a full implementation also estimates advantages, clips updates, and applies token-level masks.

kl_regularized_reward.py

```python
samples = [
    {"name": "near_reference", "reward": 1.20, "policy_logp": -2.1, "ref_logp": -2.2},
    {"name": "high_drift", "reward": 1.35, "policy_logp": -1.2, "ref_logp": -2.8},
]
beta = 0.20

for sample in samples:
    # Single-sample estimate of the KL term: log pi_theta(y|x) - log pi_ref(y|x).
    sampled_log_ratio = sample["policy_logp"] - sample["ref_logp"]
    shaped_reward = sample["reward"] - beta * sampled_log_ratio
    print(f"{sample['name']}: raw={sample['reward']:.2f} shaped={shaped_reward:.2f}")
```

KL-shaped reward output

```text
near_reference: raw=1.20 shaped=1.18
high_drift: raw=1.35 shaped=1.03
```

The higher raw score isn't automatically the preferred optimization target once policy drift is charged. In production, PPO implementations compute token-level KL penalties, estimate returns with a value model and often GAE, and separate collection from policy updates. The point here is the system shape: PPO-style RLHF generates fresh responses and coordinates policy, reference, reward, and value roles.[1][3]
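The token-level version mentioned above can be sketched as follows. One common convention, assumed here rather than taken from the cited sources, charges the KL estimate at every response token and adds the scalar reward-model score only at the final token. The function name `token_level_rewards` and all log-probabilities are made up for illustration.

```python
def token_level_rewards(policy_logps, ref_logps, sequence_reward, beta):
    # Per-token KL estimate: log pi_theta(token) - log pi_ref(token).
    rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Assumed convention: the scalar reward-model score lands on the last token.
    rewards[-1] += sequence_reward
    return rewards

policy_logps = [-0.9, -1.4, -0.3, -2.0]  # hypothetical per-token log-probs
ref_logps = [-1.0, -1.5, -1.1, -2.2]
rewards = token_level_rewards(policy_logps, ref_logps, sequence_reward=1.35, beta=0.20)

print([round(value, 2) for value in rewards])
print(f"total={sum(rewards):.2f}")
```

Summing the per-token rewards recovers the sequence-level shaped reward, so the two views agree; the token-level form simply tells the advantage estimator where in the response the drift was charged.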

Why RLHF is hard to run

Despite its power, the RLHF pipeline introduces practical challenges that make it expensive to run and hard to debug at scale.

The first problem is systems overhead. Standard PPO-style RLHF needs actor, reference, reward, and critic/value roles during a training cycle. Sharding and offload change resident memory, but not the computation and coordination burden. Each rollout batch also begins with current-policy generation before updates, so throughput becomes partly an inference problem.

Reinforcement learning also adds more knobs than supervised fine-tuning. PPO is sensitive to reward scale, the KL coefficient (beta), clipping threshold (epsilon), value-loss weighting, advantage estimation, and rollout quality. Weak drift control increases exposure to reward-model exploitation; overly strong control can suppress useful updates.[1][3]
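To make the clipping threshold concrete, here is a minimal sketch of PPO's clipped surrogate term for a single sampled token. The ratios and advantages are hypothetical, and a real trainer would average this over many tokens and add value and entropy terms.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    # ratio = pi_theta(token) / pi_old(token) for one sampled token.
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # PPO keeps the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

cases = [(1.05, 1.0), (1.50, 1.0), (0.50, -1.0), (1.50, -1.0)]
results = [clipped_surrogate(ratio, advantage) for ratio, advantage in cases]
for (ratio, advantage), value in zip(cases, results):
    print(f"ratio={ratio:.2f} advantage={advantage:+.1f} objective={value:+.2f}")
```

With a positive advantage, pushing the ratio past `1 + epsilon` earns no extra objective, so the incentive to keep moving disappears. With a negative advantage and a ratio that has already grown, the unclipped term is kept, so the penalty is not hidden.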

The reward model itself is a silent failure point. It was trained on a finite collection of human comparisons. When the improving policy starts generating responses outside that distribution, the proxy scores become unreliable. That is when overoptimization appears: the numeric reward keeps climbing while real human preference scores plateau or drop. In a customer-support setting, you might see the model start adding long disclaimers or repetitive apologies because those patterns happened to score high in the limited preference data.
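One way to watch for that divergence is to log the proxy reward next to a held-out human evaluation at each checkpoint. The sketch below flags checkpoints where the proxy rises while the human win rate falls. The checkpoint numbers, the `min_drop` threshold, and the function name are all hypothetical.

```python
# Hypothetical checkpoints: proxy reward from the reward model, and win rate
# from a held-out human evaluation against the SFT baseline.
checkpoints = [
    {"step": 0, "proxy_reward": 0.10, "human_win_rate": 0.50},
    {"step": 200, "proxy_reward": 0.85, "human_win_rate": 0.61},
    {"step": 400, "proxy_reward": 1.40, "human_win_rate": 0.63},
    {"step": 600, "proxy_reward": 1.95, "human_win_rate": 0.57},
]

def overoptimization_steps(rows, min_drop=0.02):
    flagged = []
    for previous, current in zip(rows, rows[1:]):
        reward_up = current["proxy_reward"] > previous["proxy_reward"]
        humans_down = previous["human_win_rate"] - current["human_win_rate"] >= min_drop
        if reward_up and humans_down:
            flagged.append(current["step"])
    return flagged

flagged_steps = overoptimization_steps(checkpoints)
print(f"flagged_steps={flagged_steps}")
```

A flag is a prompt to inspect samples by hand, not a verdict; small human-evaluation sets are noisy enough that a single drop can be chance.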

| Issue | Description |
| --- | --- |
| Complexity | Four logical model roles (policy, reference, reward, value), with memory affected by sharding and offload |
| Tuning surface | PPO depends on clip ratio, learning rate, value-loss coefficients, reward scale, and sample quality |
| Reward hacking | Policy exploits reward model weaknesses when it generates out-of-distribution outputs |
| Coverage gaps | Policy can regress on behaviors that are weakly represented in the reward data |
| Cost | Human preference annotation at scale is expensive; PPO training is compute-intensive |

DPO: the mathematical shortcut

PPO-style RLHF has a multi-role online loop. DPO instead targets fixed preference pairs with a simpler objective. It removes engineering components; it doesn't supply online exploration or guarantee the same outcome as every RLHF run.

The key insight

At first glance, DPO looks like a completely different alignment recipe. The key mathematical result in the DPO paper is narrower: under a KL-regularized reward-maximization objective and a Bradley-Terry-style preference model, an implicit reward can be parameterized through policy-to-reference log-probability ratios.[2] That relationship yields an offline pairwise loss without training a standalone reward model or running PPO during DPO training.

Here's the "aha" moment in plain English. Under those modeling assumptions, an implicit reward is encoded by how policy log-probabilities change relative to a reference model. DPO can therefore train directly on labeled pairs. It learns only from the coverage and quality of those comparisons unless you build a separate data-refresh loop.

The implicit reward is:

$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Where $\pi^*$ is the optimal policy for the stated objective, $\pi_{\text{ref}}$ is the reference model, $\beta$ is tied to the KL-regularization trade-off and scales the DPO logit, and $Z(x)$ is a normalization term that depends only on prompt $x$. DPO then trains a policy $\pi_\theta$ from pairwise preferences under this parameterization.[2]
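To see why the normalization term never has to be computed, take the difference of implicit rewards for two responses to the same prompt. The step below is a worked restatement that uses only the symbols already defined:

```latex
\begin{aligned}
r^*(x, y_w) - r^*(x, y_l)
  &= \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x)
   - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x) \\
  &= \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
   - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}
\end{aligned}
```

The Bradley-Terry probability depends only on this difference, so the $\beta \log Z(x)$ terms cancel and only policy-to-reference log-ratios remain.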

DPO loss with a worked example

By substituting this relationship back into the preference loss, DPO derives a loss function that depends only on the policy and reference model:

$$\begin{aligned} \mathcal{L}_{\text{DPO}} = -\mathbb{E}\Bigg[ \log \sigma\Bigg( &\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} \\ &- \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \Bigg) \Bigg] \end{aligned}$$

Where $y_w$ is the preferred response and $y_l$ is the rejected response for the same prompt $x$, and $\pi_{\text{ref}}$ is the reference model that provides the baseline distribution. The reference model defines the relative log-probability baseline and shapes the divergence trade-off; it isn't proof that outputs stay fluent, calibrated, or safe.

Let's trace the loss with actual numbers. Suppose for our refund prompt we have:

  • Chosen response (A): policy log-prob = -8.2, reference log-prob = -8.5
  • Rejected response (B): policy log-prob = -9.5, reference log-prob = -9.1
  • DPO scale / regularization parameter: $\beta = 0.1$

Step by step:

  1. Policy-to-reference log-ratio for chosen: $-8.2 - (-8.5) = 0.3$
  2. Policy-to-reference log-ratio for rejected: $-9.5 - (-9.1) = -0.4$
  3. Relative margin (before scaling): $0.3 - (-0.4) = 0.7$
  4. Scaled by beta (the logit inside the sigmoid): $0.1 \times 0.7 = 0.07$
  5. Sigmoid probability: $\sigma(0.07) \approx 0.517$
  6. Loss: $-\log(0.517) \approx 0.659$

The loss pushes the policy to increase that 0.7 margin. If the policy becomes more confident on the chosen response (say, its log-prob rises to -7.5 while the rejected log-prob stays at -9.5), the margin grows to 1.4, the scaled logit to 0.14, and the loss falls to about 0.626. As the margin keeps growing, the sigmoid probability approaches 1.0 and the loss approaches 0. In gradient terms, the update increases the likelihood of the chosen answer and decreases the likelihood of the rejected answer, with a weight that is largest when the implicit reward ranks the pair incorrectly and shrinks as the pair becomes well separated.

dpo_margin_from_logprobs.py

```python
from math import exp, log

chosen_policy, chosen_reference = -8.2, -8.5
rejected_policy, rejected_reference = -9.5, -9.1
beta = 0.1
# Relative margin: chosen log-ratio minus rejected log-ratio.
margin = (chosen_policy - chosen_reference) - (rejected_policy - rejected_reference)
logit = beta * margin
loss = -log(1 / (1 + exp(-logit)))

print(f"relative_margin={margin:.1f}")
print(f"scaled_logit={logit:.2f}")
print(f"dpo_loss={loss:.3f}")
```

DPO margin output

```text
relative_margin=0.7
scaled_logit=0.07
dpo_loss=0.659
```
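The derivative of $-\log \sigma(u)$ with respect to the logit $u$ is $-\sigma(-u)$, so each pair's update is weighted by $\sigma(-u)$. The sketch below evaluates that weight for three relative margins; the two extreme margins are hypothetical values chosen to show the trend.

```python
from math import exp

def sigmoid(value: float) -> float:
    return 1 / (1 + exp(-value))

beta = 0.1
# Relative margins: badly mis-ordered, the worked example (0.7), well separated.
margins = [-20.0, 0.7, 20.0]
weights = [sigmoid(-beta * margin) for margin in margins]

for margin, weight in zip(margins, weights):
    print(f"margin={margin:+.1f} gradient_weight={weight:.3f}")
```

Pairs the policy already separates contribute little gradient, while mis-ordered pairs contribute the most. That is also why noisy labels can pull disproportionately on training.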

The implementation of the DPO loss is straightforward. The runnable example below uses a tiny causal language model so the tensor shapes are real without downloading an external checkpoint. A production implementation would add padding masks, attention masks, distributed training, and tokenizer-specific details.

dpo-loss-with-a-worked-example.py

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

class TinyCausalLM(nn.Module):
    def __init__(self, vocab_size: int = 16, hidden_size: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> SimpleNamespace:
        hidden = self.embedding(input_ids)
        return SimpleNamespace(logits=self.projection(hidden))

def compute_log_probs(model, prompt_ids, response_ids) -> torch.Tensor:
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits
    # Logits at position t predict token t + 1, so shift left by one
    # to line up with the response tokens.
    response_logits = logits[:, prompt_ids.shape[-1] - 1:-1, :]
    log_probs = F.log_softmax(response_logits, dim=-1)
    token_log_probs = torch.gather(
        log_probs, dim=-1, index=response_ids.unsqueeze(-1)
    ).squeeze(-1)
    # Sequence log-probability: sum over response tokens (no padding here).
    return token_log_probs.sum(dim=-1)

def dpo_loss(
    policy_model: torch.nn.Module,
    ref_model: torch.nn.Module,
    batch: dict[str, torch.Tensor],
    beta: float = 0.1
) -> torch.Tensor:
    pi_w = compute_log_probs(policy_model, batch["prompt"], batch["chosen"])
    pi_l = compute_log_probs(policy_model, batch["prompt"], batch["rejected"])

    # The reference model is frozen, so no gradients are tracked for it.
    with torch.no_grad():
        ref_w = compute_log_probs(ref_model, batch["prompt"], batch["chosen"])
        ref_l = compute_log_probs(ref_model, batch["prompt"], batch["rejected"])

    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

torch.manual_seed(0)
policy = TinyCausalLM()
reference = TinyCausalLM()
reference.load_state_dict(policy.state_dict())
for param in reference.parameters():
    param.requires_grad_(False)

batch = {
    "prompt": torch.tensor([[1, 2], [1, 3]]),
    "chosen": torch.tensor([[4, 5], [4, 6]]),
    "rejected": torch.tensor([[7, 8], [7, 9]]),
}

# Nudge the policy toward the chosen tokens so it differs from the reference.
with torch.no_grad():
    policy.projection.bias[5] += 1.0
    policy.projection.bias[6] += 1.0

loss = dpo_loss(policy, reference, batch)
loss_is_scalar = loss.ndim == 0
loss_is_finite = bool(torch.isfinite(loss))

loss.backward()
# Sum of absolute gradient values (an L1-style magnitude, not the L2 norm).
grad_norm = sum(
    param.grad.abs().sum().item()
    for param in policy.parameters()
    if param.grad is not None
)

print(f"DPO loss: {loss.item():.3f}")
print(f"policy_grad_norm: {grad_norm:.3f}")
print(f"loss_is_scalar={loss_is_scalar}")
print(f"loss_is_finite={loss_is_finite}")
print(f"grad_norm_positive={grad_norm > 0}")
```

DPO loss output

```text
DPO loss: 0.648
policy_grad_norm: 1.784
loss_is_scalar=True
loss_is_finite=True
grad_norm_positive=True
```

Visible feedback: At initialization, a policy copied from its reference naturally produces logits near zero and loss near 0.693. After training begins, monitor margins together with held-out preference quality and output regressions. Near-zero logits alone don't diagnose weak pairs or a bad beta.

In production, DPO implementations also mask padding tokens and explicitly track the chosen-vs-rejected log-probability margin during training. Teams usually monitor response length too, because whole-sequence log-probability sums can create length bias when chosen and rejected responses have systematically different lengths.
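The masking and length points can be sketched without a model. The per-token log-probabilities below are hypothetical, and the per-token mean is shown only as a monitoring diagnostic, not as a change to the DPO objective.

```python
def masked_sequence_logp(token_logps, mask):
    # mask is 1 for real response tokens and 0 for padding.
    return sum(logp * keep for logp, keep in zip(token_logps, mask))

# Hypothetical per-token log-probs, both padded to length 6.
chosen_logps = [-0.5, -0.6, -0.4, -0.5, -0.6, -0.4]
chosen_mask = [1, 1, 1, 1, 1, 1]
rejected_logps = [-0.8, -0.9, -0.7, -3.0, -3.0, -3.0]  # last three are padding
rejected_mask = [1, 1, 1, 0, 0, 0]

chosen_sum = masked_sequence_logp(chosen_logps, chosen_mask)
rejected_sum = masked_sequence_logp(rejected_logps, rejected_mask)
chosen_mean = chosen_sum / sum(chosen_mask)
rejected_mean = rejected_sum / sum(rejected_mask)

print(f"sum:  chosen={chosen_sum:.1f} rejected={rejected_sum:.1f}")
print(f"mean: chosen={chosen_mean:.2f} rejected={rejected_mean:.2f}")
```

The longer chosen response has the lower summed log-probability even though each of its tokens is more likely on average. Without the mask, the padding positions would also leak into the rejected total.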

DPO training flow

Vanilla DPO training is operationally simpler than PPO-style RLHF. It applies an offline pairwise loss instead of coordinating rollouts, a learned reward model, and a value model. A reference model provides baseline log-probabilities while the active policy is updated from preference data.

Diagram showing Preference Data (prompt, chosen, rejected), DPO Training (standard SFT-like), Reference Model (frozen SFT), and Aligned Model.

DPO turns preference alignment into a supervised pairwise classification loss. You still need a frozen reference model and high-quality chosen/rejected pairs, but you no longer train a separate reward model or value function.
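The loop below is a toy illustration of that flow, not a language-model trainer. It assumes a categorical policy with one logit per candidate response, a single preference pair, and a hand-written gradient step; `learning_rate` and the starting logits are arbitrary choices.

```python
from math import exp, log

def sigmoid(value: float) -> float:
    return 1 / (1 + exp(-value))

beta = 0.1
learning_rate = 5.0
# Toy policy: one logit per candidate response; index 0 is chosen, 1 is rejected.
reference_logits = [0.0, 0.0]
policy_logits = list(reference_logits)  # policy starts as a copy of the reference

losses = []
for step in range(50):
    # The softmax normalizer cancels in the chosen-minus-rejected difference.
    policy_gap = policy_logits[0] - policy_logits[1]
    reference_gap = reference_logits[0] - reference_logits[1]
    dpo_logit = beta * (policy_gap - reference_gap)
    losses.append(-log(sigmoid(dpo_logit)))
    # d(loss)/d(dpo_logit) = -sigmoid(-dpo_logit); chain rule through beta.
    weight = beta * sigmoid(-dpo_logit)
    policy_logits[0] += learning_rate * weight  # raise the chosen logit
    policy_logits[1] -= learning_rate * weight  # lower the rejected logit

print(f"first_loss={losses[0]:.3f} last_loss={losses[-1]:.3f}")
```

The reference logits never change, no responses are sampled, and no reward or value model appears. The only training signal is the fixed pair, which is the operational difference from the PPO loop.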

Comparing RLHF and DPO

To better understand the trade-offs between the traditional RLHF pipeline and DPO, compare what each method optimizes and what it has to keep alive during training.[1][2]

RLHF vs DPO comparison showing RLHF as an online PPO loop with policy, reference, reward, and critic components, while DPO trains a policy against a frozen reference model on fixed preference pairs.
RLHF coordinates an online rollout loop with more moving pieces. Vanilla DPO keeps the frozen reference model but removes the separate reward model, critic, and PPO rollout loop.
| Feature | RLHF (PPO) | DPO |
| --- | --- | --- |
| Reward model | Train a separate reward model | No separate reward model; reward is implicit in the objective |
| Training loop | On-policy RL with rollouts | Offline pairwise loss on fixed comparisons |
| Logical model roles | Policy + ref + reward + critic | Policy + reference |
| Most expensive step | Sampling and scoring fresh responses | Forward/backward passes on fixed preference pairs |
| Stability | Sensitive to reward scale, KL weight, and PPO settings | Fewer interacting loops, but still sensitive to data and beta |
| Online exploration | Yes | No, not in vanilla DPO |
| Data dependence | Can keep collecting fresh comparisons during training | Limited by current preference coverage unless data is refreshed |
| Monitoring need | Reward drift, KL drift, value loss, rollout quality | Preference loss, margin growth, length bias |

DPO lowers system complexity and is a practical candidate baseline for offline pairwise preferences. That doesn't mean PPO is obsolete. If a team needs online exploration, fresh model-generated negatives, or a reward signal that changes as the model improves, RL-style alignment has capabilities fixed-dataset DPO doesn't.

Preference data: the foundation

Whether a team uses RLHF or DPO, preference-signal quality sets a major performance ceiling. In practice, collecting and cleaning that signal is often the slowest and most expensive part of the pipeline. Both methods start from pairwise preference data, but they use it differently: RLHF trains a reward model on those comparisons, while DPO optimizes the policy directly on the same comparisons.[1][2]

Data requirements

Both algorithms depend on high-quality preference data. A usable row needs more than chosen and rejected responses: it needs the same rendered prompt context for both, a clear non-tie label, and split provenance that prevents related comparisons from leaking into evaluation.

data-requirements.py

```python
preference_example = {
    "prompt_id": "damaged-order-17",
    "prompt": "A customer says their order arrived damaged. What should I do?",
    "chosen": "I'm sorry to hear that. I can start a replacement or refund right away. "
              "Which would you prefer?",
    "rejected": "Things sometimes get damaged in shipping. You can file a claim "
                "with the carrier if you want.",
    "label": "chosen"
}

# Minimal row-level checks before a pair is allowed into training data.
required_fields = {"prompt_id", "prompt", "chosen", "rejected", "label"}
fields_present = required_fields <= preference_example.keys()
distinct_responses = preference_example["chosen"] != preference_example["rejected"]
binary_label = preference_example["label"] in {"chosen", "rejected"}

print(f"fields_present={fields_present}")
print(f"distinct_responses={distinct_responses}")
print(f"binary_label={binary_label}")
print("next_gate=group prompt_id before train/eval split")
```

Preference triple output

```text
fields_present=True
distinct_responses=True
binary_label=True
next_gate=group prompt_id before train/eval split
```

Annotation quality guidelines

Poorly annotated data can lead to undesirable behavior or failure to generalize the intended policy. That is why preference annotation needs a clear rubric, disagreement handling, and held-out checks.

| Factor | Good Practice | Bad Practice |
| --- | --- | --- |
| Annotator agreement | Clear rubric, adjudication on disagreements | Ambiguous rubric with unresolved disagreement |
| Preference clarity | Keep labeler-consistent judgments, including hard pairs | Treat unresolved ties or disagreement as clean binary labels |
| Diversity | Cover edge cases, refusals, creativity | Only easy instructions |
| Source and split provenance | Record generators; group related pairs by prompt before splitting | Mix related comparisons across train and evaluation |

Preference data quality dashboard showing raw response pairs filtered through rubric clarity, disagreement handling, label agreement, diversity, and model-source checks before becoming training-ready pairs.
Preference data quality is a filtering problem. Clear rubrics, grouped splits, diverse cases, and fresh-policy checks distinguish usable signal from annotation artifacts.
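A disagreement-handling step can be sketched as a small triage function. The three-annotator setup, the `min_agreement` threshold of 0.6, and the `triage` helper are hypothetical choices for illustration; a real rubric would set its own rules.

```python
from collections import Counter

# Hypothetical raw labels: three annotators per pair, "tie" allowed.
raw_labels = {
    "pair-1": ["chosen", "chosen", "chosen"],
    "pair-2": ["chosen", "rejected", "chosen"],
    "pair-3": ["chosen", "rejected", "tie"],
}

def triage(votes, min_agreement=0.6):
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    # Keep only clear, non-tie majorities; everything else goes to adjudication.
    if label != "tie" and agreement >= min_agreement:
        return "keep", label
    return "adjudicate", None

decisions = {pair_id: triage(votes) for pair_id, votes in raw_labels.items()}
for pair_id, (status, label) in decisions.items():
    print(f"{pair_id}: {status} label={label}")
```

Routing split votes to adjudication, rather than forcing a majority label, keeps unresolved disagreement from entering training as if it were a clean binary preference.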

Preference data from another generator isn't automatically invalid. The risk is coverage mismatch: labels may compare styles or capabilities the target policy rarely produces, while ignoring its actual failure modes. Record each generator, evaluate on fresh target-policy outputs, and include current-policy samples when distribution shift matters.

prompt_group_split.py
pairs = [
    {"prompt_id": "p1", "pair_id": "a-b"},
    {"prompt_id": "p1", "pair_id": "a-c"},
    {"prompt_id": "p2", "pair_id": "d-e"},
]
eval_prompts = {"p1"}
train_prompts = {row["prompt_id"] for row in pairs if row["prompt_id"] not in eval_prompts}
overlap = train_prompts & eval_prompts

assert not overlap
print(f"eval_pairs={sum(row['prompt_id'] in eval_prompts for row in pairs)}")
print(f"train_pairs={sum(row['prompt_id'] in train_prompts for row in pairs)}")
print(f"prompt_overlap={sorted(overlap)}")
Prompt-group split output
eval_pairs=2
train_pairs=1
prompt_overlap=[]

While data quality matters more than raw count, dataset size still changes what the model can learn.

What scale changes

  • A narrow domain can improve from a relatively small, carefully curated preference set.
  • A general assistant needs much broader coverage across refusals, tool use, reasoning, style, and failure cases.
  • If annotators keep disagreeing, more labels can quantify uncertainty, but they won't turn an ambiguous rubric into clear training direction.

When alignment backfires

The most subtle failure mode in alignment is reward hacking: the model learns to exploit reward model weaknesses rather than genuinely improving. This happens because the reward model is merely a proxy for human preference, not a perfect representation of it.

When the language model is optimized against this proxy, it can discover edge cases where the proxy assigns high scores to degenerate outputs. The proxy becomes the target, and the target can stop representing true preference.

For example, if the reward model overweights length or surface formatting, policy optimization can amplify those traits faster than actual answer quality. The visible symptom is rising proxy reward without a matching improvement in held-out human preference.

| Reward Hack | Symptom | Detection |
| --- | --- | --- |
| Length gaming | Responses get progressively longer | Track avg response length during training |
| Sycophancy | Model agrees with user even when wrong | Test with incorrect user statements |
| Format exploitation | Model wraps everything in markdown/lists for higher scores | Compare formatted vs plain-text scores |
| Confidence mimicry | Model sounds confident even when uncertain | Test calibration on known-hard questions |

The verbosity bias

Aligned models can become wordy when human annotators equate length with quality. A long, rambling refund explanation might score higher than a concise, accurate one because it looks more thorough. If your reward data has this bias, the policy learns to pad responses.

Symptom: Average response length grows steadily during training while helpfulness scores plateau. Fix: Add length-matched preference evaluations and annotate concise versus padded answers explicitly. Length penalties or normalization are interventions to validate, because they can also punish necessary detail.
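A length-matched check can be sketched in a few lines. The example below assumes each preference pair records token counts for both responses; the field names, the toy data, and the 1.2 length-ratio cutoff are all hypothetical choices for illustration, not recommended values.

```python
def longer_win_rate(pairs):
    """Fraction of pairs where the preferred response is the longer one."""
    decided = [p for p in pairs if p["chosen_tokens"] != p["rejected_tokens"]]
    if not decided:
        return None
    wins = sum(p["chosen_tokens"] > p["rejected_tokens"] for p in decided)
    return wins / len(decided)

def length_matched(pairs, max_ratio=1.2):
    """Keep pairs whose token counts are within max_ratio of each other."""
    kept = []
    for p in pairs:
        shorter, longer = sorted((p["chosen_tokens"], p["rejected_tokens"]))
        if shorter > 0 and longer / shorter <= max_ratio:
            kept.append(p)
    return kept

# Hypothetical preference pairs with token counts only.
pairs = [
    {"chosen_tokens": 180, "rejected_tokens": 60},
    {"chosen_tokens": 150, "rejected_tokens": 70},
    {"chosen_tokens": 90, "rejected_tokens": 100},
    {"chosen_tokens": 110, "rejected_tokens": 100},
    {"chosen_tokens": 40, "rejected_tokens": 120},
]
matched = length_matched(pairs)

print(f"longer_win_rate_all={longer_win_rate(pairs):.2f}")
print(f"matched_pairs={len(matched)}")
print(f"longer_win_rate_matched={longer_win_rate(matched):.2f}")
```

A longer-wins rate well above 0.5 on the full set is a hint, not proof, that length is leaking into the labels. Evaluating on the matched subset removes most of that shortcut, so quality differences have to carry the comparison.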

Ambiguous labels and the 0.693 trap

If the policy begins equal to its reference, every DPO relative margin starts at zero and the loss starts near -log(0.5) = 0.693. That is expected, not evidence that an individual pair provides no gradient. At zero logit, the gradient of the binary loss with respect to the logit has magnitude 0.5, well away from zero; it shrinks toward zero only as the margin grows positive.

Ambiguous pairs still hurt: if different annotators or duplicate prompts point in conflicting directions, their updates can cancel in aggregate or teach arbitrary style preferences. Diagnose data quality from agreement, slices, and held-out behavior, not from initial loss alone.

dpo_zero_margin_is_not_zero_gradient.py
from math import exp, log

def sigmoid(value):
    return 1 / (1 + exp(-value))

for logit in [0.0, 2.0]:
    loss = -log(sigmoid(logit))
    # Derivative of -log(sigmoid(logit)) with respect to the logit.
    gradient = sigmoid(logit) - 1
    print(f"logit={logit:.1f} loss={loss:.3f} gradient={gradient:.3f}")
DPO zero-margin output
logit=0.0 loss=0.693 gradient=-0.500
logit=2.0 loss=0.127 gradient=-0.119

Symptom: Loss remains near 0.693 after meaningful training and held-out preference metrics don't improve. Fix: Check that the policy is updating, then inspect contradictory labels, unresolved ambiguity, prompt leakage, and missing task coverage before changing beta.
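One of the audits named in that fix, contradictory labels, can be sketched as a grouping pass over the dataset. This is a minimal illustration that assumes exact-match response strings and a `prompt_id` field; real data usually needs normalization or near-duplicate matching first, and the row contents here are made up.

```python
from collections import defaultdict

def find_conflicts(rows):
    """Group labels by prompt and unordered response pair, then report disagreements."""
    votes = defaultdict(list)
    for row in rows:
        key = (row["prompt_id"], frozenset((row["chosen"], row["rejected"])))
        votes[key].append(row["chosen"])
    conflicts = {}
    for (prompt_id, responses), winners in votes.items():
        if len(set(winners)) > 1:  # the same pair was labeled in both directions
            conflicts[(prompt_id, tuple(sorted(responses)))] = winners
    return conflicts

# Hypothetical rows; the second row flips the label of the first.
rows = [
    {"prompt_id": "p1", "chosen": "resp_a", "rejected": "resp_b"},
    {"prompt_id": "p1", "chosen": "resp_b", "rejected": "resp_a"},
    {"prompt_id": "p2", "chosen": "resp_c", "rejected": "resp_d"},
    {"prompt_id": "p2", "chosen": "resp_c", "rejected": "resp_d"},
]
conflicts = find_conflicts(rows)

print(f"conflicting_groups={len(conflicts)}")
for key, winners in conflicts.items():
    print(key, winners)
```

Conflicting groups are candidates for adjudication or removal. Left in place, their updates pull the same margin in opposite directions.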

Mitigation strategies

To combat reward hacking and ensure the model genuinely improves, alignment engineers employ several defensive techniques:

  1. Hold out human eval data: Check whether higher proxy reward also improves real human preference wins
  2. Track auxiliary metrics: Monitor length, refusal rate, formatting drift, and calibration rather than trusting one scalar score
  3. KL budgets: Cap total KL divergence from the reference model during training
  4. Refresh the scorer: Re-label or retrain the reward signal when current policy outputs drift too far from the scorer's training data
alignment_release_gate.py
metrics = {
    "proxy_reward_delta": 0.31,
    "held_out_human_win_delta": -0.04,
    "mean_length_delta_tokens": 48,
}
ready = metrics["held_out_human_win_delta"] > 0 and metrics["mean_length_delta_tokens"] < 20

print(f"proxy_improved={metrics['proxy_reward_delta'] > 0}")
print(f"human_preference_improved={metrics['held_out_human_win_delta'] > 0}")
print(f"release_ready={ready}")
Alignment release gate output
proxy_improved=True
human_preference_improved=False
release_ready=False
[Figure: Alignment failure guardrails: proxy reward and DPO margin can mislead, so teams compare proxy metrics against held-out human wins, watch length and formatting drift, and refresh preference data or reward models when distribution shifts.]
Alignment can fail in two common ways: reward-model exploitation and weak preference gradients. Good guardrails compare proxy movement against held-out human judgment and watch for drift in length, format, and data freshness.
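The KL-budget guardrail from the mitigation list can be sketched with a simple Monte Carlo estimate: for responses sampled from the current policy, average the difference between policy and reference sequence log-probabilities. The checkpoint names, log-probability values, and the 0.5-nat budget below are hypothetical placeholders, not recommended settings.

```python
from statistics import mean

def estimate_kl(policy_logprobs, reference_logprobs):
    """Monte Carlo estimate of KL(policy || reference) from policy-sampled responses."""
    return mean(p - r for p, r in zip(policy_logprobs, reference_logprobs))

KL_BUDGET = 0.5  # hypothetical cap, in nats per response

# Hypothetical sequence log-probs for three policy samples at two checkpoints:
# (log-probs under the policy, log-probs under the frozen reference).
checkpoints = {
    "step_100": ([-12.0, -9.5, -11.0], [-12.2, -9.6, -11.3]),
    "step_500": ([-10.0, -8.0, -9.0], [-11.5, -8.9, -10.1]),
}

results = {}
for name, (policy_lp, reference_lp) in checkpoints.items():
    kl = estimate_kl(policy_lp, reference_lp)
    results[name] = (kl, kl <= KL_BUDGET)
    print(f"{name} kl_estimate={kl:.3f} within_budget={kl <= KL_BUDGET}")
```

With only a handful of samples the estimate is noisy, so a real gate would average over many prompts before deciding to stop or roll back a run.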

Online DPO and iterative alignment

Vanilla DPO is an offline algorithm: it trains on a fixed dataset of chosen and rejected responses.[2] As the policy improves, those old comparisons become less representative of what the current model produces.

Several later papers study this online setting: sample fresh responses from the current policy, label those pairs with humans or a learned preference model, then apply an IPO-style update, or a closely related pairwise preference loss, on the new comparisons.[4]

[Figure: Online preference loop: prompts dataset, sample responses from current policy, human or preference model chooses winner, and build fresh pair (chosen and rejected).]

Conceptually, an online DPO loop has five steps:

  1. Sample two or more candidate responses from the current policy for each prompt.
  2. Ask humans or a preference model to choose the better candidate.
  3. Convert the result into a fresh (prompt, chosen, rejected) triple.
  4. Run a DPO-style update using the chosen reference-policy strategy.
  5. Repeat with new policy outputs so the data distribution follows the model as it changes.

There isn't one canonical "online DPO" algorithm, and reference-policy update choices differ across methods. The family resemblance is the important part: refresh preference data on-policy while keeping a pairwise preference loss instead of a full PPO loop.[4]
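The five steps above can be sketched as a control loop. Everything in this example is a stand-in: `sample_responses`, `judge`, and `dpo_style_update` are hypothetical placeholders for a real sampler, a real human or preference-model judge, and a real training step, and the `quality` field exists only so the toy judge has something to compare.

```python
def sample_responses(policy_version, prompt, n=2):
    """Hypothetical stand-in for sampling n candidates from the current policy."""
    return [
        {"text": f"[v{policy_version}] answer {i} to: {prompt}",
         "quality": policy_version + 0.1 * i}
        for i in range(n)
    ]

def judge(candidates):
    """Hypothetical stand-in for a human or preference model picking a winner."""
    ranked = sorted(candidates, key=lambda c: c["quality"], reverse=True)
    return ranked[0], ranked[-1]

def dpo_style_update(policy_version, triples):
    """Placeholder for a pairwise-loss training step; here it only bumps a version."""
    return policy_version + 1

prompts = ["order arrived damaged", "refund not received"]
policy_version = 0
history = []

for round_index in range(3):
    triples = []
    for prompt in prompts:
        # Steps 1-2: sample from the current policy, then pick a winner.
        winner, loser = judge(sample_responses(policy_version, prompt))
        # Step 3: build a fresh (prompt, chosen, rejected) triple.
        triples.append({"prompt": prompt, "chosen": winner["text"], "rejected": loser["text"]})
    # Step 4: update on the fresh pairs. Step 5: the next round samples from the new policy.
    policy_version = dpo_style_update(policy_version, triples)
    history.append(triples)
    print(f"round={round_index} fresh_pairs={len(triples)} policy_version={policy_version}")
```

The structural point is that each round's pairs come from the policy version produced by the previous round, which is what separates this loop from training on one fixed dataset.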

Online DPO advantages

  • Reduced distribution mismatch: The model learns from its own current outputs, not only from stale comparisons.
  • Harder negatives over time: As the policy improves, the rejected samples become stronger and more informative.
  • Keeps the DPO loss: You retain a simpler pairwise objective instead of moving all the way to PPO.

Beyond vanilla DPO

While DPO is the standard offline baseline, it still requires high-quality pairwise preference data $(y_w, y_l)$. That data is expensive to collect and often noisy, especially when annotators disagree or the difference between two good answers is subtle. Newer methods attempt to relax this requirement, improve the optimization objective, or combine training stages to reduce pipeline complexity.

From that broader design space, several practical variants and adjacent post-training methods have become common discussion points:

| Method | Key Difference |
| --- | --- |
| IPO (Identity Preference Optimization) | Replaces DPO's log-sigmoid loss with a squared loss on the preference margin, so it can't drive the margin to infinity. This adds explicit regularization against overfitting when preference labels are nearly deterministic.[5] |
| KTO (Kahneman-Tversky Optimization) | Eliminates the need for pairs entirely. It works with binary feedback (thumbs up/down) for individual outputs, treating alignment as maximizing the utility of "good" outputs while minimizing "bad" ones.[6] |
| ORPO (Odds Ratio Preference Optimization) | Combines SFT and preference alignment into a single training stage, removing the need for a separate reference model.[7] |
| SimPO (Simple Preference Optimization) | Drops the reference model and uses length-normalized average log-probability as the implicit reward, plus a target margin. Its objective directly addresses sequence-length dependence; the authors tune a larger beta range than DPO in their experiments.[8] |
| GRPO (Group Relative Policy Optimization) | Not an offline preference-optimization loss. It is an online RL method that compares sampled outputs for the same prompt and uses group-relative advantages instead of a learned critic. Reward signals may be programmatic or learned; DeepSeek-R1 is a prominent verifier-heavy reasoning example.[9][10] |

IPO, KTO, ORPO, and SimPO all still live in the offline preference-optimization family. GRPO sits on a different branch: it is online reinforcement learning, not a DPO variant. Programmatic correctness checks are one high-value reward source, not a requirement of the algorithm.
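To see why IPO's squared loss bounds the margin while DPO's log-sigmoid loss does not, it helps to evaluate both on the same relative margin. In this sketch `h` denotes the policy-versus-reference log-ratio of the chosen response minus that of the rejected one; the `beta` and `tau` values are illustrative assumptions, not tuned settings.

```python
from math import exp, log

def dpo_loss(h, beta=0.1):
    """DPO pairwise loss: -log(sigmoid(beta * h)), written in a stable form."""
    return log(1 + exp(-beta * h))

def ipo_loss(h, tau=0.1):
    """IPO pairwise loss: squared distance of h from the target margin 1 / (2 * tau)."""
    return (h - 1 / (2 * tau)) ** 2

for h in [0.0, 5.0, 50.0]:
    print(f"h={h:5.1f} dpo_loss={dpo_loss(h):.4f} ipo_loss={ipo_loss(h):.1f}")
```

The DPO loss keeps shrinking as `h` grows, so near-deterministic labels keep rewarding larger margins. The IPO loss is minimized at the finite target (5.0 with this `tau`) and penalizes overshooting it.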

For teams optimizing subjective assistant behavior from fixed pairs, DPO is a useful baseline candidate. If you only have binary feedback rather than strict pairs, KTO is designed for that shape. If memory is tight, ORPO and SimPO avoid a separate frozen reference model. If DPO overfits near-deterministic labels, IPO tests a bounded-margin alternative. If the task has strong programmatic verifiers, online RL methods such as GRPO become worth evaluating.
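SimPO's reference-free, length-normalized reward can be illustrated with a toy pair where the raw summed log-probability and the per-token average disagree. The log-probabilities, token counts, `beta`, and `gamma` below are hypothetical numbers chosen to expose that difference.

```python
from math import exp, log

def simpo_loss(chosen_logprob, chosen_len, rejected_logprob, rejected_len,
               beta=2.0, gamma=1.0):
    """SimPO-style loss: log-sigmoid on length-normalized rewards minus a target margin."""
    reward_chosen = beta * chosen_logprob / chosen_len
    reward_rejected = beta * rejected_logprob / rejected_len
    return log(1 + exp(-(reward_chosen - reward_rejected - gamma)))

# Hypothetical pair: a short chosen answer and a long rejected answer.
chosen_logprob, chosen_len = -30.0, 20      # average -1.5 per token
rejected_logprob, rejected_len = -60.0, 60  # average -1.0 per token

sum_margin = chosen_logprob - rejected_logprob
avg_margin = chosen_logprob / chosen_len - rejected_logprob / rejected_len
loss = simpo_loss(chosen_logprob, chosen_len, rejected_logprob, rejected_len)

print(f"sum_margin={sum_margin:.1f}")
print(f"avg_margin={avg_margin:.2f}")
print(f"simpo_loss={loss:.3f}")
```

The summed log-probability favors the chosen response only because it is shorter. After normalization the per-token margin is negative, so the loss stays high until the policy raises the per-token likelihood of the chosen answer.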

grpo_group_relative_advantage.py
from statistics import mean, pstdev

rewards = [1.0, 0.0, 0.5, 1.0]
center = mean(rewards)
scale = pstdev(rewards) or 1.0  # avoid dividing by zero when every reward ties
advantages = [(reward - center) / scale for reward in rewards]

print(f"group_mean={center:.3f}")
print(f"advantages={[round(value, 3) for value in advantages]}")
print(f"advantage_mean={mean(advantages):.3f}")
Group-relative advantage output
group_mean=0.625
advantages=[0.905, -1.508, -0.302, 0.905]
advantage_mean=0.000

Modern post-training family map

These names get mixed together too easily. Some are optimizers, some are reward-model families, and some are full pipeline patterns.

| Family | Type | Training signal | Best fit |
| --- | --- | --- | --- |
| DPO | offline preference optimization | chosen vs rejected pairs | baseline candidate for offline subjective preference pairs |
| IPO | offline preference optimization | chosen vs rejected pairs plus squared-loss margin | when preference labels are near-deterministic and DPO overfits[5] |
| ORPO | offline preference optimization | chosen vs rejected pairs plus odds-ratio objective | when you want SFT and preference learning in one stage[7] |
| KTO | offline preference optimization | binary good/bad labels | when telemetry or moderation labels exist but strict pairs don't[6] |
| SimPO | offline preference optimization | chosen vs rejected pairs, reference-free, length-normalized | when reference-model memory is tight and length bias is a concern[8] |
| GRPO | online RL | grouped sampled outputs plus reward functions or models | tasks where online sampled comparison is valuable, often with checkable outcomes[9][10] |
| ORM | reward-model family | one learned scalar score on final answer | when outcome-level reward labels are the appropriate signal |
| PRM | reward-model family | scalar scores on intermediate reasoning steps | search or long-chain reasoning where early bad steps should be pruned[11] |

Two clarifications keep taxonomy clean:

  1. PRM and ORM aren't alternatives to DPO in the same sense as ORPO or KTO. They are scorers that can feed RLHF or reasoning-time search loops.
  2. GRPO isn't a DPO variant. It's online reinforcement learning that uses group-relative advantages instead of a learned critic.[9]
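The difference between outcome and process scorers shows up when two candidate solutions are ranked both ways. In this sketch the step and final scores are made-up numbers standing in for reward-model outputs, and taking the minimum step score is one aggregation choice assumed for illustration; other aggregations are possible.

```python
def orm_score(candidate):
    """Outcome-style score: one scalar for the final answer."""
    return candidate["final"]

def prm_score(candidate):
    """Process-style score: aggregate per-step scores (minimum, assumed here)."""
    return min(candidate["steps"])

# Hypothetical scorer outputs for two candidate solutions to the same problem.
candidates = {
    "solution_a": {"steps": [0.9, 0.8, 0.9], "final": 0.7},
    "solution_b": {"steps": [0.9, 0.2, 0.9], "final": 0.8},
}

best_by_orm = max(candidates, key=lambda name: orm_score(candidates[name]))
best_by_prm = max(candidates, key=lambda name: prm_score(candidates[name]))

print(f"best_by_orm={best_by_orm}")
print(f"best_by_prm={best_by_prm}")
```

The outcome scorer prefers `solution_b` because its final answer scores higher, while the process scorer prefers `solution_a` because `solution_b` contains one weak intermediate step. Either way, these are scorers feeding a training or search loop, not training objectives in themselves.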

Use this shortcut:

  • offline subjective alignment: evaluate DPO as a baseline
  • offline binary feedback: consider KTO
  • single-stage preference plus SFT: consider ORPO
  • reference-free, length-bias-sensitive: consider SimPO
  • near-deterministic labels, DPO overfits: consider IPO
  • online verifiable reasoning: consider GRPO with programmatic verifier rewards
post_training_method_router.py
scenarios = {
    "paired_offline_feedback": "DPO baseline",
    "binary_offline_feedback": "KTO candidate",
    "online_checkable_reward": "GRPO candidate",
}

for signal, candidate in scenarios.items():
    print(f"{signal} -> {candidate}")
Method routing output
paired_offline_feedback -> DPO baseline
binary_offline_feedback -> KTO candidate
online_checkable_reward -> GRPO candidate

That map is also why the curriculum splits reward modeling, RLHF/DPO, and RLVR into separate lessons. The names overlap in conversation, but the actual training loops differ.

Constitutional AI and RLAIF

A parallel evolution in alignment reduces the amount of human labeling by replacing most pairwise annotations with AI feedback. Constitutional AI (CAI), introduced by Bai et al. (2022)[12], defines a written set of principles (a "constitution") and uses the model itself to:

  1. Critique its own outputs against each principle
  2. Revise the output to comply with the violated principle
  3. Rank candidate outputs under those principles to generate preference data

In the original Constitutional AI paper, those AI critiques and revisions are used in two stages.[12] First, revised responses from self-critique supervise an SFT-style phase. Then the model samples alternative answers, an AI judge ranks them under the constitution, a preference model is trained on those rankings, and an RL stage optimizes against that model.[12] The core advantage is scale: you can synthesize far more feedback from a constitution than by hiring humans to label every harmful example. Bai et al. show that this can train a more harmless, less evasive assistant with far fewer human labels, though the result still depends heavily on the quality of the constitution and the judge model.[12]
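The critique, revise, and rank roles can be sketched with rule-based stand-ins. This is not how the paper implements them: there, a language model performs each role. Here the keyword checks, canned revisions, and principle names are hypothetical, chosen only to show how one draft yields both an SFT example and a preference pair.

```python
# Hypothetical principles: each maps a name to a check that returns True on violation.
PRINCIPLES = {
    "no_blame": lambda text: "your fault" in text.lower(),
    "offer_remedy": lambda text: "refund" not in text.lower()
    and "replacement" not in text.lower(),
}

def critique(response):
    """Stand-in for model self-critique: list the violated principle names."""
    return [name for name, violated in PRINCIPLES.items() if violated(response)]

def revise(response, violations):
    """Stand-in for model revision: apply a canned fix per violated principle."""
    if "no_blame" in violations:
        response = response.replace("It was your fault. ", "")
    if "offer_remedy" in violations:
        response += " I can send a replacement or issue a refund."
    return response

def rank(candidates):
    """Stand-in for the AI judge: fewer violations ranks higher."""
    return sorted(candidates, key=lambda text: len(critique(text)))

prompt = "A customer says their order arrived damaged. What should I do?"
draft = "It was your fault. Packages get damaged sometimes."

violations = critique(draft)
revised = revise(draft, violations)

# Stage 1: the revised response becomes supervised fine-tuning data.
sft_example = {"prompt": prompt, "response": revised}

# Stage 2: ranked candidates become preference data for a preference model.
ranked = rank([draft, revised])
preference_pair = {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

print(f"violations={violations}")
print(f"revised={revised}")
print(f"chosen_is_revised={preference_pair['chosen'] == revised}")
```

Because the checks live in one place, changing the feedback policy means editing `PRINCIPLES` rather than retraining annotators, which is the auditability benefit of a written constitution.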

Constitutional AI is especially valuable when alignment requirements can be written as explicit rules, such as "cite retrieved sources when making factual claims" or "escalate high-risk refund and safety issues instead of sounding certain." A written constitution makes the feedback policy easier to audit and update than ad hoc human-labeling instructions.

Common pitfalls

  • Reward rises while human preference gets worse. Symptom: offline reward looks better, but held-out human wins flatten or drop. Fix: keep human eval held out, watch length and formatting drift, cap KL movement, and refresh the reward signal when policy outputs shift.
  • DPO loss stays near 0.693 after updates. Symptom: held-out pair quality also fails to improve. Fix: first verify gradients and optimizer steps, then audit contradictory labels, ties, leakage, and coverage rather than treating initialization loss as a diagnosis.
  • DPO improves style but misses safety boundaries. Symptom: outputs sound polished yet still cross policy lines. Fix: add targeted safety pairs, refusal-quality checks, and policy-specific evals instead of assuming generic preference data covers safety.
  • PPO updates make the model incoherent. Symptom: reward rises, but generations get unstable or repetitive. Fix: stop promotion, examine reward scale and drift, then tune KL/update size and refresh evaluation together rather than trusting proxy reward.

Practice checkpoints

Evaluation rubric

  • Foundational: Outline the RLHF pipeline from SFT to reward modeling to PPO.
  • Foundational: Explain why PPO needs a reference policy, reward model, value model, and KL constraint.
  • Intermediate: Compute a Bradley-Terry preference probability and a DPO loss from small log-probability numbers.
  • Intermediate: Compare RLHF's online rollout loop with DPO's offline pairwise objective.
  • Advanced: Diagnose reward hacking, length bias, weak preference pairs, and overtraining symptoms.
  • Advanced: Explain when KTO, ORPO, GRPO, ORM, and PRM belong in the post-training map.
  • Advanced: Explain how Constitutional AI and RLAIF reduce human labeling pressure while introducing judge-quality dependence.

What to remember

  1. Why alignment: Instruction following doesn't equate to helpful, safe, or honest behavior.
  2. RLHF pipeline: SFT to reward model to PPO. A common PPO setup coordinates four model roles during training.
  3. Reward model: Trained on pairwise preference comparisons to predict human judgments.
  4. PPO: Maximizes reward while keeping the model close to the reference using a KL constraint.
  5. DPO: Under its KL-regularized preference-model assumptions, optimizes an implicit reward without a separate reward model or PPO loop.
  6. Tradeoffs: RLHF supports online exploration but has more moving parts; DPO has a simpler offline loop but still depends on data and objective choices.
  7. Modern variants: KTO and ORPO change offline preference tuning requirements, while GRPO is often used for online training with verifiable rewards.
  8. Constitutional AI: RLAIF scales alignment by replacing much of the human labeling with principled AI feedback.
  9. Production reality: Evaluate DPO as an offline baseline when pair data fits the task. Evaluate online methods when fresh sampling or verifier feedback is central.

Follow-up questions

Try this quick exercise before moving on. Suppose you have a DPO training batch with the following log probabilities (all base-e logs):

| Response | Policy log-prob | Reference log-prob |
| --- | --- | --- |
| Chosen (A) | -6.2 | -6.5 |
| Rejected (B) | -7.8 | -7.1 |

With $\beta = 0.1$, compute the DPO loss by hand. Then answer: is the margin positive (the policy already prefers chosen over rejected relative to the reference) or negative?
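Once you have an answer, a few lines of Python can check it. The sketch uses the standard DPO pairwise loss, the negative log-sigmoid of beta times the relative margin, with the numbers from the table; the variable names are only for illustration.

```python
from math import exp, log

beta = 0.1
policy = {"chosen": -6.2, "rejected": -7.8}
reference = {"chosen": -6.5, "rejected": -7.1}

# Log-ratio of policy to reference for each response.
chosen_log_ratio = policy["chosen"] - reference["chosen"]
rejected_log_ratio = policy["rejected"] - reference["rejected"]

# Relative margin and the DPO loss -log(sigmoid(beta * margin)).
margin = chosen_log_ratio - rejected_log_ratio
loss = log(1 + exp(-beta * margin))

print(f"chosen_log_ratio={chosen_log_ratio:.3f}")
print(f"rejected_log_ratio={rejected_log_ratio:.3f}")
print(f"margin={margin:.3f}")
print(f"loss={loss:.3f}")
```

The margin comes out at 1.000 and the loss at about 0.644, slightly below the 0.693 starting value, so the policy already leans toward the chosen response relative to the reference.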

Next Step
Continue to Constitutional AI & Red Teaming

There, you'll understand how Constitutional AI reduces reliance on repeated human preference labeling through AI critique and ranking, and how automated red teaming stress-tests those safeguards.

References

[1] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.

[2] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

[3] Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms.

[4] Calandriello, D., et al. (2024). Human Alignment of Large Language Models through Online Preference Optimisation.

[5] Azar, M. G., Rowland, M., et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback.

[6] Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024.

[7] Hong, J., Lee, N., & Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model.

[8] Meng, Y., et al. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward.

[9] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

[10] Hugging Face (2026). GRPO Trainer.

[11] Lightman, H., et al. (2023). Let's Verify Step by Step. ICLR.

[12] Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint.