Reinforcement Learning Basics

Learn reinforcement learning through the access-review workflow from earlier lessons. Define an MDP, compute discounted returns and Bellman backups, implement value iteration and Q-learning, model abandonment risk, and connect policy gradients to LLM post-training.


Predicting whether an access-review request needs human review is a one-step decision. A real access assistant must choose a sequence: approve now, request evidence, or route to a reviewer; then react when evidence arrives or the user abandons the request.

Reinforcement learning (RL) studies that sequential decision problem. The agent chooses actions, the environment produces next states and rewards, and the objective is high total reward over an episode. One small access-review workflow gives every equation a concrete operational meaning.[1]

Figure: Access-review Markov decision process with new request branching to unsupported approval, evidence request, and human review, plus return bars showing certain evidence favors evidence request 59 to 54 but 25-percent abandonment flips the choice to review 54 over 30.75.
The MDP makes later consequences explicit. Requesting evidence narrowly wins when evidence always arrives, but measured abandonment risk lowers its expected return enough to make direct review the better policy.

You need Python, small NumPy arrays, and weighted averages. The rewards below are teaching scores, not a production business objective.

From Prediction to an Episode

Access request acct_10234 is stale and policy P-7 requires review unless strong evidence arrives. Three initial actions are available:

| Initial action | Immediate effect | Reward now | Later best outcome |
| --- | --- | --- | --- |
| auto_approve | Closes an unsupported required-review request | -120 | Terminal mistake |
| request_evidence | Asks for evidence before approving | -4 | Evidence may support an auto approval |
| human_review | Enters reviewer queue immediately | -18 | Reviewer can approve correctly |

An episode starts at new_request and ends at closed. Suppose evidence definitely arrives and rewards after the first action are:

| Route | Reward sequence | Meaning |
| --- | --- | --- |
| Approve immediately | [-120] | Fast, but violates policy |
| Request evidence, then approval | [-4, +70] | Small delay, supported resolution |
| Review, then approve | [-18, +80] | More handling cost, correct resolution |

RL compares routes with the discounted return

$$G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$$

where $0 \leq \gamma \leq 1$. Values below $1$ discount delayed reward; we use $\gamma = 0.90$.

discounted-return-sequences.py

```python
gamma = 0.90
routes = {
    "auto_approve": [-120],
    "request_evidence": [-4, 70],
    "human_review": [-18, 80],
}

def discounted_return(rewards):
    return sum((gamma**step) * reward for step, reward in enumerate(rewards))

for action, rewards in routes.items():
    print(f"{action:14} return={discounted_return(rewards):6.1f}")
```

Output

```text
auto_approve   return=-120.0
request_evidence return=  59.0
human_review   return=  54.0
```

With certain evidence arrival, request_evidence earns $-4 + 0.90(70) = 59$, just above human_review at $54$. The route decision depends on future consequences as well as immediate cost.

Define the MDP

A Markov Decision Process (MDP) supplies the pieces needed to score a policy:

  • States $S$: what is known at decision time.
  • Actions $A(s)$: valid decisions in a state.
  • Transitions $P(s' \mid s, a)$: probabilities of the next state.
  • Rewards $R(s, a, s')$: numeric consequence attached to a transition.
  • Discount $\gamma$: how strongly delayed consequences count.
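The five pieces above can be held in one small container. The sketch below is only one possible layout, not the lesson's own code: the `MDP` class, the `Outcome` alias, and the two-state `toy` example are hypothetical names introduced for illustration. Transitions and rewards are stored together as `(probability, next_state, reward)` tuples.

```python
from dataclasses import dataclass

# One possible outcome of taking an action: (probability, next_state, reward).
Outcome = tuple[float, str, float]

@dataclass(frozen=True)
class MDP:
    states: list[str]
    actions: dict[str, list[str]]                      # A(s)
    transitions: dict[tuple[str, str], list[Outcome]]  # P and R stored together
    gamma: float

    def validate(self) -> None:
        """Check that each (state, action) pair has probabilities summing to 1."""
        for state, available in self.actions.items():
            for action in available:
                outcomes = self.transitions[(state, action)]
                total = sum(probability for probability, _, _ in outcomes)
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"{state}/{action} probabilities sum to {total}")

# Hypothetical two-state example, unrelated to the access-review rewards.
toy = MDP(
    states=["start", "closed"],
    actions={"start": ["finish"]},
    transitions={("start", "finish"): [(1.0, "closed", 5.0)]},
    gamma=0.90,
)
toy.validate()
print("valid MDP with", len(toy.transitions), "transition entries")
```

Validating probabilities up front catches a common modeling slip: a transition row that silently drops an outcome such as abandonment.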

Our first model is deliberately deterministic:

| State | Available action | Next state | Reward |
| --- | --- | --- | --- |
| new_request | auto_approve | closed | -120 |
| new_request | request_evidence | evidence_ready | -4 |
| new_request | human_review | review_queue | -18 |
| evidence_ready | auto_approve | closed | +70 |
| evidence_ready | human_review | review_queue | -10 |
| review_queue | approve_rotation | closed | +80 |

The state must contain information needed for the next decision. new_request and evidence_ready are different states because approving after evidence isn't the same decision as approving without it. Under the Markov assumption, once the current state and action are known, earlier history doesn't change the transition distribution.[1]

Figure: state diagram with new_request (requires access review), evidence_ready, review_queue, and closed (unsupported approval).

Value and Bellman Backups

A policy $\pi(a \mid s)$ describes how an agent chooses actions. The optimal value $V^*(s)$ is the best expected future return achievable from state $s$.

Start at the end:

  • closed has no future reward, so $V^*(\texttt{closed}) = 0$.
  • review_queue has one useful action, so $V^*(\texttt{review\_queue}) = 80$.
  • evidence_ready chooses between approving for $70$ and reviewing for $-10 + 0.90(80) = 62$, so its value is $70$.
  • new_request chooses among $-120$, $-4 + 0.90(70) = 59$, and $-18 + 0.90(80) = 54$, so its value is $59$.

This backward computation is a Bellman backup:

$$V^*(s) = \max_{a \in A(s)} \sum_{s'} P(s' \mid s,a) \left[R(s,a,s') + \gamma V^*(s')\right]$$

one-bellman-backup.py

```python
gamma = 0.90
successor_value = {
    "closed": 0.0,
    "evidence_ready": 70.0,
    "review_queue": 80.0,
}
start_actions = {
    "auto_approve": (-120, "closed"),
    "request_evidence": (-4, "evidence_ready"),
    "human_review": (-18, "review_queue"),
}

scores = {
    action: reward + gamma * successor_value[next_state]
    for action, (reward, next_state) in start_actions.items()
}
for action, score in scores.items():
    print(f"Q(new_request, {action:13}) = {score:5.1f}")
print("greedy action =", max(scores, key=scores.get))
```

Output

```text
Q(new_request, auto_approve ) = -120.0
Q(new_request, request_evidence) =  59.0
Q(new_request, human_review ) =  54.0
greedy action = request_evidence
```

Figure: Bellman backup table for auto approval, request evidence, and human review, decomposing totals into immediate rewards minus 120, minus 4, and minus 18 plus discounted successor values 0, 63, and 72, producing action values minus 120, 59, and 54.
Each action starts at zero, pays its immediate reward, then adds discounted successor value. Requesting evidence reaches `59`, narrowly exceeding review at `54`; the backup stores that maximum as `V*(new_request)`.

Value iteration: plan when the model is known

Hand backups work because this MDP is tiny. Value iteration applies the same backup repeatedly to every non-terminal state until values stop changing. It assumes transition probabilities and rewards are already specified.

value-iteration-access-workflow.py

```python
gamma = 0.90
states = ["new_request", "evidence_ready", "review_queue", "closed"]
actions = {
    "new_request": ["auto_approve", "request_evidence", "human_review"],
    "evidence_ready": ["auto_approve", "human_review"],
    "review_queue": ["approve_rotation"],
}
# (state, action) -> list of (probability, next_state, reward)
transitions = {
    ("new_request", "auto_approve"): [(1.0, "closed", -120)],
    ("new_request", "request_evidence"): [(1.0, "evidence_ready", -4)],
    ("new_request", "human_review"): [(1.0, "review_queue", -18)],
    ("evidence_ready", "auto_approve"): [(1.0, "closed", 70)],
    ("evidence_ready", "human_review"): [(1.0, "review_queue", -10)],
    ("review_queue", "approve_rotation"): [(1.0, "closed", 80)],
}

def action_value(state, action, values):
    return sum(
        probability * (reward + gamma * values[next_state])
        for probability, next_state, reward in transitions[(state, action)]
    )

values = {state: 0.0 for state in states}
for sweep in range(3):
    # Synchronous sweep: every update reads the previous sweep's values.
    updated = values.copy()
    for state in actions:
        updated[state] = max(
            action_value(state, action, values) for action in actions[state]
        )
    values = updated
    print(
        f"sweep {sweep + 1}: "
        f"new={values['new_request']:5.1f} "
        f"evidence={values['evidence_ready']:5.1f} "
        f"review={values['review_queue']:5.1f}"
    )

# Greedy policy with respect to the final values.
policy = {
    state: max(actions[state], key=lambda action: action_value(state, action, values))
    for state in actions
}
print("policy =", policy)
```

Output

```text
sweep 1: new= -4.0 evidence= 70.0 review= 80.0
sweep 2: new= 59.0 evidence= 70.0 review= 80.0
sweep 3: new= 59.0 evidence= 70.0 review= 80.0
policy = {'new_request': 'request_evidence', 'evidence_ready': 'auto_approve', 'review_queue': 'approve_rotation'}
```

The first sweep learns terminal-adjacent outcomes. The second propagates them back to the initial request. Value methods can assign credit to an early evidence request whose payoff arrives later.
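The lab above runs a fixed three sweeps, which is enough for this workflow. A more general stopping rule, sketched below under the same deterministic transitions, ends when the largest value change in a sweep falls below a tolerance. The `tolerance` and `max_sweeps` parameters and the nested-dictionary layout are illustrative choices, not part of the lesson's lab.

```python
gamma = 0.90
# Same deterministic workflow: state -> action -> [(probability, next_state, reward)].
transitions = {
    "new_request": {
        "auto_approve": [(1.0, "closed", -120)],
        "request_evidence": [(1.0, "evidence_ready", -4)],
        "human_review": [(1.0, "review_queue", -18)],
    },
    "evidence_ready": {
        "auto_approve": [(1.0, "closed", 70)],
        "human_review": [(1.0, "review_queue", -10)],
    },
    "review_queue": {"approve_rotation": [(1.0, "closed", 80)]},
}

def value_iteration(tolerance=1e-9, max_sweeps=100):
    values = {state: 0.0 for state in transitions}
    values["closed"] = 0.0  # terminal state keeps value 0
    for sweep in range(1, max_sweeps + 1):
        updated = dict(values)
        for state, by_action in transitions.items():
            updated[state] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in by_action.values()
            )
        largest_change = max(abs(updated[s] - values[s]) for s in values)
        values = updated
        if largest_change < tolerance:
            return values, sweep
    return values, max_sweeps

values, sweeps = value_iteration()
print(f"stopped after {sweeps} sweeps: {values}")
```

Here the third sweep changes nothing, so the loop stops there. In an MDP with cycles and $\gamma < 1$, the changes shrink gradually instead, and the tolerance decides when the values are close enough.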

Transition Risk Can Reverse a Decision

The deterministic model hid an operational risk: some users never provide the requested evidence. Suppose request_evidence has these transitions:

| Action from new_request | Probability | Next state | Reward |
| --- | --- | --- | --- |
| request_evidence | 0.75 | evidence_ready | -4 |
| request_evidence | 0.25 | closed (after abandonment) | -54 |

The expected value of requesting evidence is now

$$0.75[-4 + 0.90(70)] + 0.25[-54] = 30.75$$

while direct review is still $-18 + 0.90(80) = 54$. Modeling abandonment flips the recommended action.

stochastic-transition-backup.py

```python
gamma = 0.90
values = {"closed": 0.0, "evidence_ready": 70.0, "review_queue": 80.0}
transitions = {
    "request_evidence": [
        (0.75, "evidence_ready", -4),
        (0.25, "closed", -54),
    ],
    "human_review": [(1.0, "review_queue", -18)],
}

def expected_action_value(action):
    return sum(
        probability * (reward + gamma * values[next_state])
        for probability, next_state, reward in transitions[action]
    )

for action in transitions:
    print(f"{action:13} expected return={expected_action_value(action):5.2f}")
print("risk-aware action =", max(transitions, key=expected_action_value))
```

Output

```text
request_evidence expected return=30.75
human_review  expected return=54.00
risk-aware action = human_review
```

Expected values describe many similar requests, not a guaranteed outcome for one user. Simulation makes that variation visible.

simulate-policy-returns.py

```python
import numpy as np

gamma = 0.90
rng = np.random.default_rng(7)

def request_evidence_return():
    if rng.random() < 0.75:
        return -4 + gamma * 70
    return -54

evidence_returns = np.array([request_evidence_return() for _ in range(10_000)])
review_returns = np.full(10_000, -18 + gamma * 80)

print(f"request_evidence mean={evidence_returns.mean():5.2f}")
print(f"human_review mean={review_returns.mean():5.2f}")
print("abandoned evidence routes =", int((evidence_returns < 0).sum()))
```

Output

```text
request_evidence mean=30.33
human_review mean=54.00
abandoned evidence routes = 2537
```

Production probabilities can't be invented from one example. They must be estimated from logged episodes, monitored as policy changes alter user behavior, and evaluated on future data.
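As a rough illustration of estimating a transition probability from logs, the sketch below counts next states after request_evidence and attaches a normal-approximation standard error. The log contents (148 evidence arrivals, 52 abandonments) are invented for the example; they are not measurements from any real system.

```python
from collections import Counter

# Hypothetical logged next states after request_evidence; real logs would be far larger.
logged_next_states = ["evidence_ready"] * 148 + ["closed"] * 52

counts = Counter(logged_next_states)
total = sum(counts.values())
for next_state, count in counts.items():
    estimate = count / total
    # Normal-approximation standard error for a proportion.
    standard_error = (estimate * (1 - estimate) / total) ** 0.5
    print(
        f"P({next_state} | new_request, request_evidence) ~ "
        f"{estimate:.3f} +/- {standard_error:.3f}"
    )
```

The standard error is a reminder that a few hundred episodes leave real uncertainty around the estimate, and that uncertainty feeds directly into the expected-value comparison.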

Q-learning: learn from experience

Value iteration required a transition table. If the router instead observes episodes, it can learn action values $Q(s,a)$ from sampled transitions:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'}Q(s',a') - Q(s,a) \right]$$

where $\alpha$ is the learning rate. This is the tabular Q-learning update. Under its standard conditions (finite state and action sets, every state-action pair visited infinitely often, and suitably decaying learning rates), it converges to optimal action values; practical systems still face limited data, drifting behavior, and unsafe exploration.[2]
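One update can be traced by hand before running a full training loop. The sketch below applies the rule to a single sampled transition; the helper name `q_update`, the starting estimate of 20, and the learning rate of 0.5 are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def q_update(q_sa, reward, next_state_values, alpha, gamma=0.90):
    """Return the updated Q(s, a) after one sampled transition."""
    # A terminal successor has no actions, so its future value is 0.
    future = max(next_state_values) if next_state_values else 0.0
    td_target = reward + gamma * future
    return q_sa + alpha * (td_target - q_sa)

# Hypothetical current estimate Q(new_request, request_evidence) = 20.
# The sampled successor evidence_ready has action values 70 and 62.
updated = q_update(q_sa=20.0, reward=-4, next_state_values=[70.0, 62.0], alpha=0.5)
print(updated)  # 39.5

# An abandoned request ends the episode, so the target is the reward alone.
print(q_update(q_sa=20.0, reward=-54, next_state_values=[], alpha=0.5))  # -17.0
```

The target for the first transition is $-4 + 0.90(70) = 59$, and a learning rate of 0.5 moves the estimate halfway from 20 toward it.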

The lab below uses epsilon-greedy coverage and a visit-count learning rate that shrinks as evidence accumulates.

q-learning-from-routes.py

```python
import numpy as np

gamma = 0.90
epsilon = 0.15
rng = np.random.default_rng(13)
actions = {
    "new_request": ["auto_approve", "request_evidence", "human_review"],
    "evidence_ready": ["auto_approve", "human_review"],
    "review_queue": ["approve_rotation"],
}

def step(state, action):
    """Sample (next_state, reward) from the stochastic workflow."""
    if state == "new_request":
        if action == "auto_approve":
            return "closed", -120
        if action == "human_review":
            return "review_queue", -18
        # request_evidence: evidence arrives with probability 0.75.
        if rng.random() < 0.75:
            return "evidence_ready", -4
        return "closed", -54
    if state == "evidence_ready":
        return ("closed", 70) if action == "auto_approve" else ("review_queue", -10)
    return "closed", 80

q = {
    state: {action: 0.0 for action in available_actions}
    for state, available_actions in actions.items()
}
visits = {
    state: {action: 0 for action in available_actions}
    for state, available_actions in actions.items()
}
for _ in range(20_000):
    state = "new_request"
    while state != "closed":
        available_actions = actions[state]
        # Epsilon-greedy: explore at random with probability epsilon.
        if rng.random() < epsilon:
            action = rng.choice(available_actions)
        else:
            action = max(available_actions, key=lambda candidate: q[state][candidate])
        next_state, reward = step(state, action)
        future = 0.0 if next_state == "closed" else max(q[next_state].values())
        target = reward + gamma * future
        visits[state][action] += 1
        # Visit-count learning rate that shrinks as evidence accumulates.
        learning_rate = visits[state][action] ** -0.6
        q[state][action] += learning_rate * (target - q[state][action])
        state = next_state

for action, value in q["new_request"].items():
    print(f"Q(new_request, {action:13}) = {value:6.1f}")
print("learned start action =", max(q["new_request"], key=q["new_request"].get))
```

Output

```text
Q(new_request, auto_approve ) = -120.0
Q(new_request, request_evidence) =   30.5
Q(new_request, human_review ) =   54.0
learned start action = human_review
```

The sampled evidence-route estimate lands near the model-based value 30.75, not exactly on it, because this finite run still contains transition noise. The declining learning rate gives later samples less power to overwrite accumulated evidence.

Exploration matters. If an agent always selects its current best action, an unlucky first impression can prevent it from finding the genuinely stronger route. The next mini-lab collapses the completed routes into one-state action returns to isolate that point.

exploration-matters.py

```python
import numpy as np

true_returns = {
    "auto_approve": -120.0,
    "request_evidence": 30.75,
    "human_review": 54.0,
}
action_order = list(true_returns)

def learn(epsilon, seed):
    rng = np.random.default_rng(seed)
    estimates = {action: 0.0 for action in action_order}
    visits = {action: 0 for action in action_order}
    for _ in range(400):
        if rng.random() < epsilon:
            action = rng.choice(action_order)
        else:
            action = max(action_order, key=lambda candidate: estimates[candidate])
        visits[action] += 1
        estimates[action] += (
            true_returns[action] - estimates[action]
        ) / visits[action]
    selected = max(action_order, key=lambda candidate: estimates[candidate])
    return visits, selected

for epsilon in (0.00, 0.10):
    visits, selected = learn(epsilon, seed=5)
    print(f"epsilon={epsilon:.2f} visits={visits} selected={selected}")
```

Output

```text
epsilon=0.00 visits={'auto_approve': 1, 'request_evidence': 399, 'human_review': 0} selected=request_evidence
epsilon=0.10 visits={'auto_approve': 19, 'request_evidence': 24, 'human_review': 357} selected=human_review
```

Unsafe actions can't be explored casually in a live approval system. Offline logs, conservative policy constraints, human approval, or simulation are needed before a learned policy controls costly decisions.
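One conservative constraint is to restrict exploration to an allowed subset of actions, so a blocked action is never sampled. The sketch below shows that idea for epsilon-greedy selection; the `BLOCKED` set, the `reviewer_approved` flag, and the helper names are hypothetical and stand in for whatever approval mechanism a real system uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical rule: auto_approve at new_request is blocked unless a reviewer flag is set.
BLOCKED = {("new_request", "auto_approve")}

def allowed_actions(state, candidates, reviewer_approved=False):
    if reviewer_approved:
        return list(candidates)
    return [action for action in candidates if (state, action) not in BLOCKED]

def choose(state, q_values, epsilon, reviewer_approved=False):
    """Epsilon-greedy over the allowed subset only."""
    allowed = allowed_actions(state, q_values, reviewer_approved)
    if rng.random() < epsilon:
        return allowed[rng.integers(len(allowed))]
    return max(allowed, key=q_values.get)

q_start = {"auto_approve": 0.0, "request_evidence": 0.0, "human_review": 0.0}
# With epsilon = 1.0 every choice is exploratory, yet the blocked action never appears.
picks = {choose("new_request", q_start, epsilon=1.0) for _ in range(200)}
print(sorted(picks))
```

Masking happens before the environment step, so the unsafe action is never executed. The cost is that its value is never learned from experience and must come from offline data or simulation.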

Reward Specification Is Part of the System

An RL policy optimizes the reward it receives, not the intent in a system design document. A flawed score that rewards only short handling time can prefer unsupported automatic approvals.

reward-specification.py

```python
actions = ["auto_approve", "request_evidence", "human_review"]
speed_only = {
    "auto_approve": 100,
    "request_evidence": 45,
    "human_review": 15,
}
outcome_aware = {
    "auto_approve": -120,
    "request_evidence": 31,
    "human_review": 54,
}

for name, reward in [("speed_only", speed_only), ("outcome_aware", outcome_aware)]:
    choice = max(actions, key=lambda action: reward[action])
    print(f"{name:13} chooses {choice:13} reward={reward[choice]:4}")
```

Output

```text
speed_only    chooses auto_approve  reward= 100
outcome_aware chooses human_review  reward=  54
```

Useful reward design records user outcome, policy violations, delays, reviewer cost, abuse risk, and uncertainty. It also audits rare high-harm failures separately from average reward. A high mean return doesn't excuse repeated unsupported approvals.
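A reward built from named terms is easier to audit than a single hand-set score. The sketch below is one hypothetical decomposition: the `RequestOutcome` fields and every weight in `WEIGHTS` are invented for illustration and would need review by policy owners before any real use.

```python
from dataclasses import dataclass

@dataclass
class RequestOutcome:
    resolved_correctly: bool
    policy_violation: bool
    delay_hours: float
    reviewer_minutes: float

# Hypothetical weights; none of these numbers come from the lesson.
WEIGHTS = {
    "correct": 80.0,
    "violation": -200.0,
    "delay_hour": -0.5,
    "reviewer_minute": -1.0,
}

def reward(outcome):
    """Sum the weighted terms so each contribution can be inspected separately."""
    return (
        WEIGHTS["correct"] * outcome.resolved_correctly
        + WEIGHTS["violation"] * outcome.policy_violation
        + WEIGHTS["delay_hour"] * outcome.delay_hours
        + WEIGHTS["reviewer_minute"] * outcome.reviewer_minutes
    )

fast_unsupported = RequestOutcome(False, True, delay_hours=0.0, reviewer_minutes=0.0)
reviewed = RequestOutcome(True, False, delay_hours=8.0, reviewer_minutes=15.0)
print(reward(fast_unsupported), reward(reviewed))  # -200.0 61.0
```

Because the violation term is explicit, a fast unsupported approval cannot win on speed alone, and the count of violations can be reported separately from the mean.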

Policy Gradients and the LLM Bridge

Tabular Q-learning stores a number for every state-action pair. One option for large state or action spaces is a parameterized policy $\pi_\theta(a \mid s)$. REINFORCE increases the probability of sampled actions whose returns exceed a baseline:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i \mid s_i) \left(G_i - b\right)$$

An action-independent baseline can lower variance without changing the expected policy gradient.[3]
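The baseline claim can be checked numerically for a softmax policy over three actions. The sketch below computes the exact expected gradient with and without a baseline, along with the total variance of the single-sample estimate. The uniform policy and the returns 100, 105, and 110 are hypothetical values chosen so that a shared offset dominates; variance reduction depends on the baseline and policy and is not guaranteed for every choice.

```python
import numpy as np

probabilities = np.full(3, 1 / 3)          # uniform softmax policy
returns = np.array([100.0, 105.0, 110.0])  # hypothetical returns with a large shared offset
score = np.eye(3) - probabilities          # row a: gradient of log pi(a) w.r.t. the logits

def samples(baseline):
    # One single-sample gradient estimate per action a: score[a] * (G_a - b).
    return score * (returns - baseline)[:, None]

def expected_gradient(baseline):
    return (probabilities[:, None] * samples(baseline)).sum(axis=0)

def total_variance(baseline):
    centered = samples(baseline) - expected_gradient(baseline)
    return float((probabilities[:, None] * centered**2).sum())

baseline = float(probabilities @ returns)  # mean return under the policy
print(np.allclose(expected_gradient(0.0), expected_gradient(baseline)))  # True
print(total_variance(0.0), total_variance(baseline))
```

The expectation is unchanged because the score rows average to zero under the policy, so any constant multiplied by them vanishes in expectation.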

To expose the arithmetic without a neural network, the next batch includes one completed route for each action and uses their mean return as the baseline. It reuses softmax from classification to turn one logit per action into action probabilities. The policy gradient then shifts probability toward actions with positive advantage and away from actions with negative advantage. Real training estimates that gradient from sampled trajectories.

reinforce-one-update.py

```python
import numpy as np

actions = ["auto_approve", "request_evidence", "human_review"]
returns = np.array([-120.0, 30.75, 54.0])
logits = np.zeros(len(actions))

def softmax(values):
    shifted = values - values.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

probabilities = softmax(logits)
advantages = returns - returns.mean()
score_gradients = np.eye(len(actions)) - probabilities
gradient = (score_gradients * advantages[:, None]).mean(axis=0)
updated = softmax(logits + 0.03 * gradient)

for action, before, after in zip(actions, probabilities, updated):
    print(f"{action:13} before={before:.3f} after={after:.3f}")
```

Output

```text
auto_approve  before=0.333 after=0.089
request_evidence before=0.333 after=0.403
human_review  before=0.333 after=0.508
```

For a language model, the state can include a prompt and generated prefix, while actions are token choices. In InstructGPT, human comparisons trained a reward model and PPO optimized a policy against that learned reward with additional controls; that's a much larger system, but the underlying sequence-and-reward vocabulary begins here.[4]
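To make the token-as-action view concrete, the toy sketch below scores one sampled two-token sequence. The three-word vocabulary, the per-step logits, and the reward and baseline values are all hypothetical; a real model would compute the logits from the prompt and prefix, and this is not how InstructGPT or PPO is implemented.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

vocab = ["approve", "review", "<eos>"]
# Hypothetical logits for two decoding steps (one row per step).
step_logits = np.array([[0.2, 1.0, -1.0], [0.0, 0.0, 2.0]])
sampled_tokens = [1, 2]  # "review", then "<eos>"

# The sequence log-probability is the sum of per-token log-probabilities,
# because each token is one action taken in the state (prompt + prefix).
sequence_log_prob = float(
    sum(log_softmax(logits)[token] for logits, token in zip(step_logits, sampled_tokens))
)

reward, baseline = 54.0, 30.75  # hypothetical sequence-level reward and baseline
# Differentiating this surrogate w.r.t. model parameters gives a REINFORCE-style estimate.
surrogate_loss = -(reward - baseline) * sequence_log_prob
print([vocab[token] for token in sampled_tokens], sequence_log_prob, surrogate_loss)
```

A positive advantage makes the loss fall as the sequence's log-probability rises, which is the same direction of update as in the three-action example.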

Evaluate a Policy on Held-Out Episodes

Training reward isn't a deployment guarantee. A routing policy should be compared on later requests whose outcomes didn't influence its rules, rewards, or thresholds. Use a train, validation, and test split: fit on earlier episodes, tune on separate episodes, then reserve a final untouched period. The next chapter treats that evaluation design carefully; this small table shows the question to ask.

heldout-policy-evaluation.py

```python
episodes = [
    ("required_review", {"auto_approve": -120, "request_evidence": 31, "human_review": 54}),
    ("required_review", {"auto_approve": -120, "request_evidence": 31, "human_review": 54}),
    ("clear_evidence", {"auto_approve": 70, "request_evidence": 59, "human_review": 54}),
    ("clear_evidence", {"auto_approve": 70, "request_evidence": 59, "human_review": 54}),
]
policies = {
    "always_auto": lambda kind: "auto_approve",
    "always_review": lambda kind: "human_review",
    "risk_aware": lambda kind: (
        "human_review" if kind == "required_review" else "auto_approve"
    ),
}

for name, policy in policies.items():
    returns = [reward[policy(kind)] for kind, reward in episodes]
    severe_errors = sum(value <= -100 for value in returns)
    print(
        f"{name:13} mean={sum(returns) / len(returns):5.1f} "
        f"severe_errors={severe_errors}"
    )
```

Output

```text
always_auto   mean=-25.0 severe_errors=2
always_review mean= 54.0 severe_errors=0
risk_aware    mean= 62.0 severe_errors=0
```

This table is illustrative, not evidence that a production policy works. Real evaluation must prevent future outcomes, repeat users, and policy-tuning decisions from leaking into the held-out estimate.
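A time-ordered split is one simple guard against future outcomes leaking into the estimate. The sketch below sorts hypothetical episodes by decision date and cuts them into earlier, middle, and latest periods. The dates, request ids, and 60/20/20 fractions are invented for illustration.

```python
from datetime import date

# Hypothetical logged episodes: (decision date, request id).
episodes = [(date(2025, 1, day), f"req_{day:02d}") for day in range(1, 11)]

def chronological_split(rows, train_fraction=0.6, validation_fraction=0.2):
    """Split rows by time so later periods never inform earlier fitting."""
    ordered = sorted(rows, key=lambda row: row[0])
    train_end = int(len(ordered) * train_fraction)
    validation_end = train_end + int(len(ordered) * validation_fraction)
    return ordered[:train_end], ordered[train_end:validation_end], ordered[validation_end:]

train, validation, final_test = chronological_split(episodes)
print(len(train), len(validation), len(final_test))  # 6 2 2
print("latest training date:", train[-1][0], "earliest test date:", final_test[0][0])
```

Splitting by date does not handle repeat users on its own; requests from the same account may still need to be kept within a single period.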

Practice tasks

  1. Change abandonment probability in stochastic-transition-backup.py. At what probability does human_review overtake request_evidence?
  2. Add a deny_request action and choose rewards that discourage both unsupported approvals and unfair denials. Explain every term.
  3. Modify the Q-learning lab so high-risk actions require a reviewer approval flag. Compare learned reward with and without that constraint.
  4. Replace the hand-authored held-out episodes with a time-ordered sample. List features that wouldn't exist at decision time.

Practice guidance

  1. If abandonment probability is p, evidence-request value is (1 - p)(59) + p(-54) = 59 - 113p. Human review overtakes it when p > 5/113, about 4.4%.
  2. A defensible deny_request reward should include policy correctness and user harm, not processing speed alone. Check whether any shortcut action wins for the wrong reason.
  3. An approval constraint should block or reroute an unsafe action before the environment step. Compare mean reward and count of blocked unsafe attempts; don't optimize mean alone.
  4. Likely leakage fields include evidence_received_at, review outcome, approval reversal, and events caused by router action. Keep the decision-time snapshot separate from the later outcome log.
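The break-even point in guidance item 1 can be checked with a short sweep. The sketch below reuses the lesson's rewards and discount; the helper name `evidence_value` and the probabilities tried are illustrative choices.

```python
gamma = 0.90
evidence_success = -4 + gamma * 70  # return when evidence arrives
abandon = -54                       # return when the user abandons
review = -18 + gamma * 80           # return of direct human review

def evidence_value(p_abandon):
    return (1 - p_abandon) * evidence_success + p_abandon * abandon

# Solve (1 - p) * 59 + p * (-54) = 54 for p.
break_even = (evidence_success - review) / (evidence_success - abandon)
print(f"break-even abandonment = {break_even:.4f}")  # 0.0442

for p in (0.00, 0.04, 0.05, 0.25):
    best = "request_evidence" if evidence_value(p) > review else "human_review"
    print(f"p={p:.2f} evidence={evidence_value(p):6.2f} best={best}")
```

The choice flips between 4% and 5% abandonment, matching the closed-form threshold of 5/113.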

Key concepts

  • episode and discounted return
  • Markov Decision Process
  • Bellman backup and optimal value
  • value iteration with a known model
  • stochastic transitions and expected value
  • Q-learning from sampled transitions
  • exploration and safety constraints
  • reward specification
  • policy gradients
  • held-out policy evaluation

Evaluation rubric

  • Foundational: Computes discounted returns and explains why an evidence request can beat an immediate approval.
  • Intermediate: Defines the access-routing MDP, implements Bellman backups and value iteration, and explains why abandonment risk changes the policy.
  • Advanced: Implements tabular learning, critiques reward design and exploration safety, and connects policy-gradient updates to LLM post-training without overstating the analogy.

Common pitfalls

  • Treating a classifier probability as a complete policy when actions change later observations and outcomes.
  • Rewarding speed while omitting policy violations, user harm, or review cost.
  • Planning from invented transition probabilities instead of measured outcomes.
  • Exploring unsafe routes on live users without approval or conservative constraints.
  • Estimating policy quality from training episodes or from a holdout already used to tune rewards.
  • Saying RLHF is identical to the toy MDP rather than a scaled system with learned rewards and optimization controls.
Mastery Check

1. With gamma = 0.90 and certain evidence arrival, three routes have rewards auto_approve: [-120], request_evidence: [-4, 70], and human_review: [-18, 80]. Which route has the largest discounted return?
2. In the access-routing MDP, auto_approve is a policy violation before evidence but a supported resolution after evidence arrives. How should the state representation handle this?
3. At evidence_ready, auto_approve gives +70 and closes the case. human_review gives -10 then reaches review_queue, whose optimal value is 80. With gamma = 0.90, what is V*(evidence_ready)?
4. Synchronous value iteration starts with all state values set to 0 in the deterministic workflow and gamma = 0.90. Using previous-sweep values for every update, what happens to V(new_request) over the first two sweeps?
5. In the access-review MDP, request_evidence from new_request succeeds with probability 0.75, giving -4 now and successor value 70, or abandons with probability 0.25, closing with reward -54. With gamma = 0.90, human_review gives -18 now and successor value 80. Which initial action does the expected backup select?
6. An access router lacks a trusted transition model but observes sampled transitions (s, a, r, s'). For discount gamma, which update is the tabular Q-learning rule for learning action values directly from experience?
7. A one-state learner initializes all action-return estimates at 0 and chooses greedily when epsilon = 0. It visits human_review 0 times and selects request_evidence, even though the true return for human_review is 54 versus 30.75 for request_evidence. What conclusion follows for an access-routing learner?
8. A reward table gives auto_approve 100, request_evidence 45, and human_review 15 because it scores only short handling time. In the required-review setting, what failure does this create?
9. A REINFORCE batch has returns auto_approve = -120, request_evidence = 30.75, and human_review = 54. The baseline is the mean return and initial softmax probabilities are equal. In a larger language-model policy, analogous actions could be token choices. Which update direction is correct without overstating the analogy?
10. A routing policy was fit on earlier episodes and tuned after reviewing validation results. Which evaluation setup gives the most credible deployment estimate?

Next Step
Continue to Validation and Leakage

You now have a policy and episode returns; next comes timeline splitting, future-outcome leakage, and the question of whether a learned decision rule generalizes.

References

[1] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. http://incompleteideas.net/book/the-book-2nd.html
[2] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning. https://doi.org/10.1007/BF00992698
[3] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. https://link.springer.com/article/10.1007/BF00992696
[4] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022. https://arxiv.org/abs/2203.02155