Learn reinforcement learning through the access-review workflow from earlier lessons. Define an MDP, compute discounted returns and Bellman backups, implement value iteration and Q-learning, model abandonment risk, and connect policy gradients to LLM post-training.
Predicting whether an access-review request needs human review is a one-step decision. A real access assistant must choose a sequence: approve now, request evidence, or route to a reviewer; then react when evidence arrives or the user abandons the request.
Reinforcement learning (RL) studies that sequential decision problem. The agent chooses actions, the environment produces next states and rewards, and the objective is high total reward over an episode. One small access-review workflow gives every equation a concrete operational meaning.[1]
You need Python, small NumPy arrays, and weighted averages. The rewards below are teaching scores, not a production business objective.
Access request acct_10234 is stale and policy P-7 requires review unless strong evidence arrives. Three initial actions are available:
| Initial action | Immediate effect | Reward now | Later best outcome |
|---|---|---|---|
| auto_approve | Closes an unsupported required-review request | -120 | Terminal mistake |
| request_evidence | Asks for evidence before approving | -4 | Evidence may support an auto approval |
| human_review | Enters reviewer queue immediately | -18 | Reviewer can approve correctly |
An episode starts at new_request and ends at closed. Suppose evidence definitely arrives and rewards after the first action are:
| Route | Reward sequence | Meaning |
|---|---|---|
| Approve immediately | [-120] | Fast, but violates policy |
| Request evidence, then approval | [-4, +70] | Small delay, supported resolution |
| Review, then approve | [-18, +80] | More handling cost, correct resolution |
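RL compares routes with the discounted return

$$G = \sum_{k=0}^{T-1} \gamma^{k} r_k$$

where $r_k$ is the reward at step $k$, $T$ is the number of steps in the episode, and $0 \le \gamma \le 1$. Values below $1$ discount delayed reward; we use $\gamma = 0.90$.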
```python
gamma = 0.90
routes = {
    "auto_approve": [-120],
    "request_evidence": [-4, 70],
    "human_review": [-18, 80],
}

def discounted_return(rewards):
    # Weight the reward at each step by gamma**step, starting at step 0.
    return sum((gamma**step) * reward for step, reward in enumerate(rewards))

for action, rewards in routes.items():
    print(f"{action:14} return={discounted_return(rewards):6.1f}")
```

```
auto_approve   return=-120.0
request_evidence return=  59.0
human_review   return=  54.0
```

With certain evidence arrival, request_evidence earns $-4 + 0.9 \times 70 = 59$, just above human_review at $-18 + 0.9 \times 80 = 54$. The route decision depends on future consequences as well as immediate cost.
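The same return can also be accumulated from the last reward backward, which previews the backups used later in the lesson. The following is a minimal sketch under the reward lists above; `discounted_return_backward` is a name introduced here for illustration only.

```python
GAMMA = 0.90

def discounted_return_backward(rewards, gamma=GAMMA):
    """Accumulate G = r_0 + gamma * (r_1 + gamma * (...)) from the last reward back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

routes = {
    "auto_approve": [-120],
    "request_evidence": [-4, 70],
    "human_review": [-18, 80],
}
backward = {action: discounted_return_backward(r) for action, r in routes.items()}
for action, value in backward.items():
    print(f"{action:16} return={value:6.1f}")
```

Each loop step treats the running total as the value of everything that follows, which is the same idea a Bellman backup applies to states.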
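A Markov Decision Process (MDP) supplies the pieces needed to score a policy: a set of states, the actions available in each state, transition probabilities $P(s' \mid s, a)$, rewards, and a discount factor $\gamma$.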
Our first model is deliberately deterministic:
| State | Available action | Next state | Reward |
|---|---|---|---|
| new_request | auto_approve | closed | -120 |
| new_request | request_evidence | evidence_ready | -4 |
| new_request | human_review | review_queue | -18 |
| evidence_ready | auto_approve | closed | +70 |
| evidence_ready | human_review | review_queue | -10 |
| review_queue | approve_rotation | closed | +80 |
The state must contain information needed for the next decision. new_request and evidence_ready are different states because approving after evidence isn't the same decision as approving without it. Under the Markov assumption, once the current state and action are known, earlier history doesn't change the transition distribution.[1]
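One way to make the transition table executable is a dictionary keyed by state and action, with a check that every outcome list forms a probability distribution and points at a known state. This is a sketch, not a required representation; `validate_mdp` and `TERMINAL` are hypothetical names introduced here.

```python
import math

# (state, action) -> list of (probability, next_state, reward), copied from the table.
transitions = {
    ("new_request", "auto_approve"): [(1.0, "closed", -120)],
    ("new_request", "request_evidence"): [(1.0, "evidence_ready", -4)],
    ("new_request", "human_review"): [(1.0, "review_queue", -18)],
    ("evidence_ready", "auto_approve"): [(1.0, "closed", 70)],
    ("evidence_ready", "human_review"): [(1.0, "review_queue", -10)],
    ("review_queue", "approve_rotation"): [(1.0, "closed", 80)],
}
TERMINAL = {"closed"}

def validate_mdp(transitions, terminal_states):
    """Return a list of problems: bad probability totals or unknown next states."""
    known_states = {state for state, _ in transitions} | set(terminal_states)
    problems = []
    for (state, action), outcomes in transitions.items():
        total = sum(probability for probability, _, _ in outcomes)
        if not math.isclose(total, 1.0):
            problems.append(f"{state}/{action}: probabilities sum to {total}")
        for _, next_state, _ in outcomes:
            if next_state not in known_states:
                problems.append(f"{state}/{action}: unknown next state {next_state}")
    return problems

problems = validate_mdp(transitions, TERMINAL)
print(problems or "MDP table is consistent")
```

A check like this becomes more useful once stochastic transitions are added, because a missing outcome row silently changes every expected value.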
A policy $\pi(a \mid s)$ describes how an agent chooses actions. The optimal value $V^*(s)$ is the best expected future return achievable from state $s$.
Start at the end:
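- closed has no future reward, so $V^*(\texttt{closed}) = 0$.
- review_queue has one useful action, so $V^*(\texttt{review\_queue}) = 80$.
- evidence_ready chooses between approving for $70$ and reviewing for $-10 + 0.9 \times 80 = 62$, so its value is $70$.
- new_request chooses among $-120$, $-4 + 0.9 \times 70 = 59$, and $-18 + 0.9 \times 80 = 54$, so its value is $59$.

This backward computation is a Bellman backup:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]$$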
```python
gamma = 0.90
successor_value = {
    "closed": 0.0,
    "evidence_ready": 70.0,
    "review_queue": 80.0,
}
start_actions = {
    "auto_approve": (-120, "closed"),
    "request_evidence": (-4, "evidence_ready"),
    "human_review": (-18, "review_queue"),
}

# One-step backup: immediate reward plus discounted value of the next state.
scores = {
    action: reward + gamma * successor_value[next_state]
    for action, (reward, next_state) in start_actions.items()
}
for action, score in scores.items():
    print(f"Q(new_request, {action:13}) = {score:5.1f}")
print("greedy action =", max(scores, key=scores.get))
```

```
Q(new_request, auto_approve ) = -120.0
Q(new_request, request_evidence) =  59.0
Q(new_request, human_review ) =  54.0
greedy action = request_evidence
```
Hand backups work because this MDP is tiny. Value iteration applies the same backup repeatedly to every non-terminal state until values stop changing. It assumes transition probabilities and rewards are already specified.
```python
gamma = 0.90
states = ["new_request", "evidence_ready", "review_queue", "closed"]
actions = {
    "new_request": ["auto_approve", "request_evidence", "human_review"],
    "evidence_ready": ["auto_approve", "human_review"],
    "review_queue": ["approve_rotation"],
}
# (state, action) -> list of (probability, next_state, reward)
transitions = {
    ("new_request", "auto_approve"): [(1.0, "closed", -120)],
    ("new_request", "request_evidence"): [(1.0, "evidence_ready", -4)],
    ("new_request", "human_review"): [(1.0, "review_queue", -18)],
    ("evidence_ready", "auto_approve"): [(1.0, "closed", 70)],
    ("evidence_ready", "human_review"): [(1.0, "review_queue", -10)],
    ("review_queue", "approve_rotation"): [(1.0, "closed", 80)],
}

def action_value(state, action, values):
    return sum(
        probability * (reward + gamma * values[next_state])
        for probability, next_state, reward in transitions[(state, action)]
    )

values = {state: 0.0 for state in states}
for sweep in range(3):
    # Back up every non-terminal state from the previous sweep's values.
    updated = values.copy()
    for state in actions:
        updated[state] = max(
            action_value(state, action, values) for action in actions[state]
        )
    values = updated
    print(
        f"sweep {sweep + 1}: "
        f"new={values['new_request']:5.1f} "
        f"evidence={values['evidence_ready']:5.1f} "
        f"review={values['review_queue']:5.1f}"
    )

policy = {
    state: max(actions[state], key=lambda action: action_value(state, action, values))
    for state in actions
}
print("policy =", policy)
```

```
sweep 1: new= -4.0 evidence= 70.0 review= 80.0
sweep 2: new= 59.0 evidence= 70.0 review= 80.0
sweep 3: new= 59.0 evidence= 70.0 review= 80.0
policy = {'new_request': 'request_evidence', 'evidence_ready': 'auto_approve', 'review_queue': 'approve_rotation'}
```

The first sweep learns terminal-adjacent outcomes. The second propagates them back to the initial request. Value methods can assign credit to an early evidence request whose payoff arrives later.
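The run above uses a fixed three sweeps, while the description says value iteration continues until values stop changing. A hedged sketch of that stopping rule follows; the `tolerance` and `max_sweeps` parameters and the `value_iteration` function name are assumptions introduced here, not part of the lesson's lab.

```python
def value_iteration(transitions, states, gamma=0.90, tolerance=1e-9, max_sweeps=1000):
    """Sweep until the largest value change falls below `tolerance`."""
    values = {state: 0.0 for state in states}
    actions_by_state = {}
    for state, action in transitions:
        actions_by_state.setdefault(state, []).append(action)
    for sweep in range(1, max_sweeps + 1):
        updated = values.copy()
        for state, available in actions_by_state.items():
            updated[state] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in transitions[(state, a)])
                for a in available
            )
        largest_change = max(abs(updated[s] - values[s]) for s in states)
        values = updated
        if largest_change < tolerance:
            return values, sweep
    return values, max_sweeps

states = ["new_request", "evidence_ready", "review_queue", "closed"]
transitions = {
    ("new_request", "auto_approve"): [(1.0, "closed", -120)],
    ("new_request", "request_evidence"): [(1.0, "evidence_ready", -4)],
    ("new_request", "human_review"): [(1.0, "review_queue", -18)],
    ("evidence_ready", "auto_approve"): [(1.0, "closed", 70)],
    ("evidence_ready", "human_review"): [(1.0, "review_queue", -10)],
    ("review_queue", "approve_rotation"): [(1.0, "closed", 80)],
}
values, sweeps_used = value_iteration(transitions, states)
print(values, "sweeps:", sweeps_used)
```

The final sweep changes nothing and only confirms convergence, which is why a threshold test needs one more pass than the values themselves do.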
The deterministic model hid an operational risk: some users never provide the requested evidence. Suppose request_evidence has these transitions:
| Action from new_request | Probability | Next state | Reward |
|---|---|---|---|
| request_evidence | 0.75 | evidence_ready | -4 |
| request_evidence | 0.25 | closed after abandonment | -54 |
The expected value of requesting evidence is now

$$0.75\,(-4 + 0.9 \times 70) + 0.25\,(-54) = 44.25 - 13.5 = 30.75$$

while direct review is still $-18 + 0.9 \times 80 = 54$. Modeling abandonment flips the recommended action.
```python
gamma = 0.90
values = {"closed": 0.0, "evidence_ready": 70.0, "review_queue": 80.0}
# action -> list of (probability, next_state, reward)
transitions = {
    "request_evidence": [
        (0.75, "evidence_ready", -4),
        (0.25, "closed", -54),
    ],
    "human_review": [(1.0, "review_queue", -18)],
}

def expected_action_value(action):
    return sum(
        probability * (reward + gamma * values[next_state])
        for probability, next_state, reward in transitions[action]
    )

for action in transitions:
    print(f"{action:13} expected return={expected_action_value(action):5.2f}")
print("risk-aware action =", max(transitions, key=expected_action_value))
```

```
request_evidence expected return=30.75
human_review  expected return=54.00
risk-aware action = human_review
```

Expected values describe many similar requests, not a guaranteed outcome for one user. Simulation makes that variation visible.
```python
import numpy as np

gamma = 0.90
rng = np.random.default_rng(7)

def request_evidence_return():
    # Evidence arrives with probability 0.75; otherwise the user abandons.
    if rng.random() < 0.75:
        return -4 + gamma * 70
    return -54

evidence_returns = np.array([request_evidence_return() for _ in range(10_000)])
review_returns = np.full(10_000, -18 + gamma * 80)

print(f"request_evidence mean={evidence_returns.mean():5.2f}")
print(f"human_review mean={review_returns.mean():5.2f}")
print("abandoned evidence routes =", int((evidence_returns < 0).sum()))
```

```
request_evidence mean=30.33
human_review mean=54.00
abandoned evidence routes = 2537
```

Production probabilities can't be invented from one example. They must be estimated from logged episodes, monitored as policy changes alter user behavior, and evaluated on future data.
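As a rough illustration of estimating the abandonment probability from logs, the sketch below computes a point estimate with a normal-approximation interval and shows how that uncertainty carries into the expected return. The logged counts are synthetic, and `abandonment_estimate` and `evidence_value` are hypothetical helper names.

```python
import math

GAMMA = 0.90

# Hypothetical logged outcomes of request_evidence: True means the user abandoned.
# These counts are synthetic and only illustrate the calculation.
abandoned_flags = [True] * 52 + [False] * 148

def abandonment_estimate(flags, z=1.96):
    """Return (estimate, low, high) using a normal-approximation interval."""
    n = len(flags)
    p_hat = sum(flags) / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

def evidence_value(p_abandon, gamma=GAMMA):
    """Expected return of request_evidence for a given abandonment probability."""
    return (1 - p_abandon) * (-4 + gamma * 70) + p_abandon * (-54)

p_hat, low, high = abandonment_estimate(abandoned_flags)
print(f"abandonment estimate={p_hat:.3f} interval=({low:.3f}, {high:.3f})")
print(f"evidence value range=({evidence_value(high):.1f}, {evidence_value(low):.1f})")
```

An interval on the probability becomes an interval on the action value, which is a simple way to see whether the recommended action is sensitive to estimation error.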
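Value iteration required a transition table. If the router instead observes episodes, it can learn action values from sampled transitions:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]$$

Here $\alpha$ is the learning rate, $r$ is the sampled reward, and $s'$ is the sampled next state.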
This is the tabular Q-learning update. Under its standard finite-setting coverage and learning-rate conditions, it converges to optimal action values; practical systems still face limited data, drifting behavior, and unsafe exploration.[2]
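To see the arithmetic of one step, the sketch below applies two sampled transitions by hand: each moves the stored action value a fraction of the way toward the reward plus the discounted best next-state value. The step size of 0.5 and the pre-filled evidence_ready values are illustrative assumptions, and `q_update` is a name introduced here.

```python
gamma = 0.90
alpha = 0.5  # illustrative step size, not the schedule used in the lab

# Assumed starting table: evidence_ready values are taken as already learned.
q = {
    "new_request": {"auto_approve": 0.0, "request_evidence": 0.0, "human_review": 0.0},
    "evidence_ready": {"auto_approve": 70.0, "human_review": 62.0},
}

def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the target."""
    # States missing from the table (here: closed) are terminal and worth 0.
    future = max(q[next_state].values()) if next_state in q else 0.0
    target = reward + gamma * future
    q[state][action] += alpha * (target - q[state][action])
    return target

# Sample 1: evidence arrives after the request.
target_1 = q_update(q, "new_request", "request_evidence", -4, "evidence_ready", alpha, gamma)
after_1 = q["new_request"]["request_evidence"]

# Sample 2: the user abandons the request.
target_2 = q_update(q, "new_request", "request_evidence", -54, "closed", alpha, gamma)
after_2 = q["new_request"]["request_evidence"]

print(target_1, after_1)
print(target_2, after_2)
```

With a large constant step size, a single abandoned request swings the estimate sharply, which motivates a step size that shrinks as samples accumulate.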
The lab below uses epsilon-greedy coverage and a visit-count learning rate that shrinks as evidence accumulates.
```python
import numpy as np

gamma = 0.90
epsilon = 0.15
rng = np.random.default_rng(13)
actions = {
    "new_request": ["auto_approve", "request_evidence", "human_review"],
    "evidence_ready": ["auto_approve", "human_review"],
    "review_queue": ["approve_rotation"],
}

def step(state, action):
    """Sample (next_state, reward) from the environment."""
    if state == "new_request":
        if action == "auto_approve":
            return "closed", -120
        if action == "human_review":
            return "review_queue", -18
        # request_evidence: evidence arrives with probability 0.75.
        if rng.random() < 0.75:
            return "evidence_ready", -4
        return "closed", -54
    if state == "evidence_ready":
        return ("closed", 70) if action == "auto_approve" else ("review_queue", -10)
    return "closed", 80

q = {
    state: {action: 0.0 for action in available_actions}
    for state, available_actions in actions.items()
}
visits = {
    state: {action: 0 for action in available_actions}
    for state, available_actions in actions.items()
}
for _ in range(20_000):
    state = "new_request"
    while state != "closed":
        available_actions = actions[state]
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.choice(available_actions)
        else:
            action = max(available_actions, key=lambda candidate: q[state][candidate])
        next_state, reward = step(state, action)
        future = 0.0 if next_state == "closed" else max(q[next_state].values())
        target = reward + gamma * future
        visits[state][action] += 1
        learning_rate = visits[state][action] ** -0.6
        q[state][action] += learning_rate * (target - q[state][action])
        state = next_state

for action, value in q["new_request"].items():
    print(f"Q(new_request, {action:13}) = {value:6.1f}")
print("learned start action =", max(q["new_request"], key=q["new_request"].get))
```

```
Q(new_request, auto_approve ) = -120.0
Q(new_request, request_evidence) =   30.5
Q(new_request, human_review ) =   54.0
learned start action = human_review
```

The sampled evidence-route estimate lands near the model-based value 30.75, not exactly on it, because this finite run still contains transition noise. The declining learning rate gives later samples less power to overwrite accumulated evidence.
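To illustrate why the shrinking step size matters, the sketch below estimates the request_evidence value from noisy samples twice: once with a constant step size and once with the visit-count schedule from the lab. The constant rate of 0.5, the number of runs, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_VALUE = 0.75 * 59.0 + 0.25 * -54.0  # 30.75, the model-based value

def final_estimate(rate_fn, samples):
    """Run the incremental update over all samples and return the last estimate."""
    estimate = 0.0
    for n, sample in enumerate(samples, start=1):
        estimate += rate_fn(n) * (sample - estimate)
    return estimate

def mean_abs_error(rate_fn, runs=200, steps=2000):
    """Average distance between the final estimate and the true value."""
    errors = []
    for _ in range(runs):
        # Each sample is 59 (evidence arrived) or -54 (abandoned).
        samples = np.where(rng.random(steps) < 0.75, 59.0, -54.0)
        errors.append(abs(final_estimate(rate_fn, samples) - TRUE_VALUE))
    return float(np.mean(errors))

constant_error = mean_abs_error(lambda n: 0.5)
declining_error = mean_abs_error(lambda n: n ** -0.6)
print(f"constant step size  mean |error| = {constant_error:.2f}")
print(f"visit-count schedule mean |error| = {declining_error:.2f}")
```

A constant step size keeps reacting to the most recent sample, so its final estimate stays noisy; the declining schedule averages that noise away as visits grow.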
Exploration matters. If an agent always selects its current best action, an unlucky first impression can prevent it from finding the genuinely stronger route. The next mini-lab collapses the completed routes into one-state action returns to isolate that point.
```python
import numpy as np

true_returns = {
    "auto_approve": -120.0,
    "request_evidence": 30.75,
    "human_review": 54.0,
}
action_order = list(true_returns)

def learn(epsilon, seed):
    rng = np.random.default_rng(seed)
    estimates = {action: 0.0 for action in action_order}
    visits = {action: 0 for action in action_order}
    for _ in range(400):
        if rng.random() < epsilon:
            action = rng.choice(action_order)
        else:
            action = max(action_order, key=lambda candidate: estimates[candidate])
        visits[action] += 1
        # Running mean of the observed return for this action.
        estimates[action] += (
            true_returns[action] - estimates[action]
        ) / visits[action]
    selected = max(action_order, key=lambda candidate: estimates[candidate])
    return visits, selected

for epsilon in (0.00, 0.10):
    visits, selected = learn(epsilon, seed=5)
    print(f"epsilon={epsilon:.2f} visits={visits} selected={selected}")
```

```
epsilon=0.00 visits={'auto_approve': 1, 'request_evidence': 399, 'human_review': 0} selected=request_evidence
epsilon=0.10 visits={'auto_approve': 19, 'request_evidence': 24, 'human_review': 357} selected=human_review
```

Unsafe actions can't be explored casually in a live approval system. Offline logs, conservative policy constraints, human approval, or simulation are needed before a learned policy controls costly decisions.
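One simple form of a conservative policy constraint is to mask disallowed actions before both exploration and greedy selection. The sketch below assumes a hypothetical `BLOCKED` table that forbids auto_approve for new requests; the table and the `choose_action` function are illustrative, not a prescribed safety mechanism.

```python
import numpy as np

# Hypothetical constraint: a new request may never be auto-approved by the learner.
BLOCKED = {"new_request": {"auto_approve"}}

def choose_action(state, estimates, epsilon, rng, blocked=BLOCKED):
    """Epsilon-greedy selection restricted to actions the constraint allows."""
    allowed = [a for a in estimates if a not in blocked.get(state, set())]
    if rng.random() < epsilon:
        return allowed[int(rng.integers(len(allowed)))]
    return max(allowed, key=lambda a: estimates[a])

rng = np.random.default_rng(0)
start_estimates = {"auto_approve": 0.0, "request_evidence": 0.0, "human_review": 0.0}
chosen = [choose_action("new_request", start_estimates, 0.10, rng) for _ in range(1000)]
print(sorted(set(chosen)))
```

Masking prevents the costly mistake from being sampled at all, at the price of never learning that action's value from live data; offline logs or simulation would have to supply it.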
An RL policy optimizes the reward it receives, not the intent in a system design document. A flawed score that rewards only short handling time can prefer unsupported automatic approvals.
```python
actions = ["auto_approve", "request_evidence", "human_review"]
# Flawed score: rewards only short handling time.
speed_only = {
    "auto_approve": 100,
    "request_evidence": 45,
    "human_review": 15,
}
# Score that reflects outcomes, using the rounded expected returns from earlier.
outcome_aware = {
    "auto_approve": -120,
    "request_evidence": 31,
    "human_review": 54,
}

for name, reward in [("speed_only", speed_only), ("outcome_aware", outcome_aware)]:
    choice = max(actions, key=lambda action: reward[action])
    print(f"{name:13} chooses {choice:13} reward={reward[choice]:4}")
```

```
speed_only    chooses auto_approve  reward= 100
outcome_aware chooses human_review  reward=  54
```

Useful reward design records user outcome, policy violations, delays, reviewer cost, abuse risk, and uncertainty. It also audits rare high-harm failures separately from average reward. A high mean return doesn't excuse repeated unsupported approvals.
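A reward that records several of these terms can be sketched as a function returning both the total and the per-term breakdown, so violations can be audited separately from the mean. The `Outcome` fields and every weight below are hypothetical teaching values, not a recommended business objective.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    resolved_correctly: bool
    policy_violation: bool
    delay_hours: float
    reviewer_minutes: float

# Hypothetical weights chosen only to make the arithmetic visible.
WEIGHTS = {
    "correct": 80.0,
    "violation": -200.0,
    "delay_per_hour": -0.5,
    "reviewer_per_minute": -1.0,
}

def score(outcome, weights=WEIGHTS):
    """Return (total reward, per-term breakdown) for one closed request."""
    terms = {
        "correct": weights["correct"] if outcome.resolved_correctly else 0.0,
        "violation": weights["violation"] if outcome.policy_violation else 0.0,
        "delay": weights["delay_per_hour"] * outcome.delay_hours,
        "reviewer": weights["reviewer_per_minute"] * outcome.reviewer_minutes,
    }
    return sum(terms.values()), terms

fast_violation = Outcome(False, True, delay_hours=0.0, reviewer_minutes=0.0)
reviewed = Outcome(True, False, delay_hours=6.0, reviewer_minutes=15.0)

violation_total, violation_terms = score(fast_violation)
reviewed_total, reviewed_terms = score(reviewed)
print(violation_total, violation_terms)
print(reviewed_total, reviewed_terms)
```

Keeping the breakdown makes it possible to count violations directly instead of inferring them from a blended average.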
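Tabular Q-learning stores a number for every state-action pair. One option for large state or action spaces is a parameterized policy $\pi_\theta(a \mid s)$. REINFORCE increases the probability of sampled actions whose returns exceed a baseline:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} (G_i - b)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)$$

Here $G_i$ is the return of sample $i$ and $b$ is the baseline.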
An action-independent baseline can lower variance without changing the expected policy gradient.[3]
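The baseline claim can be checked numerically for a one-state softmax policy: the expected score function is zero, so subtracting any constant from the returns leaves the exact gradient unchanged. The logits below are arbitrary illustrative values, and `exact_gradient` is a helper name introduced here.

```python
import numpy as np

def softmax(values):
    shifted = values - values.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

logits = np.array([0.2, -1.0, 0.7])  # arbitrary illustrative logits
returns = np.array([-120.0, 30.75, 54.0])
p = softmax(logits)

# Row a holds the gradient of log pi(a) with respect to the logits: e_a - p.
score = np.eye(len(p)) - p
expected_score = p @ score  # sum over actions of p[a] * score[a]

def exact_gradient(baseline):
    """Expectation over actions of (return - baseline) * score."""
    return p @ (score * (returns - baseline)[:, None])

no_baseline = exact_gradient(0.0)
mean_baseline = exact_gradient(returns.mean())
print(expected_score)
print(no_baseline)
print(mean_baseline)
```

The two gradients agree in expectation; the benefit of the baseline shows up only in the variance of sampled estimates.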
To expose the arithmetic without a neural network, the next batch includes one completed route for each action and uses their mean return as the baseline. It reuses softmax from classification to turn one logit per action into action probabilities. The policy gradient then shifts probability toward actions with positive advantage and away from actions with negative advantage. Real training estimates that gradient from sampled trajectories.
```python
import numpy as np

actions = ["auto_approve", "request_evidence", "human_review"]
returns = np.array([-120.0, 30.75, 54.0])
logits = np.zeros(len(actions))

def softmax(values):
    shifted = values - values.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

probabilities = softmax(logits)
advantages = returns - returns.mean()
# Row i is the gradient of log pi(action i) with respect to the logits.
score_gradients = np.eye(len(actions)) - probabilities
gradient = (score_gradients * advantages[:, None]).mean(axis=0)
updated = softmax(logits + 0.03 * gradient)

for action, before, after in zip(actions, probabilities, updated):
    print(f"{action:13} before={before:.3f} after={after:.3f}")
```

```
auto_approve  before=0.333 after=0.089
request_evidence before=0.333 after=0.403
human_review  before=0.333 after=0.508
```

For a language model, the state can include a prompt and generated prefix, while actions are token choices. In InstructGPT, human comparisons trained a reward model and PPO optimized a policy against that learned reward with additional controls; that's a much larger system, but the underlying sequence-and-reward vocabulary begins here.[4]
Training reward isn't a deployment guarantee. A routing policy should be compared on later requests whose outcomes didn't influence its rules, rewards, or thresholds. Use a train, validation, and test split: fit on earlier episodes, tune on separate episodes, then reserve a final untouched period. The next chapter treats that evaluation design carefully; this small table shows the question to ask.
```python
# Each episode: (request kind, return of each action for that kind).
episodes = [
    ("required_review", {"auto_approve": -120, "request_evidence": 31, "human_review": 54}),
    ("required_review", {"auto_approve": -120, "request_evidence": 31, "human_review": 54}),
    ("clear_evidence", {"auto_approve": 70, "request_evidence": 59, "human_review": 54}),
    ("clear_evidence", {"auto_approve": 70, "request_evidence": 59, "human_review": 54}),
]
policies = {
    "always_auto": lambda kind: "auto_approve",
    "always_review": lambda kind: "human_review",
    "risk_aware": lambda kind: (
        "human_review" if kind == "required_review" else "auto_approve"
    ),
}

for name, policy in policies.items():
    returns = [reward[policy(kind)] for kind, reward in episodes]
    severe_errors = sum(value <= -100 for value in returns)
    print(
        f"{name:13} mean={sum(returns) / len(returns):5.1f} "
        f"severe_errors={severe_errors}"
    )
```

```
always_auto   mean=-25.0 severe_errors=2
always_review mean= 54.0 severe_errors=0
risk_aware    mean= 62.0 severe_errors=0
```

This table is illustrative, not evidence that a production policy works. Real evaluation must prevent future outcomes, repeat users, and policy-tuning decisions from leaking into the held-out estimate.
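One conservative way to build such a split is to cut by decision date and then drop later-period episodes from users already seen in an earlier period. The sketch below uses synthetic episodes; `temporal_split`, the cutoff dates, and the drop-repeat-users rule are assumptions for illustration, and other designs are possible.

```python
from datetime import date

# Hypothetical logged episodes: (user_id, decision_date). Values are synthetic.
episodes = [
    ("u1", date(2024, 1, 5)),
    ("u2", date(2024, 1, 20)),
    ("u3", date(2024, 2, 3)),
    ("u1", date(2024, 2, 10)),
    ("u4", date(2024, 2, 25)),
    ("u5", date(2024, 3, 8)),
    ("u3", date(2024, 3, 15)),
    ("u6", date(2024, 3, 28)),
]

def temporal_split(episodes, train_end, validation_end):
    """Split by date, then remove later episodes from users seen in earlier splits."""
    ordered = sorted(episodes, key=lambda episode: episode[1])
    train = [e for e in ordered if e[1] <= train_end]
    seen = {user for user, _ in train}
    validation = [
        e for e in ordered if train_end < e[1] <= validation_end and e[0] not in seen
    ]
    seen |= {user for user, _ in validation}
    test = [e for e in ordered if e[1] > validation_end and e[0] not in seen]
    return train, validation, test

train, validation, test = temporal_split(
    episodes, train_end=date(2024, 1, 31), validation_end=date(2024, 2, 29)
)
for name, split in [("train", train), ("validation", validation), ("test", test)]:
    print(name, [user for user, _ in split])
```

Dropping repeat users shrinks the held-out sets, so in practice the trade-off between leakage risk and sample size has to be weighed for the specific workflow.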
- In stochastic-transition-backup.py, at what abandonment probability does human_review overtake request_evidence?
- Add a deny_request action and choose rewards that discourage both unsupported approvals and unfair denials. Explain every term.

- With abandonment probability p, evidence-request value is (1 - p)(59) + p(-54) = 59 - 113p. Human review overtakes it when p > 5/113, about 4.4%.
- A deny_request reward should include policy correctness and user harm, not processing speed alone. Check whether any shortcut action wins for the wrong reason.
- Fields such as evidence_received_at, review outcome, approval reversal, and events caused by router action are recorded only after the decision. Keep the decision-time snapshot separate from the later outcome log.
[1] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
[2] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning.
[3] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning.
[4] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.