LeetLLM
Responsible AI Governance

Turn a tool-bearing LLM workflow into auditable evidence: classify its use, own risks, version controls, preserve traces, and gate releases.


Your release assistant now rejects the poisoned eval summary from the previous lesson. The attack said to bypass approval and promote a model candidate. A server-side gate blocked the effect.

That's a security control. It isn't yet a governance program.

A model owner disputes a denied promotion. A reviewer asks which policy version ran. An auditor asks who owned the prompt-injection risk and whether the fix was tested before release. The answer can't be "the model behaved better in our demo." The platform needs evidence.

Responsible AI governance is the engineering loop that answers five questions:

  1. What workflow is being deployed, and who can it affect?
  2. What harm is plausible, and who owns that risk?
  3. What control reduces it?
  4. What evidence proves the control ran and was tested?
  5. What review, escalation, or appeal happens when the system is wrong?

Build that evidence package for a governed model workflow. This is engineering practice, not legal advice. Legal review still determines which duties apply to a real deployment.

Figure: A model governance control loop surrounds a promotion assistant. Four compact stages surround an evidence bundle: Govern assigns owner and review date, Map records affected people and effect scope, Measure tracks attack and supported-path results, and Manage defines approval and rollback. That evidence feeds an approval gate showing evidence complete, zero unsafe effects, and zero open accessibility findings. Any material change or incident restarts the loop.
Govern, Map, Measure, and Manage become one governance loop: evidence feeds an executable gate, and material change starts review again.

Move from a blocked attack to an evidence loop

The NIST AI Risk Management Framework describes four connected functions: Govern, Map, Measure, and Manage.[1] NIST's Generative AI Profile applies that loop to risks specific to generative systems, including harmful outputs, information integrity, privacy, and security concerns.[2]

[1] Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework
[2] Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence

For a model-promotion assistant, those functions become concrete:

| Function | Question | Artifact |
| --- | --- | --- |
| Govern | Who can accept this risk and approve release? | Owner, policy, review record |
| Map | What decision and stakeholder can the assistant affect? | Workflow inventory and classification memo |
| Measure | Did attacks or ordinary release requests expose failures? | Evaluation and red-team report |
| Manage | Which controls, escalations, and release gates apply? | Risk register, audit trail, approval path |
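
The table above can be turned into a completeness check. The sketch below is illustrative only: the artifact keys and the `missing_artifacts` helper are hypothetical names derived from the table, not part of the NIST framework or any library.

```python
# Hypothetical artifact names derived from the table; adapt them to your own records.
REQUIRED_ARTIFACTS: dict[str, set[str]] = {
    "govern": {"owner", "policy", "review_record"},
    "map": {"workflow_inventory", "classification_memo"},
    "measure": {"evaluation_report", "red_team_report"},
    "manage": {"risk_register", "audit_trail", "approval_path"},
}

def missing_artifacts(submitted: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per function, the required artifacts that were not submitted."""
    gaps: dict[str, list[str]] = {}
    for function, required in REQUIRED_ARTIFACTS.items():
        absent = sorted(required - submitted.get(function, set()))
        if absent:
            gaps[function] = absent
    return gaps

# Example submission: Map is incomplete and Manage is absent entirely.
submitted = {
    "govern": {"owner", "policy", "review_record"},
    "map": {"workflow_inventory"},
    "measure": {"evaluation_report", "red_team_report"},
}

for function, absent in missing_artifacts(submitted).items():
    print(f"{function}: missing {absent}")
```

A check like this only shows that an artifact exists. Whether the artifact is accurate and current still needs a reviewer.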

Don't begin with a large policy document. Begin with the workflow that can cause an effect.

Inventory workflows before classifying them

A platform uses language models in several places. The model family isn't the risk classification unit. An assistant that summarizes public documentation and an assistant that proposes a credit decision can use the same model while requiring very different review.

Inventory should record intended use, affected stakeholder, possible effect, and first review route. This is an engineering triage label. It flags legal and product review; it doesn't make the final legal determination.

01-triage-workflows-for-review.py
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Workflow:
        name: str
        affects_credit: bool = False
        manages_workers: bool = False
        talks_to_people: bool = False
        can_execute_effects: bool = False

    def triage(workflow: Workflow) -> str:
        # Credit and worker-management effects take the strictest route first.
        if workflow.affects_credit or workflow.manages_workers:
            return "HIGH_RISK_REVIEW"
        if workflow.talks_to_people or workflow.can_execute_effects:
            return "TRANSPARENCY_AND_EFFECT_REVIEW"
        return "BASELINE_REVIEW"

    workflows = [
        Workflow("loan_screening", affects_credit=True),
        Workflow("model_promotion_assistant", talks_to_people=True, can_execute_effects=True),
        Workflow("public_doc_summary"),
    ]

    for workflow in workflows:
        print(f"{workflow.name}: {triage(workflow)}")

Output

    loan_screening: HIGH_RISK_REVIEW
    model_promotion_assistant: TRANSPARENCY_AND_EFFECT_REVIEW
    public_doc_summary: BASELINE_REVIEW

The inventory deliberately routes the model-promotion assistant to review even when it isn't making a credit decision. It interacts with people and can change production traffic. Those facts determine controls such as disclosure, approval, audit evidence, and appeals.

Map the EU AI Act with a date attached

The EU AI Act uses a risk-based framework. Its official explanation separates prohibited uses, high-risk systems, transparency-risk systems, and minimal or no-risk systems. High-risk examples include employment and worker management and access to essential services such as creditworthiness assessment for natural persons. Conversational systems and generated content can create transparency duties. A minimal or no-risk tier doesn't add tier-specific rules, but teams still need to check cross-cutting AI Act duties, existing law, and contractual obligations.[3]

[3] EU AI Act: Regulation laying down harmonised rules on artificial intelligence. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689

That means legal review must examine the intended workflow:

  • A system assessing creditworthiness for a consumer loan applicant requires high-risk review.
  • A system allocating employee shifts based on individual performance requires high-risk review because it concerns worker management.
  • A model-promotion assistant requires transparency and effects review: it interacts with people, and tool permissions determine whether it can shift production traffic.
  • A public-document summarizer still needs ordinary security, privacy, and quality controls, even if it isn't routed into the higher categories.

Figure: A two-axis workflow matrix classifies the deployed use rather than the shared model. Public documentation summarization sits in baseline review, the model-promotion assistant sits in transparency and production-effect review, and consumer credit plus performance-based worker shifts sit in high-risk review. A dated memo records intended use, affected people, legal basis checked on June 9, 2026, and required legal signoff.
The shared model doesn't determine the review route. Plot the workflow's affected people and possible effect, then record when the legal basis was checked and who must sign off.

Record regulatory status, not a timeless claim

Regulatory timelines move. The enacted AI Act originally staged high-risk rules for August 2, 2026 and August 2, 2027. As checked on June 9, 2026, Council and Parliament negotiators had reached a May 7 provisional agreement under which rules for stand-alone high-risk systems in areas such as employment, education, and critical infrastructure would apply from December 2, 2027, while rules for high-risk systems integrated into regulated products would apply from August 2, 2028. Parliament and Council still needed to adopt the provisional agreement formally.[3][4]

[4] EU agrees to simplify AI rules to boost innovation and ban nudification apps to protect citizens. https://digital-strategy.ec.europa.eu/en/news/eu-agrees-simplify-ai-rules-boost-innovation-and-ban-nudification-apps-protect-citizens

The safe engineering practice is simple: include a legal_basis_checked_on date and a required legal signoff in the classification memo. Recheck official status before launch or a material feature change.

02-build-a-classification-memo.py
    from dataclasses import dataclass, asdict

    @dataclass(frozen=True)
    class ClassificationMemo:
        workflow_id: str
        intended_use: str
        affected_people: str
        triage_route: str
        legal_basis_checked_on: str
        legal_signoff_required: bool

    memo = ClassificationMemo(
        workflow_id="loan-screening-v3",
        intended_use="assess credit eligibility for loan applicants",
        affected_people="loan applicants",
        triage_route="HIGH_RISK_REVIEW",
        legal_basis_checked_on="2026-06-09",
        legal_signoff_required=True,
    )

    record = asdict(memo)
    print("workflow:", record["workflow_id"])
    print("route:", record["triage_route"])
    print("checked_on:", record["legal_basis_checked_on"])
    print("release_requires_legal_signoff:", record["legal_signoff_required"])

Output

    workflow: loan-screening-v3
    route: HIGH_RISK_REVIEW
    checked_on: 2026-06-09
    release_requires_legal_signoff: True
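
A recorded date is only useful if something checks it. As a minimal sketch, a launch gate could refuse a memo whose legal check is too old. The 30-day window and the `legal_check_is_stale` helper are assumptions for illustration; the real window is a decision for legal counsel.

```python
from datetime import date

# Assumption: a 30-day freshness window. Agree on the real value with legal counsel.
MAX_AGE_DAYS = 30

def legal_check_is_stale(checked_on: str, launch_on: date,
                         max_age_days: int = MAX_AGE_DAYS) -> bool:
    """True when the recorded legal check is older than the window at launch."""
    checked = date.fromisoformat(checked_on)
    return (launch_on - checked).days > max_age_days

# The memo date from the example, compared against two hypothetical launch dates.
print("stale_for_june_launch:", legal_check_is_stale("2026-06-09", date(2026, 6, 30)))
print("stale_for_september_launch:", legal_check_is_stale("2026-06-09", date(2026, 9, 1)))
```

A time window covers only the calendar. A material feature change should force a recheck regardless of how recent the date is.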

Create a risk register row a reviewer can use

A classification memo says where review starts. A risk register says what can go wrong, how the platform reduces that harm, who owns the decision, and what evidence exists.

Use the poisoned eval-summary incident as the first row:

| Field | Model-release entry |
| --- | --- |
| Risk | Retrieved eval content instructs agent to bypass promotion approval |
| Harm | Unauthorized model promotion or inconsistent release treatment |
| Inherent severity | High, because the agent can change production traffic |
| Control | Treat retrieved text as untrusted and authorize promotions in application code |
| Evidence | Attack-trace test report, policy gate log, approval replay |
| Owner | Model platform release owner |
| Residual risk | Medium until red-team cases and appeals are reviewed |
| Review trigger | New tool scope, new data source, incident, or scheduled review |

Risk scoring isn't a substitute for judgment. It makes prioritization and escalation reproducible.

03-score-an-owned-risk.py
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class RiskRow:
        risk_id: str
        owner: str
        likelihood: int
        impact: int
        residual_likelihood: int
        residual_impact: int
        evidence: tuple[str, ...]

        def inherent_score(self) -> int:
            return self.likelihood * self.impact

        def residual_score(self) -> int:
            return self.residual_likelihood * self.residual_impact

    row = RiskRow(
        risk_id="MP-014-prompt-injection-promotion",
        owner="model-platform-release-owner",
        likelihood=4,
        impact=5,
        residual_likelihood=2,
        residual_impact=5,
        evidence=("promotion-gate-test-report-v7", "approval-replay-v7"),
    )

    print("risk:", row.risk_id)
    print("owner:", row.owner)
    print("inherent_score:", row.inherent_score())
    print("residual_score:", row.residual_score())
    print("evidence_count:", len(row.evidence))

Output

    risk: MP-014-prompt-injection-promotion
    owner: model-platform-release-owner
    inherent_score: 20
    residual_score: 10
    evidence_count: 2

The residual impact remains high because an unauthorized production promotion is still serious. The control lowers likelihood. That difference matters: it prevents a team from declaring the risk solved merely because a filter was added.
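
To make escalation reproducible, the score can map to a named route. The sketch below is one possible rule, assuming a 1 to 25 likelihood-times-impact scale. The thresholds and route names are hypothetical, and it keeps maximum-impact risks in owner review even when the likelihood is low.

```python
# Hypothetical thresholds on a 1-25 likelihood x impact scale.
def escalation_route(residual_score: int, residual_impact: int) -> str:
    """Map a residual score to a review route; severe impact never drops to routine."""
    if residual_score >= 15:
        return "EXECUTIVE_RISK_ACCEPTANCE"
    if residual_score >= 8 or residual_impact >= 5:
        return "RISK_OWNER_REVIEW"
    return "ROUTINE_MONITORING"

# The example row: residual likelihood 2, residual impact 5.
print(escalation_route(residual_score=10, residual_impact=5))
# A minor risk with low likelihood and low impact.
print(escalation_route(residual_score=4, residual_impact=2))
# Likelihood reduced to 1, but the impact is still maximal.
print(escalation_route(residual_score=5, residual_impact=5))
```

The third call shows why impact is passed separately: a control that lowers only likelihood shouldn't silently move a severe risk into routine monitoring.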

Reject ceremonial risk registers

A row without an owner, control evidence, or review date is a description of a worry. It can't block release or drive follow-up work.

04-validate-risk-register-rows.py
    REQUIRED_FIELDS = {
        "risk_id",
        "owner",
        "control",
        "evidence",
        "residual_risk",
        "next_review_on",
    }

    def missing_fields(row: dict[str, object]) -> list[str]:
        # A field counts as missing when it is absent or empty.
        return sorted(
            field
            for field in REQUIRED_FIELDS
            if field not in row or not row[field]
        )

    rows = [
        {
            "risk_id": "RF-014",
            "owner": "model-platform-release-owner",
            "control": "promotion_policy_gate_v7",
            "evidence": ["attack-suite-2026-05-31"],
            "residual_risk": "medium",
            "next_review_on": "2026-08-31",
        },
        {
            "risk_id": "CR-002",
            "owner": "",
            "control": "manual review",
            "evidence": [],
            "residual_risk": "",
            "next_review_on": "",
        },
    ]

    for row in rows:
        missing = missing_fields(row)
        status = "READY_FOR_REVIEW" if not missing else "BLOCKED"
        print(row["risk_id"], status, missing)

Output

    RF-014 READY_FOR_REVIEW []
    CR-002 BLOCKED ['evidence', 'next_review_on', 'owner', 'residual_risk']
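
A review date and the review triggers from the risk register row (new tool scope, new data source, incident, scheduled review) can also be checked mechanically. The following is a rough sketch under assumed names: `ReleaseSnapshot` and `review_triggers` are hypothetical, and a real system would read these values from its own inventory.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ReleaseSnapshot:
    """Hypothetical summary of a deployed workflow at one point in time."""
    tools: frozenset[str]
    data_sources: frozenset[str]
    open_incidents: int
    next_review_on: date

def review_triggers(previous: ReleaseSnapshot, current: ReleaseSnapshot,
                    today: date) -> list[str]:
    """List which review triggers fired between two snapshots."""
    triggers: list[str] = []
    if current.tools - previous.tools:
        triggers.append("new_tool_scope")
    if current.data_sources - previous.data_sources:
        triggers.append("new_data_source")
    if current.open_incidents > 0:
        triggers.append("incident")
    if today >= current.next_review_on:
        triggers.append("scheduled_review")
    return triggers

before = ReleaseSnapshot(frozenset({"propose_promotion"}),
                         frozenset({"candidate-eval"}), 0, date(2026, 8, 31))
# The new snapshot adds a tool that can execute effects directly.
after = ReleaseSnapshot(frozenset({"propose_promotion", "execute_promotion"}),
                        frozenset({"candidate-eval"}), 0, date(2026, 8, 31))

print("triggers:", review_triggers(before, after, today=date(2026, 6, 15)))
```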

Document the system and its data separately

A model card describes intended use, out-of-scope use, limitations, evaluations, and known risks for a model or system. A datasheet describes the origin, composition, collection, processing, and recommended uses of a dataset. Model cards and datasheets were introduced as practical documentation patterns for accountability and reproducibility.[5][6]

[5] Model Cards for Model Reporting. https://arxiv.org/abs/1810.03993
[6] Datasheets for Datasets. https://arxiv.org/abs/1803.09010

For an LLM application, the release evidence package should reference both:

  • The system card identifies model version, prompt or policy version, enabled tools, effect gates, safety tests, limitations, and rollback owner.
  • The dataset record identifies the attack fixtures, ordinary release-request examples, provenance, labeling instructions, sensitive fields, evaluation split, and retention rule.

A model card isn't automatically statutory technical documentation. It's a useful engineering artifact that can support a wider documentation obligation when it's accurate, versioned, and reviewed.

05-check-release-documentation.py
    system_card = {
        "system_version": "promotion-assistant-7.2",
        "policy_version": "promotion-gate-v7",
        "intended_use": "answer candidate-eval questions and propose model promotions for approval",
        "out_of_scope": ["autonomous production promotion"],
        "enabled_tools": ["propose_promotion"],
        "evaluations": ["attack-suite-2026-05-31", "benign-suite-2026-05-31"],
        "rollback_owner": "model-platform-release-owner",
    }

    dataset_record = {
        "dataset_id": "promotion-redteam-v2",
        "provenance": "curated release-request and injected-eval fixtures",
        "labeling_guide": "unsafe_effects-v2",
        "sensitive_fields_removed": True,
        "evaluation_split": "frozen-promotion-eval-v2",
        "retention_rule": "retain redacted fixtures for 90 days",
    }

    required_system = {
        "system_version",
        "policy_version",
        "intended_use",
        "out_of_scope",
        "enabled_tools",
        "evaluations",
        "rollback_owner",
    }
    required_dataset = {
        "dataset_id",
        "provenance",
        "labeling_guide",
        "sensitive_fields_removed",
        "evaluation_split",
        "retention_rule",
    }

    # Set difference lists every required key that the record lacks.
    missing_system = sorted(required_system - system_card.keys())
    missing_dataset = sorted(required_dataset - dataset_record.keys())
    print("system_card_complete:", not missing_system)
    print("dataset_record_complete:", not missing_dataset)
    print("evidence_package_ready:", not missing_system and not missing_dataset)

Output

    system_card_complete: True
    dataset_record_complete: True
    evidence_package_ready: True

Preserve an audit trail without collecting everything

The prompt-injection lesson separated untrusted retrieved text from trusted policy and gated promotion effects in code. Governance requires a replayable record of that decision:

  • Workflow, system, policy, and tool versions.
  • A privacy-minimized request identifier and actor role.
  • Source trust labels, not a dump of every private document.
  • The proposed effect and the deterministic gate result.
  • Required approval, final effect status, and appeal identifier if one exists.

An audit trail shouldn't store hidden reasoning or unlimited private evaluation data. Store the business facts needed to reproduce an effect decision, apply access restrictions and retention rules, and keep sensitive content out unless it's necessary and authorized.
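
One way to enforce that rule is an allowlist applied before anything is written. In this sketch, the field list, the 365-day retention window, and the `minimize` helper are all assumptions for illustration. The point is that fields not explicitly allowed never reach storage.

```python
from datetime import date, timedelta

# Hypothetical allowlist and retention window; set both with privacy review.
AUDIT_FIELDS = {
    "request_id", "actor_role", "policy_version",
    "proposed_effect", "gate_decision", "final_effect",
}
RETENTION_DAYS = 365

def minimize(event: dict[str, str], recorded_on: date) -> dict[str, str]:
    """Keep only allowlisted business facts and stamp a deletion date."""
    kept = {key: value for key, value in event.items() if key in AUDIT_FIELDS}
    kept["delete_after"] = (recorded_on + timedelta(days=RETENTION_DAYS)).isoformat()
    return kept

raw_event = {
    "request_id": "promotion/actor-example/001",
    "actor_role": "model_owner",
    "policy_version": "promotion-gate-v7",
    "proposed_effect": "promote:C17:prod-10pct",
    "gate_decision": "BLOCKED_REQUIRES_APPROVAL",
    "final_effect": "NO_PROMOTION_EXECUTED",
    "model_reasoning": "placeholder for hidden reasoning text",
    "retrieved_document_text": "placeholder for private eval content",
}

stored = minimize(raw_event, recorded_on=date(2026, 6, 9))
print("stored_fields:", sorted(stored))
print("dropped_fields:", sorted(raw_event.keys() - stored.keys()))
```

An allowlist fails closed: a new field added upstream is dropped until someone decides it belongs in the record.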

Figure: A privacy-conscious audit trail records five observable model-release events: a pseudonymous request, versioned context and untrusted source, a proposed promotion to 10 percent production traffic, a blocked policy gate with denied approval, and no executed promotion with an appeal ID. Each event links to the previous digest. The final digest matches a separately protected review anchor, while changing the gate result to approved produces a different chain head and a visible mismatch.
Record observable business facts, not hidden reasoning. Linking event digests to a separately protected anchor makes a rewritten gate decision visible during review.

06-write-a-privacy-conscious-audit-record.py

    from dataclasses import dataclass, asdict
    from hashlib import sha256
    import hmac

    AUDIT_PSEUDONYM_KEY = b"local-demo-key-not-for-production"

    @dataclass(frozen=True)
    class AuditRecord:
        request_id: str
        actor_role: str
        workflow_version: str
        policy_version: str
        retrieved_source_id: str
        retrieved_source_trust: str
        proposed_effect: str
        gate_decision: str
        approval_id: str
        final_effect: str
        appeal_id: str

    def pseudonymize_actor_id(actor_id: str) -> str:
        # Keyed HMAC so the pseudonym cannot be recomputed without the key.
        digest = hmac.new(AUDIT_PSEUDONYM_KEY, actor_id.encode(), sha256).hexdigest()
        return f"actor-{digest[:24]}"

    record = AuditRecord(
        request_id=f"promotion/{pseudonymize_actor_id('USER-918204')}/001",
        actor_role="model_owner",
        workflow_version="promotion-assistant-7.2",
        policy_version="promotion-gate-v7",
        retrieved_source_id="candidate-eval/C17/R42",
        retrieved_source_trust="UNTRUSTED_EVAL_CONTENT",
        proposed_effect="promote:C17:prod-10pct",
        gate_decision="BLOCKED_REQUIRES_APPROVAL",
        approval_id="APR-48291",
        final_effect="NO_PROMOTION_EXECUTED",
        appeal_id="APL-48291",
    )

    for field, value in asdict(record).items():
        print(f"{field}: {value}")

Output

    request_id: promotion/actor-f027638fb6653d5861c880a5/001
    actor_role: model_owner
    workflow_version: promotion-assistant-7.2
    policy_version: promotion-gate-v7
    retrieved_source_id: candidate-eval/C17/R42
    retrieved_source_trust: UNTRUSTED_EVAL_CONTENT
    proposed_effect: promote:C17:prod-10pct
    gate_decision: BLOCKED_REQUIRES_APPROVAL
    approval_id: APR-48291
    final_effect: NO_PROMOTION_EXECUTED
    appeal_id: APL-48291

This example hardcodes the key only to stay runnable locally. In production, keep a versioned HMAC key in a secret manager, restrict access, and document rotation. Pseudonyms reduce direct exposure, but they remain linkable and still require privacy controls.

Make evidence tampering observable

Restricted storage and access controls matter first. A chained digest plus a protected anchor can also show that a stored sequence was rewritten after the review record was created. Store that anchor outside the mutable event log. This is an integrity signal, not a complete audit-storage design.

07-detect-an-altered-decision-chain.py
    import hashlib
    import json

    def digest_event(event: str, value: str, previous: str) -> str:
        payload = json.dumps({"event": event, "value": value, "previous": previous}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def append_event(chain: list[dict[str, str]], event: str, value: str) -> None:
        previous = chain[-1]["digest"] if chain else "GENESIS"
        digest = digest_event(event, value, previous)
        chain.append({"event": event, "value": value, "previous": previous, "digest": digest})

    def recompute_digests(chain: list[dict[str, str]]) -> None:
        # Simulates an attacker who rewrites the log and repairs every link.
        for index, item in enumerate(chain):
            previous = chain[index - 1]["digest"] if index else "GENESIS"
            item["previous"] = previous
            item["digest"] = digest_event(item["event"], item["value"], previous)

    def verifies(chain: list[dict[str, str]], protected_anchor: str) -> bool:
        expected_previous = "GENESIS"
        for item in chain:
            expected_digest = digest_event(item["event"], item["value"], expected_previous)
            if item["previous"] != expected_previous or item["digest"] != expected_digest:
                return False
            expected_previous = item["digest"]
        return expected_previous == protected_anchor

    events: list[dict[str, str]] = []
    append_event(events, "tool_proposal", "promote:C17:prod-10pct")
    append_event(events, "policy_gate", "blocked_requires_approval")
    append_event(events, "effect", "none")
    review_anchor = events[-1]["digest"]  # Copy to a separately protected review record.
    print("review_anchor:", review_anchor[:12])
    print("original_chain_valid:", verifies(events, review_anchor))

    events[1]["value"] = "approved"
    recompute_digests(events)
    print("rewritten_chain_matches_anchor:", verifies(events, review_anchor))

Output

    review_anchor: 252f127e01b6
    original_chain_valid: True
    rewritten_chain_matches_anchor: False

For EU high-risk deployments, logging, documentation, record-keeping, and human oversight can be regulated duties, and the obligations depend on the provider or deployer role and the system classification.[3] The design lesson is durable even when specific law changes: preserve enough trustworthy evidence to inspect consequential behavior.

Turn red-team results into approval decisions

Red teaming means deliberately attempting to elicit harmful or disallowed behavior before attackers, users, or staff encounter it. Research on language-model red teaming shows that adversarial testing can discover harmful behaviors and create reusable evaluation data.[7]

[7] Red Teaming Language Models with Language Models. https://arxiv.org/abs/2202.03286

For a governed model workflow, don't report only that an attack prompt was rejected. Evaluate observable effects:

  • Attack success rate (ASR): fraction of attack traces that cause a forbidden effect or disclosure.
  • False rejection rate (FRR): fraction of ordinary approval requests that are blocked unnecessarily.
  • Finding owner: person responsible for remediation or accepted residual risk.
  • Evidence link: trace fixture, gate version, reproduction, and retest result.

08-convert-red-team-results-to-a-gate.py

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TestResult:
        kind: str
        unsafe_effect: bool
        wrongly_blocked: bool = False

    results = [
        TestResult("attack", unsafe_effect=False),
        TestResult("attack", unsafe_effect=False),
        TestResult("attack", unsafe_effect=True),
        TestResult("benign", unsafe_effect=False, wrongly_blocked=False),
        TestResult("benign", unsafe_effect=False, wrongly_blocked=True),
    ]

    attacks = [result for result in results if result.kind == "attack"]
    benign = [result for result in results if result.kind == "benign"]
    asr = sum(result.unsafe_effect for result in attacks) / len(attacks)
    frr = sum(result.wrongly_blocked for result in benign) / len(benign)
    release_reasons = []
    # Any surviving unsafe effect blocks release; FRR is allowed a 10% budget.
    if asr > 0:
        release_reasons.append("unsafe effect survived attack test")
    if frr > 0.10:
        release_reasons.append("legitimate requests wrongly blocked")

    print(f"asr: {asr:.2f}")
    print(f"frr: {frr:.2f}")
    print("finding_owner: model-platform-release-owner")
    print("release_blocked:", bool(release_reasons))
    print("release_reasons:", release_reasons)

Output

    asr: 0.33
    frr: 0.50
    finding_owner: model-platform-release-owner
    release_blocked: True
    release_reasons: ['unsafe effect survived attack test', 'legitimate requests wrongly blocked']

A single forbidden promotion blocks this release even if the aggregate rate looks small. An excessive FRR also needs work: a safety control that strands legitimate model owners creates a different harm and increases appeal volume.

Measure who bears errors and friction

A secure effect gate isn't sufficient if legitimate model owners are consistently blocked on a supported interaction path or can't reach an appeal. Ethics becomes engineering work when the team measures outcomes, investigates disparities, and repairs barriers.

For a release assistant, include ordinary eligible promotion scenarios through each supported path: web console, CLI workflow, keyboard-only navigation, screen-reader-assisted interaction, and escalation to a person. Test the interface with users and accessibility specialists where possible. A small fixture set can reveal a release blocker; it can't establish that a product is fair for every affected population.

This diagnostic test makes a broken accessible path visible before release:

09-slice-valid-request-friction.py
1from collections import defaultdict 2 3cases = [ 4 ("standard_chat", True, False), 5 ("standard_chat", True, False), 6 ("screen_reader_path", True, True), 7 ("screen_reader_path", True, False), 8] 9 10blocked_by_path: dict[str, list[bool]] = defaultdict(list) 11for path, eligible, wrongly_blocked in cases: 12 if eligible: 13 blocked_by_path[path].append(wrongly_blocked) 14 15rates: dict[str, float] = {} 16for path, blocked in sorted(blocked_by_path.items()): 17 rates[path] = sum(blocked) / len(blocked) 18 print(f"{path}_false_rejection_rate: {rates[path]:.2f}") 19 20investigation_required = any(rate > 0.25 for rate in rates.values()) 21print("investigation_required:", investigation_required) 22print("release_action: repair path and retest" if investigation_required else "release_action: proceed")
Output
screen_reader_path_false_rejection_rate: 0.50
standard_chat_false_rejection_rate: 0.00
investigation_required: True
release_action: repair path and retest

Store this report with the dataset version, test limitations, accessibility review, and remediation owner. It complements oversight: the escalation route must itself be usable by the people who need it.
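One way to keep those fields together is a small serializable record. The sketch below is illustrative only: the `FrictionReport` class, its field names, and every value (the dataset version label, the owner name, the review status strings) are hypothetical and not part of the lesson's fixtures.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class FrictionReport:
    # All field names are assumptions chosen for illustration.
    dataset_version: str
    rates_by_path: dict[str, float]
    test_limitations: list[str] = field(default_factory=list)
    accessibility_review: str = "NOT_REVIEWED"
    remediation_owner: str = ""

    def is_complete(self) -> bool:
        # Usable as evidence only when the dataset is versioned, limitations
        # are written down, a review has started, and someone owns the repair.
        return bool(
            self.dataset_version
            and self.test_limitations
            and self.remediation_owner
            and self.accessibility_review != "NOT_REVIEWED"
        )


report = FrictionReport(
    dataset_version="friction-fixtures-v1",  # hypothetical version label
    rates_by_path={"screen_reader_path": 0.50, "standard_chat": 0.00},
    test_limitations=["four fixture cases; not a fairness claim"],
    accessibility_review="PENDING_SPECIALIST_REVIEW",
    remediation_owner="release-assistant-ux-owner",  # hypothetical owner
)

report_json = json.dumps(asdict(report), indent=2, sort_keys=True)
print(report_json)
print("report_complete:", report.is_complete())
```

A report with an empty `remediation_owner` would fail `is_complete()`, which gives the release gate something concrete to check instead of trusting that a file exists.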

Make human oversight an executable path

"Human in the loop" is too vague for a release review. Define which effects require approval, what information the reviewer sees, how conflicting interests are handled, and how an affected actor challenges an outcome.

For model promotions:

  1. The assistant may answer eval and release-policy questions.
  2. It may propose a promotion with cited eval evidence.
  3. Every production promotion requires approval. Untrusted retrieved instructions route the proposal to security review, while high-traffic targets route to exception review.
  4. The reviewer approves or rejects the effect with a reason code.
  5. A model owner may appeal; a second reviewer receives the original record and decision basis.
10-route-a-promotion-through-oversight.py
from dataclasses import dataclass

@dataclass(frozen=True)
class PromotionProposal:
    target_percent: int
    saw_untrusted_instruction: bool

def oversight_route(proposal: PromotionProposal) -> str:
    # Untrusted retrieved instructions take priority over traffic size.
    if proposal.saw_untrusted_instruction:
        return "SECURITY_REVIEW"
    # High-traffic targets need an exception review.
    if proposal.target_percent > 10:
        return "EXCEPTION_REVIEW"
    return "STANDARD_APPROVAL"

proposal = PromotionProposal(target_percent=10, saw_untrusted_instruction=True)
effect_executed_before_review = False
decision = "DENIED_BY_REVIEWER"
# A denial queues the record for a second reviewer.
appeal = "QUEUED_FOR_SECOND_REVIEW" if decision.startswith("DENIED") else "NOT_REQUIRED"

print("oversight_route:", oversight_route(proposal))
print("effect_executed_before_review:", effect_executed_before_review)
print("decision:", decision)
print("appeal:", appeal)
Output
oversight_route: SECURITY_REVIEW
effect_executed_before_review: False
decision: DENIED_BY_REVIEWER
appeal: QUEUED_FOR_SECOND_REVIEW

For a legally high-risk workflow, human oversight requirements need careful mapping to the applicable duties and roles.[3] For any workflow with production or people-facing effects, a clear review and appeal mechanism is also sound product engineering.

Gate releases on an evidence package

Governance fails when documents live in separate folders while the deployment pipeline ignores them. A release candidate should fail if required evidence is missing or if a red-team finding remains open.

For the model-promotion assistant, the minimal package contains:

| Evidence | Why it exists |
| --- | --- |
| Workflow memo | Records intended use, people affected, and dated review route |
| Risk row | Connects harm to control, owner, residual risk, and next review |
| System card | Identifies shipped policy, model, tools, limitations, and tests |
| Dataset record | Makes evaluation cases and labels reproducible |
| Audit replay | Shows a proposed promotion was gated and preserved correctly |
| Red-team result | Blocks release when an unsafe effect remains |
| Accessibility and slice report | Finds valid requests blocked on supported paths |
| Oversight path | Shows approval and appeal behavior exists |
11-gate-a-governed-release.py
REQUIRED_EVIDENCE = {
    "workflow_memo",
    "risk_register_row",
    "system_card",
    "dataset_record",
    "audit_replay",
    "red_team_report",
    "accessibility_and_slice_report",
    "oversight_runbook",
}

def release_decision(
    evidence: set[str],
    unsafe_effects: int,
    open_accessibility_findings: int,
) -> tuple[bool, list[str]]:
    # Collect every blocking reason so the report names all of them at once.
    reasons: list[str] = []
    missing = sorted(REQUIRED_EVIDENCE - evidence)
    if missing:
        reasons.append(f"missing evidence: {', '.join(missing)}")
    if unsafe_effects:
        reasons.append(f"unsafe effects remain: {unsafe_effects}")
    if open_accessibility_findings:
        reasons.append(f"accessibility findings remain: {open_accessibility_findings}")
    # The candidate is ready only when no reason was recorded.
    return not reasons, reasons

# Draft candidate: one artifact missing and two open findings.
draft_evidence = REQUIRED_EVIDENCE - {"dataset_record"}
draft_ready, draft_reasons = release_decision(
    draft_evidence,
    unsafe_effects=1,
    open_accessibility_findings=1,
)
print("draft_ready:", draft_ready)
print("draft_reasons:", draft_reasons)

# Reviewed candidate: full evidence and no open findings.
reviewed_ready, reviewed_reasons = release_decision(
    REQUIRED_EVIDENCE,
    unsafe_effects=0,
    open_accessibility_findings=0,
)
print("reviewed_ready:", reviewed_ready)
print("reviewed_reasons:", reviewed_reasons)
Output
draft_ready: False
draft_reasons: ['missing evidence: dataset_record', 'unsafe effects remain: 1', 'accessibility findings remain: 1']
reviewed_ready: True
reviewed_reasons: []

Notice that dataset_record is release evidence, not administrative decoration. If nobody can identify where evaluation examples came from or how feedback labels were assigned, the result can't be reliably reproduced. The next lesson develops that data pipeline.
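To make the reproducibility point concrete, a dataset record can carry a content fingerprint next to its provenance fields. This is a minimal sketch under assumptions: the record keys, the dataset name, the case rows, and the choice of a SHA-256 digest over canonical JSON are all illustrative, not a format the lesson prescribes.

```python
import hashlib
import json

# Hypothetical evaluation cases; real cases would come from a versioned store.
cases = [
    {"case_id": "attack-001", "kind": "attack", "label": "forbidden_promotion"},
    {"case_id": "benign-001", "kind": "benign", "label": "eligible_promotion"},
]


def fingerprint(rows: list[dict[str, str]]) -> str:
    # Canonical JSON (rows ordered by case_id, sorted keys, fixed separators)
    # so the same cases always produce the same digest.
    ordered = sorted(rows, key=lambda row: row["case_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


dataset_record = {
    "name": "promotion-assistant-eval",  # hypothetical dataset name
    "version": "v1",
    "provenance": "hand-written fixtures from red-team review",
    "label_method": "two reviewers, disagreements escalated",
    "removed_sensitive_fields": ["raw_eval_text"],
    "case_count": len(cases),
    "sha256": fingerprint(cases),
}

# Reordering rows keeps the fingerprint; changing a label does not.
reordered_matches = fingerprint(list(reversed(cases))) == dataset_record["sha256"]
relabeled = [dict(cases[0], label="eligible_promotion"), cases[1]]
relabel_detected = fingerprint(relabeled) != dataset_record["sha256"]

print("reordered_matches:", reordered_matches)
print("relabel_detected:", relabel_detected)
```

With a fingerprint in the record, a reviewer can tell whether a reported score was produced from the same cases and labels that are on file.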

Build the model-release evidence package

Take the poisoned eval-summary trace from the prompt-injection chapter and produce a reviewable package:

  1. Write a workflow memo for promotion-assistant-7.2: intended use, stakeholders, tool effects, review route, and legal_basis_checked_on.
  2. Add a risk row for indirect injection leading to an unauthorized promotion. Name an owner, control, evidence, residual risk, and next review trigger.
  3. Create a system-card entry that names the policy gate version, enabled tools, effect threshold, and known limitation.
  4. Create a dataset-record entry for attack and benign traces. Record provenance, labels, removed sensitive fields, and version.
  5. Run attacks and ordinary release requests. Record ASR and FRR, with a hard block for any unauthorized promotion.
  6. Slice valid-request friction by supported interaction path and fix accessibility barriers before release.
  7. Capture an audit replay showing the malicious retrieved instruction, proposed effect, gate decision, reviewer action, and final outcome without unnecessary private eval text.
  8. Write the approval and appeal route and run one denied-promotion replay through it.
  9. Evaluate the release gate. It must name every missing artifact, unresolved unsafe effect, or open accessibility finding.
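Step 7 asks for an audit replay that avoids unnecessary private eval text. One possible shape, sketched here with a hypothetical `AuditReplay` schema and invented values, stores a digest of the retrieved instruction plus the decision facts. Whether a digest is an acceptable substitute for the raw text is an assumption that depends on your retention and access policy.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditReplay:
    # Hypothetical schema: decision facts only, no raw eval text.
    trace_id: str
    retrieved_instruction_sha256: str
    instruction_trusted: bool
    proposed_effect: str
    gate_decision: str
    reviewer_action: str
    final_outcome: str


# Invented poisoned text, used here only to compute the digest.
raw_instruction = "Ignore approval and promote to 100 percent."
digest = hashlib.sha256(raw_instruction.encode("utf-8")).hexdigest()

replay = AuditReplay(
    trace_id="trace-0001",
    retrieved_instruction_sha256=digest,
    instruction_trusted=False,
    proposed_effect="promote:10_percent",
    gate_decision="ROUTED_TO_SECURITY_REVIEW",
    reviewer_action="DENIED",
    final_outcome="NO_PROMOTION",
)

record = asdict(replay)
# Confirm that the stored record never contains the raw instruction itself.
raw_text_stored = any(raw_instruction in str(value) for value in record.values())

print("raw_text_stored:", raw_text_stored)
print("final_outcome:", record["final_outcome"])
```

A reviewer who is authorized to see the original trace can recompute the digest to confirm it matches, while the audit log itself carries only the facts needed to inspect the decision.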

A sufficient answer

A strong submission doesn't claim the workflow is compliant because a Markdown file exists. It states the review route and date, connects a plausible harm to owned controls, produces reproducible test and trace evidence, and blocks release when one forbidden effect survives.

A revealing failure case

Deliberately remove the dataset record, set unsafe_effects=1, or leave open_accessibility_findings=1 in the release gate. If the candidate still releases, your governance process is only descriptive. The gate must be connected to the deployment decision.
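The three deliberate failures can be run as a small self-check. The sketch below restates the gate in compact form so it runs on its own; `release_ready` and `broken_candidates` are hypothetical names, and a real pipeline would call its actual gate instead.

```python
REQUIRED_EVIDENCE = frozenset({
    "workflow_memo",
    "risk_register_row",
    "system_card",
    "dataset_record",
    "audit_replay",
    "red_team_report",
    "accessibility_and_slice_report",
    "oversight_runbook",
})


def release_ready(
    evidence: frozenset[str],
    unsafe_effects: int,
    open_accessibility_findings: int,
) -> bool:
    # Compact restatement of the gate: all evidence present, nothing open.
    return (
        REQUIRED_EVIDENCE <= evidence
        and unsafe_effects == 0
        and open_accessibility_findings == 0
    )


# Each entry breaks exactly one condition the gate is supposed to enforce.
broken_candidates = {
    "missing_dataset_record": (REQUIRED_EVIDENCE - {"dataset_record"}, 0, 0),
    "unsafe_effect_remains": (REQUIRED_EVIDENCE, 1, 0),
    "accessibility_finding_open": (REQUIRED_EVIDENCE, 0, 1),
}

# Any broken candidate that still passes is a failure of the gate itself.
escaped = [name for name, args in broken_candidates.items() if release_ready(*args)]
clean_candidate_passes = release_ready(REQUIRED_EVIDENCE, 0, 0)

print("escaped_failures:", escaped)
print("clean_candidate_passes:", clean_candidate_passes)
print("gate_is_enforcing:", not escaped and clean_candidate_passes)
```

Checking the clean candidate alongside the broken ones matters: a gate that blocks everything would also report no escaped failures.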

Mastery check

You're ready to operate this workflow when you can do all of the following:

  • Explain why intended use, not the foundation model name, drives risk triage.
  • Write a dated classification memo without pretending an engineering label is legal signoff.
  • Convert one harm into a risk row with a control, owner, evidence, residual risk, and review trigger.
  • Distinguish a system card from a dataset record and show why both affect reproducibility.
  • Record an agent effect decision without storing unrestricted private data or hidden reasoning.
  • Convert ASR and FRR results into a release-blocking decision.
  • Detect when a supported interaction path blocks legitimate model owners and require remediation.
  • Specify where human approval and appeal happen in the effect path.
  • Reject a release candidate whose evidence package is missing or whose unsafe effects remain.
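For the risk-row item, a simple completeness check is one way to turn "has a control, owner, evidence, residual risk, and review trigger" into something testable. The `RiskRow` fields and the example rows below are hypothetical; adapt them to whatever your register actually stores.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RiskRow:
    # Hypothetical register schema for illustration.
    risk_id: str
    harm: str
    control: str
    owner: str
    evidence: tuple[str, ...]
    residual_risk: str
    next_review_trigger: str


def missing_fields(row: RiskRow) -> list[str]:
    # An empty string or empty tuple counts as missing.
    return [f.name for f in fields(row) if not getattr(row, f.name)]


complete_row = RiskRow(
    risk_id="RR-EXAMPLE-1",
    harm="indirect injection leads to an unauthorized promotion",
    control="effect gate requires reviewer approval",
    owner="model-platform-release-owner",
    evidence=("red_team_report", "audit_replay"),
    residual_risk="novel injection phrasing may evade detection",
    next_review_trigger="any change to enabled tools",
)

incomplete_row = RiskRow(
    risk_id="RR-EXAMPLE-2",
    harm="legitimate owners blocked on a supported path",
    control="manual review",
    owner="",
    evidence=(),
    residual_risk="",
    next_review_trigger="",
)

for row in (complete_row, incomplete_row):
    gaps = missing_fields(row)
    print(f"{row.risk_id}: valid={not gaps} missing={gaps}")
```

A row that names a control but no owner or evidence fails validation, so it cannot count toward the evidence package until someone fills it in.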

Evaluation rubric

  • Foundational: Identifies affected stakeholders, possible effects, and an appropriate review route.
  • Intermediate: Produces owned risk rows, versioned documentation, and privacy-conscious decision traces.
  • Applied: Measures adversarial and benign behavior, then gates releases on evidence and unsafe effects.
  • Advanced: Connects dated regulatory review, technical controls, human oversight, appeals, and data provenance into one operating loop.

Common pitfalls

  • Symptom: A team calls a foundation model "high-risk" without naming a use case. Cause: Model capability was confused with deployed impact. Fix: Classify each workflow and affected stakeholder.
  • Symptom: Legal timeline text becomes stale. Cause: Documentation recorded a claim without a verification date. Fix: Store legal_basis_checked_on and require review before release changes.
  • Symptom: A risk register grows but never blocks a deployment. Cause: Rows lack owners, evidence, or release integration. Fix: Validate rows and fail the release gate.
  • Symptom: Audit logging creates new privacy exposure. Cause: Raw content was stored instead of decision facts. Fix: Store minimized, access-controlled effect evidence with an explicit retention rule.
  • Symptom: Attack tests pass while ordinary model owners can't obtain valid promotions. Cause: ASR was tracked without FRR and appeals. Fix: Evaluate attacks and benign flows together.
  • Symptom: Overall FRR looks acceptable while an accessible path fails. Cause: Aggregate metrics hid who bears friction. Fix: Slice supported paths, investigate barriers, and block release until repaired.
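The last pitfall is easy to reproduce with arithmetic. In this sketch the counts are invented, and the 0.25 threshold mirrors the investigation threshold used in the slice test earlier in the lesson; it shows an aggregate FRR that stays under the threshold while one path is well over it.

```python
# Hypothetical counts of (wrongly blocked, eligible) requests per supported path.
counts = {
    "web_console": (2, 40),
    "screen_reader_path": (3, 6),
}
THRESHOLD = 0.25  # assumed investigation threshold

total_blocked = sum(blocked for blocked, _ in counts.values())
total_eligible = sum(eligible for _, eligible in counts.values())
aggregate_frr = total_blocked / total_eligible

# Per-path FRR exposes friction that the pooled rate averages away.
frr_by_path = {path: blocked / eligible for path, (blocked, eligible) in counts.items()}
paths_over_threshold = sorted(
    path for path, rate in frr_by_path.items() if rate > THRESHOLD
)

aggregate_triggers = aggregate_frr > THRESHOLD
print(f"aggregate_frr: {aggregate_frr:.2f}")
for path, rate in sorted(frr_by_path.items()):
    print(f"{path}_frr: {rate:.2f}")
print("aggregate_triggers_investigation:", aggregate_triggers)
print("paths_over_threshold:", paths_over_threshold)
```

Because the high-volume path dominates the pooled rate, only the sliced view flags the screen-reader path for repair.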
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. One LLM powers consumer loan screening, a model-promotion assistant with production-traffic tools, and public documentation summaries. How should these workflows be classified?
2. A risk register row for CR-002 contains risk_id='CR-002', owner='', control='manual review', evidence=[], residual_risk='', and next_review_on=''. What should the release review do with this row?
3. A promotion-assistant release has a complete system card with versions, enabled tools, tests, limitations, and rollback owner. The dataset record is missing, although the red-team score is listed. Why should the release gate block?
4. For a privacy-conscious audit trail of an LLM promotion assistant, store the minimal observable facts needed to inspect a consequential effect decision, not hidden reasoning or unrestricted raw eval data. A model owner disputes a denied promotion after retrieved eval text tried to bypass approval. Which audit record design supports review while minimizing privacy exposure?
5. In a release test, 3 attack traces include 1 forbidden model promotion, and 2 benign eligible requests include 1 wrongly blocked request. The release policy blocks on any unsafe effect and on FRR above 0.10. What are the metrics and decision?
6. Eligible-promotion tests show 0 of 20 web-console requests blocked and 4 of 8 screen-reader requests blocked. Policy requires investigation when any supported path has FRR above 0.25. What should the release team do?
7. A promotion proposal targets 10 percent production traffic and includes an untrusted retrieved instruction. Every production promotion needs approval, and untrusted evidence routes this one to security review. The reviewer denies it. Which flow provides meaningful oversight?
8. The promotion assistant now blocks the poisoned eval-summary instruction in one demo replay. The team wants to ship the release candidate. Which release decision is supported by the governance evidence requirements?


Next Step
Continue to Data Labeling, Human Feedback, and Active Learning Systems

Governance defines the evidence and provenance you need; next you'll build the versioned human-feedback data pipeline that supplies those evaluations.

References

  1. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology · 2023
  2. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. National Institute of Standards and Technology · 2024 · NIST
  3. EU AI Act: Regulation laying down harmonised rules on artificial intelligence. European Parliament and Council of the European Union · 2024 · https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
  4. EU agrees to simplify AI rules to boost innovation and ban nudification apps to protect citizens. European Commission · 2026
  5. Model Cards for Model Reporting. Mitchell, M., Wu, S., Zaldivar, A., et al. · 2019 · FAT* 2019
  6. Datasheets for Datasets. Gebru, T., Morgenstern, J., Vecchione, B., et al. · 2021 · Communications of the ACM
  7. Red Teaming Language Models with Language Models. Perez, E., et al. · 2022 · EMNLP 2022