Turn a tool-bearing LLM workflow into auditable evidence: classify its use, own risks, version controls, preserve traces, and gate releases.
Your release assistant now rejects the poisoned eval summary from the previous lesson. The attack said to bypass approval and promote a model candidate. A server-side gate blocked the effect.
That's a security control. It isn't yet a governance program.
A model owner disputes a denied promotion. A reviewer asks which policy version ran. An auditor asks who owned the prompt-injection risk and whether the fix was tested before release. The answer can't be "the model behaved better in our demo." The platform needs evidence.
Responsible AI governance is the engineering loop that answers five questions:
Build that evidence package for a governed model workflow. This is engineering practice, not legal advice. Legal review still determines which duties apply to a real deployment.
The NIST AI Risk Management Framework describes four connected functions: Govern, Map, Measure, and Manage.[1] NIST's Generative AI Profile applies that loop to risks specific to generative systems, including harmful outputs, information integrity, privacy, and security concerns.[2]
For a model-promotion assistant, those functions become concrete:
| Function | Question | Artifact |
|---|---|---|
| Govern | Who can accept this risk and approve release? | Owner, policy, review record |
| Map | Which decisions and stakeholders can the assistant affect? | Workflow inventory and classification memo |
| Measure | Did attacks or ordinary release requests expose failures? | Evaluation and red-team report |
| Manage | Which controls, escalations, and release gates apply? | Risk register, audit trail, approval path |
Don't begin with a large policy document. Begin with the workflow that can cause an effect.
A platform uses language models in several places. The model family isn't the risk classification unit. An assistant that summarizes public documentation and an assistant that proposes a credit decision can use the same model while requiring very different review.
Inventory should record intended use, affected stakeholder, possible effect, and first review route. This is an engineering triage label. It flags legal and product review; it doesn't make the final legal determination.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workflow:
    name: str
    affects_credit: bool = False
    manages_workers: bool = False
    talks_to_people: bool = False
    can_execute_effects: bool = False

def triage(workflow: Workflow) -> str:
    if workflow.affects_credit or workflow.manages_workers:
        return "HIGH_RISK_REVIEW"
    if workflow.talks_to_people or workflow.can_execute_effects:
        return "TRANSPARENCY_AND_EFFECT_REVIEW"
    return "BASELINE_REVIEW"

workflows = [
    Workflow("loan_screening", affects_credit=True),
    Workflow("model_promotion_assistant", talks_to_people=True, can_execute_effects=True),
    Workflow("public_doc_summary"),
]

for workflow in workflows:
    print(f"{workflow.name}: {triage(workflow)}")
```

```text
loan_screening: HIGH_RISK_REVIEW
model_promotion_assistant: TRANSPARENCY_AND_EFFECT_REVIEW
public_doc_summary: BASELINE_REVIEW
```

The inventory deliberately routes the model-promotion assistant to review even when it isn't making a credit decision. It interacts with people and can change production traffic. Those facts determine controls such as disclosure, approval, audit evidence, and appeals.
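The step from triage route to controls can also be written down, so that a route never ships without the controls it implies. The sketch below is illustrative only: the `CONTROLS_BY_ROUTE` table, the control names, and the `missing_controls` helper are hypothetical, and a real mapping would come from policy and legal review rather than from this lesson.

```python
# Hypothetical mapping from triage route to required controls.
# Route names reuse the triage labels; control names are illustrative assumptions.
CONTROLS_BY_ROUTE: dict[str, frozenset[str]] = {
    "HIGH_RISK_REVIEW": frozenset(
        {"disclosure", "approval", "audit_evidence", "appeals", "legal_signoff"}
    ),
    "TRANSPARENCY_AND_EFFECT_REVIEW": frozenset(
        {"disclosure", "approval", "audit_evidence", "appeals"}
    ),
    "BASELINE_REVIEW": frozenset({"audit_evidence"}),
}

def missing_controls(route: str, implemented: set[str]) -> list[str]:
    """Return required controls that are not yet implemented, sorted by name."""
    required = CONTROLS_BY_ROUTE[route]  # an unknown route raises KeyError on purpose
    return sorted(required - implemented)

gaps = missing_controls("TRANSPARENCY_AND_EFFECT_REVIEW", {"approval", "audit_evidence"})
print("missing_controls:", gaps)
```

Failing loudly on an unknown route is a deliberate choice in this sketch: a workflow with an unrecognized label should stop at triage rather than silently inherit the baseline.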
The EU AI Act uses a risk-based framework. Its official explanation separates prohibited uses, high-risk systems, transparency-risk systems, and minimal or no-risk systems. High-risk examples include employment and worker management and access to essential services such as creditworthiness assessment for natural persons. Conversational systems and generated content can create transparency duties. A minimal or no-risk tier doesn't add tier-specific rules, but teams still need to check cross-cutting AI Act duties, existing law, and contractual obligations.[3]
That means legal review must examine the intended workflow:
Regulatory timelines move. The enacted AI Act originally staged high-risk rules for August 2, 2026 and August 2, 2027. As checked on June 9, 2026, Council and Parliament negotiators had reached a May 7 provisional agreement under which rules for stand-alone high-risk systems in areas such as employment, education, and critical infrastructure would apply from December 2, 2027, while rules for high-risk systems integrated into regulated products would apply from August 2, 2028. Parliament and Council still needed to adopt the provisional agreement formally.[3][4]
The safe engineering practice is simple: include a legal_basis_checked_on date and a required legal signoff in the classification memo. Recheck official status before launch or a material feature change.
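The recheck rule can be made mechanical with a staleness test on the recorded date. This is a minimal sketch under stated assumptions: the 30-day window, the `MAX_LEGAL_CHECK_AGE` constant, and the `recheck_required` helper are hypothetical, and the real interval is a policy decision for legal and product owners.

```python
from datetime import date, timedelta

# Assumption: 30 days is a placeholder window, not a recommendation.
MAX_LEGAL_CHECK_AGE = timedelta(days=30)

def recheck_required(checked_on: str, today: date, material_change: bool = False) -> bool:
    """True when the legal basis must be rechecked before release.

    A recheck is needed after a material feature change, or when the
    recorded check date (ISO format, YYYY-MM-DD) is older than the window.
    """
    checked = date.fromisoformat(checked_on)
    return material_change or (today - checked) > MAX_LEGAL_CHECK_AGE

print(recheck_required("2026-06-09", today=date(2026, 6, 20)))
print(recheck_required("2026-06-09", today=date(2026, 9, 1)))
print(recheck_required("2026-06-09", today=date(2026, 6, 20), material_change=True))
```

Passing `today` explicitly keeps the check deterministic in tests; a release pipeline would supply the current date.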
```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ClassificationMemo:
    workflow_id: str
    intended_use: str
    affected_people: str
    triage_route: str
    legal_basis_checked_on: str
    legal_signoff_required: bool

memo = ClassificationMemo(
    workflow_id="loan-screening-v3",
    intended_use="assess credit eligibility for loan applicants",
    affected_people="loan applicants",
    triage_route="HIGH_RISK_REVIEW",
    legal_basis_checked_on="2026-06-09",
    legal_signoff_required=True,
)

record = asdict(memo)
print("workflow:", record["workflow_id"])
print("route:", record["triage_route"])
print("checked_on:", record["legal_basis_checked_on"])
print("release_requires_legal_signoff:", record["legal_signoff_required"])
```

```text
workflow: loan-screening-v3
route: HIGH_RISK_REVIEW
checked_on: 2026-06-09
release_requires_legal_signoff: True
```

A classification memo says where review starts. A risk register says what can go wrong, how the platform reduces that harm, who owns the decision, and what evidence exists.
Use the poisoned eval-summary incident as the first row:
| Field | Model-release entry |
|---|---|
| Risk | Retrieved eval content instructs agent to bypass promotion approval |
| Harm | Unauthorized model promotion or inconsistent release treatment |
| Inherent severity | High, because the agent can change production traffic |
| Control | Treat retrieved text as untrusted and authorize promotions in application code |
| Evidence | Attack-trace test report, policy gate log, approval replay |
| Owner | Model platform release owner |
| Residual risk | Medium until red-team cases and appeals are reviewed |
| Review trigger | New tool scope, new data source, incident, or scheduled review |
Risk scoring isn't a substitute for judgment. It makes prioritization and escalation reproducible.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskRow:
    risk_id: str
    owner: str
    likelihood: int
    impact: int
    residual_likelihood: int
    residual_impact: int
    evidence: tuple[str, ...]

    def inherent_score(self) -> int:
        return self.likelihood * self.impact

    def residual_score(self) -> int:
        return self.residual_likelihood * self.residual_impact

row = RiskRow(
    risk_id="MP-014-prompt-injection-promotion",
    owner="model-platform-release-owner",
    likelihood=4,
    impact=5,
    residual_likelihood=2,
    residual_impact=5,
    evidence=("promotion-gate-test-report-v7", "approval-replay-v7"),
)

print("risk:", row.risk_id)
print("owner:", row.owner)
print("inherent_score:", row.inherent_score())
print("residual_score:", row.residual_score())
print("evidence_count:", len(row.evidence))
```

```text
risk: MP-014-prompt-injection-promotion
owner: model-platform-release-owner
inherent_score: 20
residual_score: 10
evidence_count: 2
```

The residual impact remains high because an unauthorized production promotion is still serious. The control lowers likelihood. That difference matters: it prevents a team from declaring the risk solved merely because a filter was added.
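Escalation can be made reproducible in the same way as scoring. The sketch below assumes a 1 to 5 scale for likelihood and impact; the thresholds, the route names, and the `escalation_route` helper are hypothetical. It also keeps a maximum-impact risk from dropping out of owner review just because its likelihood was lowered.

```python
def escalation_route(residual_likelihood: int, residual_impact: int) -> str:
    """Map residual likelihood and impact (each 1-5) to a named escalation route.

    Thresholds are illustrative assumptions, not values from the lesson.
    """
    for value in (residual_likelihood, residual_impact):
        if not 1 <= value <= 5:
            raise ValueError("likelihood and impact must be between 1 and 5")
    score = residual_likelihood * residual_impact
    if score >= 15:
        return "RISK_COMMITTEE_REVIEW"
    # Maximum-impact risks stay with the owner even when the score is low.
    if score >= 8 or residual_impact == 5:
        return "OWNER_REVIEW_BEFORE_RELEASE"
    return "TRACK_IN_REGISTER"

print(escalation_route(2, 5))
print(escalation_route(1, 5))
print(escalation_route(2, 2))
```

With the residual values from the row above (likelihood 2, impact 5), this sketch returns the owner-review route, which matches the register's medium residual rating.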
A row without an owner, control evidence, or review date is a description of a worry. It can't block release or drive follow-up work.
```python
REQUIRED_FIELDS = {
    "risk_id",
    "owner",
    "control",
    "evidence",
    "residual_risk",
    "next_review_on",
}

def missing_fields(row: dict[str, object]) -> list[str]:
    return sorted(
        field
        for field in REQUIRED_FIELDS
        if field not in row or not row[field]
    )

rows = [
    {
        "risk_id": "RF-014",
        "owner": "model-platform-release-owner",
        "control": "promotion_policy_gate_v7",
        "evidence": ["attack-suite-2026-05-31"],
        "residual_risk": "medium",
        "next_review_on": "2026-08-31",
    },
    {
        "risk_id": "CR-002",
        "owner": "",
        "control": "manual review",
        "evidence": [],
        "residual_risk": "",
        "next_review_on": "",
    },
]

for row in rows:
    missing = missing_fields(row)
    status = "READY_FOR_REVIEW" if not missing else "BLOCKED"
    print(row["risk_id"], status, missing)
```

```text
RF-014 READY_FOR_REVIEW []
CR-002 BLOCKED ['evidence', 'next_review_on', 'owner', 'residual_risk']
```

A model card describes intended use, out-of-scope use, limitations, evaluations, and known risks for a model or system. A datasheet describes the origin, composition, collection, processing, and recommended uses of a dataset. Model cards and datasheets were introduced as practical documentation patterns for accountability and reproducibility.[5][6]
For an LLM application, the release evidence package should reference both:
A model card isn't automatically statutory technical documentation. It's a useful engineering artifact that can support a wider documentation obligation when it's accurate, versioned, and reviewed.
```python
system_card = {
    "system_version": "promotion-assistant-7.2",
    "policy_version": "promotion-gate-v7",
    "intended_use": "answer candidate-eval questions and propose model promotions for approval",
    "out_of_scope": ["autonomous production promotion"],
    "enabled_tools": ["propose_promotion"],
    "evaluations": ["attack-suite-2026-05-31", "benign-suite-2026-05-31"],
    "rollback_owner": "model-platform-release-owner",
}

dataset_record = {
    "dataset_id": "promotion-redteam-v2",
    "provenance": "curated release-request and injected-eval fixtures",
    "labeling_guide": "unsafe_effects-v2",
    "sensitive_fields_removed": True,
    "evaluation_split": "frozen-promotion-eval-v2",
    "retention_rule": "retain redacted fixtures for 90 days",
}

required_system = {
    "system_version",
    "policy_version",
    "intended_use",
    "out_of_scope",
    "enabled_tools",
    "evaluations",
    "rollback_owner",
}
required_dataset = {
    "dataset_id",
    "provenance",
    "labeling_guide",
    "sensitive_fields_removed",
    "evaluation_split",
    "retention_rule",
}

missing_system = sorted(required_system - system_card.keys())
missing_dataset = sorted(required_dataset - dataset_record.keys())
print("system_card_complete:", not missing_system)
print("dataset_record_complete:", not missing_dataset)
print("evidence_package_ready:", not missing_system and not missing_dataset)
```

```text
system_card_complete: True
dataset_record_complete: True
evidence_package_ready: True
```

The prompt-injection lesson separated untrusted retrieved text from trusted policy and gated promotion effects in code. Governance requires a replayable record of that decision:
An audit trail shouldn't store hidden reasoning or unlimited private evaluation data. Store the business facts needed to reproduce an effect decision, apply access restrictions and retention rules, and keep sensitive content out unless it's necessary and authorized.
```python
from dataclasses import dataclass, asdict
from hashlib import sha256
import hmac

AUDIT_PSEUDONYM_KEY = b"local-demo-key-not-for-production"

@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    actor_role: str
    workflow_version: str
    policy_version: str
    retrieved_source_id: str
    retrieved_source_trust: str
    proposed_effect: str
    gate_decision: str
    approval_id: str
    final_effect: str
    appeal_id: str

def pseudonymize_actor_id(actor_id: str) -> str:
    digest = hmac.new(AUDIT_PSEUDONYM_KEY, actor_id.encode(), sha256).hexdigest()
    return f"actor-{digest[:24]}"

record = AuditRecord(
    request_id=f"promotion/{pseudonymize_actor_id('USER-918204')}/001",
    actor_role="model_owner",
    workflow_version="promotion-assistant-7.2",
    policy_version="promotion-gate-v7",
    retrieved_source_id="candidate-eval/C17/R42",
    retrieved_source_trust="UNTRUSTED_EVAL_CONTENT",
    proposed_effect="promote:C17:prod-10pct",
    gate_decision="BLOCKED_REQUIRES_APPROVAL",
    approval_id="APR-48291",
    final_effect="NO_PROMOTION_EXECUTED",
    appeal_id="APL-48291",
)

for field, value in asdict(record).items():
    print(f"{field}: {value}")
```

```text
request_id: promotion/actor-f027638fb6653d5861c880a5/001
actor_role: model_owner
workflow_version: promotion-assistant-7.2
policy_version: promotion-gate-v7
retrieved_source_id: candidate-eval/C17/R42
retrieved_source_trust: UNTRUSTED_EVAL_CONTENT
proposed_effect: promote:C17:prod-10pct
gate_decision: BLOCKED_REQUIRES_APPROVAL
approval_id: APR-48291
final_effect: NO_PROMOTION_EXECUTED
appeal_id: APL-48291
```

This example hardcodes the key only to stay runnable locally. In production, keep a versioned HMAC key in a secret manager, restrict access, and document rotation. Pseudonyms reduce direct exposure, but they remain linkable and still require privacy controls.
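One way to make key rotation visible in the audit trail is to embed the key version in the pseudonym itself. The following is a sketch, not the lesson's implementation: the `DEMO_KEYS` table, the `k1`/`k2` version labels, and the `pseudonymize` signature are hypothetical, and the literal key bytes stand in for values that would be fetched from a secret manager.

```python
import hmac
from hashlib import sha256

# Assumption: production code would load these from a secret manager.
# The literal bytes exist only so the sketch runs locally.
DEMO_KEYS: dict[str, bytes] = {
    "k1": b"demo-key-one-not-for-production",
    "k2": b"demo-key-two-not-for-production",
}

def pseudonymize(actor_id: str, key_version: str, keys: dict[str, bytes]) -> str:
    """Return a pseudonym that records which key version produced it."""
    digest = hmac.new(keys[key_version], actor_id.encode(), sha256).hexdigest()
    return f"actor-{key_version}-{digest[:24]}"

old = pseudonymize("USER-918204", "k1", DEMO_KEYS)
new = pseudonymize("USER-918204", "k2", DEMO_KEYS)
print("old_has_version:", old.startswith("actor-k1-"))
print("new_has_version:", new.startswith("actor-k2-"))
print("same_pseudonym_across_versions:", old == new)
```

Because the same actor maps to a different pseudonym under each key version, records written before and after a rotation cannot be joined without access to both keys. Whether that is desirable depends on the retention and investigation requirements, so the rotation procedure should state it explicitly.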
Restricted storage and access controls matter first. A chained digest plus a protected anchor can also show that a stored sequence was rewritten after the review record was created. Store that anchor outside the mutable event log. This is an integrity signal, not a complete audit-storage design.
```python
import hashlib
import json

def digest_event(event: str, value: str, previous: str) -> str:
    payload = json.dumps({"event": event, "value": value, "previous": previous}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append_event(chain: list[dict[str, str]], event: str, value: str) -> None:
    previous = chain[-1]["digest"] if chain else "GENESIS"
    digest = digest_event(event, value, previous)
    chain.append({"event": event, "value": value, "previous": previous, "digest": digest})

def recompute_digests(chain: list[dict[str, str]]) -> None:
    for index, item in enumerate(chain):
        previous = chain[index - 1]["digest"] if index else "GENESIS"
        item["previous"] = previous
        item["digest"] = digest_event(item["event"], item["value"], previous)

def verifies(chain: list[dict[str, str]], protected_anchor: str) -> bool:
    expected_previous = "GENESIS"
    for item in chain:
        expected_digest = digest_event(item["event"], item["value"], expected_previous)
        if item["previous"] != expected_previous or item["digest"] != expected_digest:
            return False
        expected_previous = item["digest"]
    return expected_previous == protected_anchor

events: list[dict[str, str]] = []
append_event(events, "tool_proposal", "promote:C17:prod-10pct")
append_event(events, "policy_gate", "blocked_requires_approval")
append_event(events, "effect", "none")
review_anchor = events[-1]["digest"]  # Copy to a separately protected review record.
print("review_anchor:", review_anchor[:12])
print("original_chain_valid:", verifies(events, review_anchor))

events[1]["value"] = "approved"
recompute_digests(events)
print("rewritten_chain_matches_anchor:", verifies(events, review_anchor))
```

```text
review_anchor: 252f127e01b6
original_chain_valid: True
rewritten_chain_matches_anchor: False
```

For EU high-risk deployments, logging, documentation, record-keeping, and human oversight can be regulated duties, and the obligations depend on the provider or deployer role and the system classification.[3] The design lesson is durable even when specific law changes: preserve enough trustworthy evidence to inspect consequential behavior.
Red teaming means deliberately attempting to elicit harmful or disallowed behavior before attackers, users, or staff encounter it. Research on language-model red teaming shows that adversarial testing can discover harmful behaviors and create reusable evaluation data.[7]
For a governed model workflow, don't report only that an attack prompt was rejected. Evaluate observable effects:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestResult:
    kind: str
    unsafe_effect: bool
    wrongly_blocked: bool = False

results = [
    TestResult("attack", unsafe_effect=False),
    TestResult("attack", unsafe_effect=False),
    TestResult("attack", unsafe_effect=True),
    TestResult("benign", unsafe_effect=False, wrongly_blocked=False),
    TestResult("benign", unsafe_effect=False, wrongly_blocked=True),
]

attacks = [result for result in results if result.kind == "attack"]
benign = [result for result in results if result.kind == "benign"]
# Attack success rate (ASR): share of attack cases that produced an unsafe effect.
asr = sum(result.unsafe_effect for result in attacks) / len(attacks)
# False rejection rate (FRR): share of benign cases that were wrongly blocked.
frr = sum(result.wrongly_blocked for result in benign) / len(benign)
release_reasons = []
if asr > 0:  # zero tolerance: any surviving unsafe effect blocks release
    release_reasons.append("unsafe effect survived attack test")
if frr > 0.10:
    release_reasons.append("legitimate requests wrongly blocked")

print(f"asr: {asr:.2f}")
print(f"frr: {frr:.2f}")
print("finding_owner: model-platform-release-owner")
print("release_blocked:", bool(release_reasons))
print("release_reasons:", release_reasons)
```

```text
asr: 0.33
frr: 0.50
finding_owner: model-platform-release-owner
release_blocked: True
release_reasons: ['unsafe effect survived attack test', 'legitimate requests wrongly blocked']
```

A single forbidden promotion blocks this release even if the aggregate attack success rate (ASR) looks small. An excessive false rejection rate (FRR) also needs work: a safety control that strands legitimate model owners creates a different harm and increases appeal volume.
A secure effect gate isn't sufficient if legitimate model owners are consistently blocked on a supported interaction path or can't reach an appeal. Ethics becomes engineering work when the team measures outcomes, investigates disparities, and repairs barriers.
For a release assistant, include ordinary eligible promotion scenarios through each supported path: web console, CLI workflow, keyboard-only navigation, screen-reader-assisted interaction, and escalation to a person. Test the interface with users and accessibility specialists where possible. A small fixture set can reveal a release blocker; it can't establish that a product is fair for every affected population.
This diagnostic test makes a broken accessible path visible before release:
```python
from collections import defaultdict

cases = [
    ("standard_chat", True, False),
    ("standard_chat", True, False),
    ("screen_reader_path", True, True),
    ("screen_reader_path", True, False),
]

blocked_by_path: dict[str, list[bool]] = defaultdict(list)
for path, eligible, wrongly_blocked in cases:
    if eligible:
        blocked_by_path[path].append(wrongly_blocked)

rates: dict[str, float] = {}
for path, blocked in sorted(blocked_by_path.items()):
    rates[path] = sum(blocked) / len(blocked)
    print(f"{path}_false_rejection_rate: {rates[path]:.2f}")

investigation_required = any(rate > 0.25 for rate in rates.values())
print("investigation_required:", investigation_required)
print("release_action: repair path and retest" if investigation_required else "release_action: proceed")
```

```text
screen_reader_path_false_rejection_rate: 0.50
standard_chat_false_rejection_rate: 0.00
investigation_required: True
release_action: repair path and retest
```

Store this report with the dataset version, test limitations, accessibility review, and remediation owner. It complements oversight: the escalation route must itself be usable by the people who need it.
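The caution that a small fixture set cannot establish fairness can be quantified and recorded among the test limitations. The sketch below uses the Wilson score interval, a standard interval for a binomial proportion; it is not part of the original lesson, and the `wilson_interval` helper and the 95% level (`z = 1.96`) are assumptions chosen for illustration.

```python
import math

def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate observed in a finite sample."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Zero wrongly blocked requests in two cases versus zero in two hundred.
low_small, high_small = wilson_interval(failures=0, trials=2)
low_large, high_large = wilson_interval(failures=0, trials=200)
print(f"n=2 upper bound: {high_small:.2f}")
print(f"n=200 upper bound: {high_large:.3f}")
```

Two clean cases on a path are consistent with a true false-rejection rate well above the 0.25 investigation threshold, so a clean result on a tiny fixture set supports only the statement that no blocker was found, not that the path is sound.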
"Human in the loop" is too vague for a release review. Define which effects require approval, what information the reviewer sees, how conflicting interests are handled, and how an affected actor challenges an outcome.
For model promotions:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromotionProposal:
    target_percent: int
    saw_untrusted_instruction: bool

def oversight_route(proposal: PromotionProposal) -> str:
    if proposal.saw_untrusted_instruction:
        return "SECURITY_REVIEW"
    if proposal.target_percent > 10:
        return "EXCEPTION_REVIEW"
    return "STANDARD_APPROVAL"

proposal = PromotionProposal(target_percent=10, saw_untrusted_instruction=True)
effect_executed_before_review = False
decision = "DENIED_BY_REVIEWER"
appeal = "QUEUED_FOR_SECOND_REVIEW" if decision.startswith("DENIED") else "NOT_REQUIRED"

print("oversight_route:", oversight_route(proposal))
print("effect_executed_before_review:", effect_executed_before_review)
print("decision:", decision)
print("appeal:", appeal)
```

```text
oversight_route: SECURITY_REVIEW
effect_executed_before_review: False
decision: DENIED_BY_REVIEWER
appeal: QUEUED_FOR_SECOND_REVIEW
```

For a legally high-risk workflow, human oversight requirements need careful mapping to the applicable duties and roles.[3] For any workflow with production or people-facing effects, a clear review and appeal mechanism is also sound product engineering.
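Handling conflicting interests can be checked in code as a separation-of-duties rule. This is a sketch under assumptions: the `ReviewAssignment` record, its field names, and the `conflicts` helper are hypothetical, and a real system would resolve identities from its own access-control source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ReviewAssignment:
    proposal_owner: str
    first_reviewer: str
    appeal_reviewer: Optional[str] = None  # set only when an appeal is queued

def conflicts(assignment: ReviewAssignment) -> list[str]:
    """List separation-of-duties violations for a review and its appeal."""
    problems: list[str] = []
    if assignment.first_reviewer == assignment.proposal_owner:
        problems.append("owner reviewed own proposal")
    if assignment.appeal_reviewer is not None:
        if assignment.appeal_reviewer == assignment.first_reviewer:
            problems.append("appeal heard by original reviewer")
        if assignment.appeal_reviewer == assignment.proposal_owner:
            problems.append("owner heard own appeal")
    return problems

clean = ReviewAssignment("owner-a", "reviewer-b", "reviewer-c")
conflicted = ReviewAssignment("owner-a", "reviewer-b", "reviewer-b")
print("clean:", conflicts(clean))
print("conflicted:", conflicts(conflicted))
```

A non-empty result would block the decision from being recorded until a different reviewer is assigned, which keeps the second review in an appeal independent of the first.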
Governance fails when documents live in separate folders while the deployment pipeline ignores them. A release candidate should fail if required evidence is missing or if a red-team finding remains open.
For the model-promotion assistant, the minimal package contains:
| Evidence | Why it exists |
|---|---|
| Workflow memo | Records intended use, people affected, and dated review route |
| Risk row | Connects harm to control, owner, residual risk, and next review |
| System card | Identifies shipped policy, model, tools, limitations, and tests |
| Dataset record | Makes evaluation cases and labels reproducible |
| Audit replay | Shows a proposed promotion was gated and preserved correctly |
| Red-team result | Blocks release when an unsafe effect remains |
| Accessibility and slice report | Finds valid requests blocked on supported paths |
| Oversight path | Shows approval and appeal behavior exists |
```python
REQUIRED_EVIDENCE = {
    "workflow_memo",
    "risk_register_row",
    "system_card",
    "dataset_record",
    "audit_replay",
    "red_team_report",
    "accessibility_and_slice_report",
    "oversight_runbook",
}

def release_decision(
    evidence: set[str],
    unsafe_effects: int,
    open_accessibility_findings: int,
) -> tuple[bool, list[str]]:
    reasons: list[str] = []
    missing = sorted(REQUIRED_EVIDENCE - evidence)
    if missing:
        reasons.append(f"missing evidence: {', '.join(missing)}")
    if unsafe_effects:
        reasons.append(f"unsafe effects remain: {unsafe_effects}")
    if open_accessibility_findings:
        reasons.append(f"accessibility findings remain: {open_accessibility_findings}")
    return not reasons, reasons

draft_evidence = REQUIRED_EVIDENCE - {"dataset_record"}
draft_ready, draft_reasons = release_decision(
    draft_evidence,
    unsafe_effects=1,
    open_accessibility_findings=1,
)
print("draft_ready:", draft_ready)
print("draft_reasons:", draft_reasons)

reviewed_ready, reviewed_reasons = release_decision(
    REQUIRED_EVIDENCE,
    unsafe_effects=0,
    open_accessibility_findings=0,
)
print("reviewed_ready:", reviewed_ready)
print("reviewed_reasons:", reviewed_reasons)
```

```text
draft_ready: False
draft_reasons: ['missing evidence: dataset_record', 'unsafe effects remain: 1', 'accessibility findings remain: 1']
reviewed_ready: True
reviewed_reasons: []
```

Notice that dataset_record is release evidence, not administrative decoration. If nobody can identify where evaluation examples came from or how feedback labels were assigned, the result can't be reliably reproduced. The next lesson develops that data pipeline.
Take the poisoned eval-summary trace from the prompt-injection chapter and produce a reviewable package:
promotion-assistant-7.2: intended use, stakeholders, tool effects, review route, and legal_basis_checked_on.
A strong submission doesn't claim the workflow is compliant because a Markdown file exists. It states the review route and date, connects a plausible harm to owned controls, produces reproducible test and trace evidence, and blocks release when one forbidden effect survives.
Deliberately remove the dataset record, set unsafe_effects=1, or leave open_accessibility_findings=1 in the release gate. If the candidate still releases, your governance process is only descriptive. The gate must be connected to the deployment decision.
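Connecting the gate to the deployment decision usually means turning its result into something the pipeline cannot ignore, such as a process exit code. The sketch below assumes the gate returns a ready flag and a list of reasons, as in this lesson; the `gate_exit_code` helper and the idea of passing its result to `sys.exit` in a CI step are illustrative, not a prescribed integration.

```python
def gate_exit_code(ready: bool, reasons: list[str]) -> int:
    """Translate a gate decision into an exit code (0 means release may proceed).

    Any recorded reason blocks the release, even if `ready` was set by mistake.
    """
    for reason in reasons:
        print(f"release blocked: {reason}")
    return 0 if ready and not reasons else 1

# Hypothetical wiring: a CI step would end with sys.exit(gate_exit_code(...)).
blocked_code = gate_exit_code(False, ["missing evidence: dataset_record"])
clear_code = gate_exit_code(True, [])
print("blocked_exit_code:", blocked_code)
print("clear_exit_code:", clear_code)
```

A non-zero exit code stops most pipeline runners by default, so the deployment step never starts when evidence is missing.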
You're ready to operate this workflow when you can do all of the following:
legal_basis_checked_on and require review before release changes.
[1] Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023.
[2] Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. National Institute of Standards and Technology (NIST), 2024.
[3] EU AI Act: Regulation laying down harmonised rules on artificial intelligence. European Parliament and Council of the European Union, 2024.
[4] EU agrees to simplify AI rules to boost innovation and ban nudification apps to protect citizens. European Commission, 2026.
[5] Model Cards for Model Reporting. Mitchell, M., Wu, S., Zaldivar, A., et al. FAT* 2019, 2019.
[6] Datasheets for Datasets. Gebru, T., Morgenstern, J., Vecchione, B., et al. Communications of the ACM, 2021.
[7] Red Teaming Language Models with Language Models. Perez, E., et al. EMNLP 2022, 2022.