Master advanced MLOps and DevOps patterns for LLM systems: GitOps for prompts and models, feature stores for embedding features, automated rollback on eval regression, shadow traffic, and model registries with rollback metadata.
Reasoning models can spend more inference compute on hard requests. That makes releases trickier: a prompt, router, judge, feature, or model alias change can alter cost and quality even when the base weights stay fixed. Advanced MLOps (Machine Learning Operations) and DevOps (development and operations) for AI are the release and recovery practices that keep those moving pieces auditable.
Your DeployBuddy release assistant has been stable for weeks on prompt version [email protected] and Low-Rank Adaptation (LoRA) adapter v1.9. A teammate adds a small improvement: the assistant now pulls recent failure count and average queue time from a feature store, a platform for versioned model inputs used during evaluation and serving. The extra values should improve SLA estimates during regional load spikes. The change passes the offline golden-set evaluation, a fixed collection of reviewed requests and expected behavior, plus the safety scanner and latency budget. It's promoted through staging, a 2% canary, and finally the production alias.
Forty-eight hours later, users in the western region start receiving SLA estimates that are off by a full day, even though the base model weights, system prompt, and LoRA adapter are unchanged. The answers degraded because another release input changed.
Silent training-serving skew in release_pattern_embedding caused the failure. The batch job that built the golden set chunked and normalized release-history text one way, but the online feature materialization path used a different preprocessing configuration. Although the model registry recorded the adapter and prompt version, it had no record of the exact feature computation graph or the timestamped feature values present during evaluation. Basic model versioning wasn't enough once features and retrieval logic entered the picture.
Earlier in Model Versioning & Continuous Deployment, you learned about immutable artifacts, aliases, canary traffic, and automated evaluation gates[1][2]. Those ideas handle the model release. Advanced MLOps and DevOps for AI adds the data, prompt, feature, eval, and rollback layers around it, which is where much of the hidden technical debt in production ML systems accumulates[3].
Mature LLM teams handle that with GitOps-driven pipelines for prompts and models, feature stores that keep embeddings consistent, safe shadow and canary experiments for prompts and models, and observability wired directly into automated rollback policies.
Classic MLOps already deals with data drift, feature stores, reproducibility, and probabilistic models. LLMOps extends that discipline because application behavior can also change through prompts, retrieval indexes, tool schemas, judges, routing, and generation settings, even when model weights stay fixed.[4][3]
| Dimension | Classic MLOps | LLMOps |
|---|---|---|
| Core artifact | Model binary, feature store | Prompts, RAG configs, embeddings, eval sets, plus the model |
| Output behavior | Often scored against labeled targets | Open-ended or sampled output often needs rubric-based scoring |
| Tests | Unit tests plus task metrics | Those tests plus semantic eval, golden sets, and safety checks |
| CI cost | Data and model evaluation can be expensive | Generation-based evals add token, latency, and judge costs |
| Monitoring | Accuracy, data drift, service health | Groundedness, refusals, policy violations, cost, plus drift |
Three ideas drive the differences:
Mature teams operationalize those three ideas with GitOps to version every input, feature stores to stop silent skew, eval-gated CI/CD, progressive delivery, and observability wired into automated rollback.
GitOps treats your prompts, feature definitions, model aliases, and deployment configuration like application code.[5] The desired state lives in a git repository and ordinary promotions flow through reviewed changes. A production design still needs an audited break-glass path for incidents, followed by reconciliation back into git.
A typical LLM GitOps repository looks like this:
```text
llm-ops-repo/
├── prompts/
│   ├── support/
│   │   ├── [email protected]
│   │   ├── [email protected]
│   │   └── tests/
│   │       └── support_golden.jsonl
│   └── policy/
│       └── [email protected]
├── features/
│   └── release_history/
│       ├── feature_view.py       # Feast-style definition
│       └── entity_keys.yaml
├── models/
│   └── deploy-assistant/
│       ├── [email protected]
│       ├── [email protected]       # points to registry URI + hash
│       └── promotion-policy.yaml
├── deployments/
│   └── production/
│       └── aliases.yaml          # production -> [email protected]
└── .github/workflows/
    └── llm-promote.yml
```

The pipeline on merge does the following:
Because desired state lives in git, an ordinary rollback is a revert commit or an alias-manifest change followed by the pipeline. An emergency controller may revert traffic before that commit lands, but it must emit an audit record and reconcile desired state so Git doesn't immediately re-apply the failed candidate.
Model registries such as MLflow support aliases that can be reassigned independently of production code. The runtime can load models:/deploy-assistant@production while the registry points that alias to a different immutable version[6].
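The alias indirection can be illustrated without any registry client. The following is a minimal in-memory sketch, not the MLflow API: `AliasRegistry`, `register`, `set_alias`, and `resolve` are hypothetical names, and the artifact hashes are placeholders. It only aims to show that versions stay immutable while an alias moves, so a rollback amounts to pointing the alias at an earlier version.

```python
from dataclasses import dataclass, field


@dataclass
class AliasRegistry:
    """Hypothetical in-memory stand-in for a model registry with aliases."""

    versions: dict[int, str] = field(default_factory=dict)         # version -> artifact hash
    aliases: dict[str, int] = field(default_factory=dict)          # alias -> version
    history: list[tuple[str, int]] = field(default_factory=list)   # audit trail of alias moves

    def register(self, artifact_hash: str) -> int:
        # Versions are append-only: a new artifact always gets a new number.
        version = len(self.versions) + 1
        self.versions[version] = artifact_hash
        return version

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version
        self.history.append((alias, version))

    def resolve(self, alias: str) -> tuple[int, str]:
        version = self.aliases[alias]
        return version, self.versions[version]


registry = AliasRegistry()
v1 = registry.register("sha256:aaa")   # placeholder hashes
v2 = registry.register("sha256:bbb")
registry.set_alias("production", v1)
registry.set_alias("production", v2)   # promotion: the alias moves, v1 is untouched
registry.set_alias("production", v1)   # rollback: point the alias back
print(registry.resolve("production"))
print(registry.history)
```

Keeping the alias history alongside the versions is a design choice in this sketch: it gives the rollback an audit trail without mutating any stored artifact.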
Putting prompt Markdown files in git is the easy part; the pull request must evaluate against the same feature definitions, golden sets, guardrails, and alias state that production will use.
```python
import hashlib
import json

def release_id(release: dict[str, str]) -> str:
    payload = json.dumps(release, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:10]

baseline = {"prompt": "[email protected]", "model": "[email protected]", "feature_view": "releases@4"}
candidate = baseline | {"feature_view": "releases@5"}
print(f"baseline={release_id(baseline)} candidate={release_id(candidate)}")
print(f"same_release={release_id(baseline) == release_id(candidate)}")
```

```text
baseline=698df6d7ad candidate=f91be2a454
same_release=False
```

The changed feature view creates a new composed release even though prompt and model identifiers are unchanged.
```python
git_desired_alias = "[email protected]"
runtime_alias = "[email protected]"
break_glass_event = {"reason": "latency breach", "actor": "rollout-controller"}

needs_reconciliation = git_desired_alias != runtime_alias
print(f"runtime_alias={runtime_alias} audited={bool(break_glass_event)}")
print(f"open_revert_commit={needs_reconciliation}")
```

```text
[email protected] audited=True
open_revert_commit=True
```
Embedding pipelines are a common source of training-serving skew in modern LLM systems. A RAG pipeline or a personalized prompt may depend on dozens of embedding features: user history summary vectors, document chunk embeddings, category affinity vectors, and time-decayed engagement signals.
A feature store (such as Feast, Tecton, or a custom layer around Redis, object storage, and a vector database)[7] gives you mechanisms for three problems, provided the definitions and materialization paths are versioned and tested:
Feature platforms such as Feast can provide offline stores, online stores, and point-in-time feature retrieval. Treat the feature platform as one part of the lineage system, not a substitute for versioning. If your RAG stack uses a separate vector database, the vector index version, corpus snapshot, embedding model, chunker, and normalization hash still need to be versioned beside the feature view.
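One way to enforce that lineage is a promotion check that refuses a release manifest with missing retrieval fields. This is a sketch under assumptions: the field names mirror the items listed above but are hypothetical, as is the `missing_lineage` helper and the index identifier.

```python
# Hypothetical field names for the lineage items a RAG release should pin.
REQUIRED_LINEAGE_FIELDS = (
    "feature_view",
    "vector_index_version",
    "corpus_snapshot",
    "embedding_model",
    "chunker",
    "normalization_hash",
)


def missing_lineage(manifest: dict[str, str]) -> list[str]:
    """List required lineage fields that are absent or empty in a release manifest."""
    return [name for name in REQUIRED_LINEAGE_FIELDS if not manifest.get(name)]


manifest = {
    "feature_view": "releases@5",
    "embedding_model": "embed@2",
    "chunker": "sentence@3",
    "vector_index_version": "idx-example-1",  # hypothetical identifier
}
gaps = missing_lineage(manifest)
print(f"missing={gaps}")
print(f"promotable={not gaps}")
```

In this example the manifest omits the corpus snapshot and normalization hash, so the check reports both and blocks promotion.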
In practice, your feature view for release history might look like this conceptual Feast-style sketch:
```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Array, Float32, Int64

# Entity that keys every feature row; the join key matches the serving request.
release_history = Entity(name="user_id", join_keys=["user_id"])

# Assumed batch source so the sketch is complete; the path and timestamp
# column are hypothetical placeholders, not part of the original definition.
release_history_source = FileSource(
    path="data/release_history.parquet",
    timestamp_field="event_timestamp",
)

release_history_view = FeatureView(
    name="release_history_embeddings",
    entities=[release_history],
    ttl=timedelta(days=90),
    schema=[
        Field(name="recent_failure_count", dtype=Int64),
        Field(name="avg_queue_minutes_30d", dtype=Float32),
        Field(name="release_pattern_embedding", dtype=Array(Float32)),
    ],
    source=release_history_source,
)
```

At serving time the LLM gateway calls the feature store with the current user_id and request timestamp. The store returns online feature values or stored vectors associated with that entity. Historical retrieval for evaluation should be point-in-time correct; online/offline parity still needs tests and logged fingerprints for any request-time transformation.
Without a declared feature layer and parity checks, teams easily copy embedding code between training notebooks and serving services. Over time the two copies can diverge and quality can degrade as in the DeployBuddy story.
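A parity test can compare the values themselves, not only the configuration that produced them. The sketch below assumes a toy two-dimensional embedding and a serving path that skips L2 normalization; `l2_normalize`, `max_abs_diff`, and the tolerance are illustrative choices, not part of any feature-store API.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest per-dimension difference between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return max(abs(x - y) for x, y in zip(a, b))


raw = [3.0, 4.0]                      # toy embedding for one entity
offline_vector = l2_normalize(raw)    # batch path normalizes
online_vector = raw                   # serving path skips normalization

drift = max_abs_diff(offline_vector, online_vector)
tolerance = 1e-6                      # assumed tolerance for float noise
print(f"max_abs_diff={drift:.3f}")
print(f"parity_ok={drift <= tolerance}")
```

Running a check like this on a sample of entities in CI, and again against the online store after materialization, is one way to catch the divergence before it reaches users.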
```python
from datetime import datetime

event_time = datetime.fromisoformat("2026-04-10T12:00:00")
feature_history = [
    (datetime.fromisoformat("2026-04-09T09:00:00"), 4),
    (datetime.fromisoformat("2026-04-10T09:00:00"), 5),
    (datetime.fromisoformat("2026-04-11T09:00:00"), 8),
]
eligible = [row for row in feature_history if row[0] <= event_time]
timestamp, recent_failure_count = max(eligible)
print(f"event_date={event_time.date()} feature_date={timestamp.date()}")
print(f"recent_failure_count={recent_failure_count}")
```

```text
event_date=2026-04-10 feature_date=2026-04-10
recent_failure_count=5
```

```python
import hashlib
import json

def fingerprint(config: dict[str, str]) -> str:
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]

offline = {"chunker": "sentence@3", "embedding": "embed@2", "normalize": "l2"}
online = {"chunker": "sentence@3", "embedding": "embed@2", "normalize": "none"}
print(f"offline={fingerprint(offline)} online={fingerprint(online)}")
print(f"parity={fingerprint(offline) == fingerprint(online)}")
```

```text
offline=ce667443 online=660645c4
parity=False
```

Prompts deserve the same release discipline as model weights. A one-line system-prompt change can materially shift refusal rates, tone, or factual grounding without changing weights. Prompts therefore live in their own registry with semantic versions, test suites, and promotion policies.
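A prompt registry with semantic versions could enforce two properties: published versions are never overwritten, and versions compare numerically instead of as strings. The following is a hypothetical sketch; `PromptRegistry`, `parse_semver`, and the prompt texts are assumptions for illustration and do not describe a specific tool.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into integers so 2.10.0 sorts after 2.9.0."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


class PromptRegistry:
    """Hypothetical append-only registry: a published version is never overwritten."""

    def __init__(self) -> None:
        self._published: dict[tuple[str, tuple[int, int, int]], str] = {}

    def publish(self, name: str, version: str, text: str) -> None:
        key = (name, parse_semver(version))
        if key in self._published:
            raise ValueError(f"{name} {version} is already published; bump the version")
        self._published[key] = text

    def latest(self, name: str) -> str:
        versions = [v for (n, v) in self._published if n == name]
        return ".".join(str(part) for part in max(versions))


prompts = PromptRegistry()
prompts.publish("support", "2.9.0", "You are a release assistant.")
prompts.publish("support", "2.10.0", "You are a careful release assistant.")

try:
    prompts.publish("support", "2.10.0", "Silent edit of a published prompt.")
    overwrite_blocked = False
except ValueError:
    overwrite_blocked = True

print(f"latest={prompts.latest('support')}")
print(f"overwrite_blocked={overwrite_blocked}")
```

Rejecting the overwrite is what makes a prompt version usable as a rollback target: the text behind an identifier cannot change after evaluation.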
A production prompt release manifest can contain[8]:
Because evals call the model, they are slower and more expensive than ordinary unit tests. Teams tier the gates so checks without model calls fail fast before any paid generation runs[4]:
Only after both gates pass does CI publish [email protected] and update the candidate alias.
```python
baseline = {"quality": 0.84, "tokens": 620, "p95_ms": 820}
candidate = {"quality": 0.87, "tokens": 790, "p95_ms": 970}
limits = {"max_token_increase": 0.15, "max_p95_ms": 1000}

token_increase = candidate["tokens"] / baseline["tokens"] - 1
quality_improved = candidate["quality"] > baseline["quality"]
passes = (
    quality_improved
    and token_increase <= limits["max_token_increase"]
    and candidate["p95_ms"] <= limits["max_p95_ms"]
)
print(f"quality_improved={quality_improved} token_increase={token_increase:.1%}")
print(f"promote={passes}")
```

```text
quality_improved=True token_increase=27.4%
promote=False
```

The same pipeline supports shadow traffic for prompts. The gateway receives the live request, calls the production prompt version for the user, and in parallel (or on a sampled fraction) calls the candidate prompt version against the same model and features. Shadow responses are logged but never shown to users. After enough shadow requests for the predefined decision rule, the team compares judge scores, latency, and cost before deciding whether to start a small sticky canary.
Shadow traffic is especially useful for prompts because a bad shared template can still have a broad blast radius. The cost is still real, especially for reasoning models, but it's bounded and easier to justify than showing untested answers to customers.
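The shadow path can be sketched with `asyncio`: the user-facing call is awaited, while the candidate call runs as a background task whose result is only logged. `call_model`, `record_shadow`, and `handle_request` are hypothetical stand-ins for gateway code, and the model call is simulated.

```python
import asyncio
import random

shadow_log: list[dict[str, str]] = []
background: set[asyncio.Task] = set()  # keeps references so tasks are not garbage collected


async def call_model(prompt_version: str, request: str) -> str:
    """Stand-in for a gateway call; a real one would hit the serving endpoint."""
    await asyncio.sleep(0)
    return f"[{prompt_version}] answer to: {request}"


async def record_shadow(request: str) -> None:
    try:
        response = await call_model("candidate", request)
        shadow_log.append({"request": request, "shadow": response})
    except Exception:
        pass  # a shadow failure must never affect the user-facing response


async def handle_request(request: str, sample_rate: float, rng: random.Random) -> str:
    if rng.random() < sample_rate:
        task = asyncio.create_task(record_shadow(request))
        background.add(task)
        task.add_done_callback(background.discard)
    # Only the production response is returned to the user.
    return await call_model("production", request)


async def main() -> list[str]:
    rng = random.Random(7)
    responses = [await handle_request(f"req-{i}", sample_rate=1.0, rng=rng) for i in range(3)]
    await asyncio.gather(*background)  # drain shadow work before the demo exits
    return responses


responses = asyncio.run(main())
print(responses[0])
print(f"shadow_logged={len(shadow_log)}")
```

Lowering `sample_rate` bounds the extra generation cost, at the price of needing more live traffic before the decision rule has enough shadow samples.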
```python
import hashlib

def in_canary(subject_id: str, percentage: int) -> bool:
    bucket = int(hashlib.sha256(subject_id.encode()).hexdigest()[:8], 16) % 100
    return bucket < percentage

subject = "tenant-42:conversation-9"
routes = ["candidate" if in_canary(subject, 10) else "control" for _ in range(3)]
print(f"routes={routes}")
print(f"sticky={len(set(routes)) == 1}")
```

```text
routes=['control', 'control', 'control']
sticky=True
```

```python
control = {"judge_scores": [0.82, 0.86, 0.84], "tokens": [600, 640, 620]}
shadow = {"judge_scores": [0.85, 0.88, 0.85], "tokens": [700, 760, 720]}

quality_delta = sum(shadow["judge_scores"]) / 3 - sum(control["judge_scores"]) / 3
token_delta = sum(shadow["tokens"]) / 3 / (sum(control["tokens"]) / 3) - 1
print(f"quality_delta={quality_delta:+.3f} token_delta={token_delta:+.1%}")
print(f"needs_cost_review={token_delta > 0.10}")
```

```text
quality_delta=+0.020 token_delta=+17.2%
needs_cost_review=True
```

The fastest way to lose user trust is to let a degraded version stay in production while engineers investigate. Once you have reliable online signals from the observability work earlier in the course, you can close the loop.
A production controller watches:
Policy should distinguish objective breaches from noisy evidence. A sustained latency, error, cost, or safety breach can automatically shift traffic to the last known good version. A small sampled-judge or business-metric delta may instead pause promotion and request review. Updating an alias is fast control-plane work, but a runtime may still need to warm or load the restored artifact.
This simplified rollback policy returns rollback for objective infrastructure and safety breaches; a quality-only breach returns pause_for_review. A real controller must also evaluate sustained windows and confidence intervals.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    max_value: float | None = None
    min_value: float | None = None
    window: str = "5m"
    action: str = "rollback"

class RollbackPolicy:
    def __init__(self):
        self.thresholds = {
            "p95_ttft_ms": Threshold(max_value=850, window="5m"),
            "groundedness_score": Threshold(
                min_value=0.82, window="15m", action="pause_for_review"
            ),
            "toxicity_rate": Threshold(max_value=0.004, window="10m"),
            "task_completion_rate": Threshold(min_value=0.91, window="30m"),
        }

    def violations(self, metrics: dict[str, float]) -> list[tuple[str, str]]:
        violations: list[tuple[str, str]] = []

        for metric, rule in self.thresholds.items():
            value = metrics.get(metric)
            if value is None:
                continue

            if rule.max_value is not None and value > rule.max_value:
                violations.append(
                    (rule.action, f"{metric}={value} exceeds max {rule.max_value} over {rule.window}")
                )
            if rule.min_value is not None and value < rule.min_value:
                violations.append(
                    (rule.action, f"{metric}={value} below min {rule.min_value} over {rule.window}")
                )

        return violations

    def decision(self, metrics: dict[str, float]) -> str:
        actions = {action for action, _ in self.violations(metrics)}
        if "rollback" in actions:
            return "rollback"
        if "pause_for_review" in actions:
            return "pause_for_review"
        return "continue"

policy = RollbackPolicy()
canary_metrics = {
    "p95_ttft_ms": 910,
    "groundedness_score": 0.79,
    "toxicity_rate": 0.001,
    "task_completion_rate": 0.93,
}

violations = policy.violations(canary_metrics)
print(f"decision={policy.decision(canary_metrics)}")
for action, reason in violations:
    print(f"{action}: {reason}")
```

```text
decision=rollback
rollback: p95_ttft_ms=910 exceeds max 850 over 5m
pause_for_review: groundedness_score=0.79 below min 0.82 over 15m
```

The controller runs this check against aggregated metrics for the current alias. Progressive-delivery systems such as Argo Rollouts can query metrics and drive automated promotion or rollback under declared analysis policies, but the same pattern works in a custom LLM gateway if aliases and registry history are immutable.[9]
Rolling back only on error rate or latency misses slow quality regressions. A groundedness drop may not trigger a 5xx spike, yet it can damage user trust. Sampled LLM-judge signals can flag this class of regression, but judge drift and sampling uncertainty mean they should be calibrated before driving automatic rollback.
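Sampling uncertainty can be made explicit by acting on an interval around the sampled judge mean instead of the point estimate. The sketch below uses a normal-approximation interval, which is crude for small samples and is only one possible choice; `mean_with_interval`, `judge_decision`, the sample scores, and the three outcomes are hypothetical. It does not address judge drift, which needs separate calibration against human-reviewed examples.

```python
import math
import statistics


def mean_with_interval(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean plus a normal-approximation interval; z=1.96 is roughly 95%."""
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def judge_decision(scores: list[float], floor: float) -> str:
    _, low, high = mean_with_interval(scores)
    if high < floor:
        return "pause_for_review"      # even the optimistic bound is below the floor
    if low < floor:
        return "collect_more_samples"  # the evidence is inconclusive
    return "continue"


floor = 0.82  # assumed groundedness floor for this example
noisy = [0.90, 0.70, 0.85, 0.75, 0.88, 0.72]
clear_drop = [0.74, 0.76, 0.75, 0.73, 0.77, 0.75]
print(judge_decision(noisy, floor))
print(judge_decision(clear_drop, floor))
```

The noisy sample has a mean below the floor but an interval that straddles it, so the sketch asks for more samples; the tight low sample triggers a review.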
```python
ttft_limit_ms = 850
windows = [790, 910, 930, 905]
required_consecutive_breaches = 3
consecutive = 0

for value in windows:
    consecutive = consecutive + 1 if value > ttft_limit_ms else 0

rollback = consecutive >= required_consecutive_breaches
print(f"last_consecutive_breaches={consecutive}")
print(f"rollback={rollback}")
```

```text
last_consecutive_breaches=3
rollback=True
```

The operational goal of advanced MLOps is sufficient provenance for diagnosis and rollback. Given any production response, an engineer should be able to answer:
This information can live in a composed release manifest linked from request traces, with model-registry versions, feature-store or index audit records, and the git commit that triggered promotion. When an incident occurs, start from an affected request timestamp and resolve the served release tuple before hypothesizing about the cause.
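When a trace lacks a release identifier, the served release can be approximated from the timestamp and the alias promotion history. This sketch assumes a sorted, hypothetical `alias_history` and a single production alias with no canary split; during a canary, the release ID recorded on the request trace remains the authoritative source.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical alias history: (activation time, release id), sorted by time.
alias_history = [
    (datetime.fromisoformat("2026-04-01T08:00:00"), "release-a77"),
    (datetime.fromisoformat("2026-04-08T14:30:00"), "release-a91"),
    (datetime.fromisoformat("2026-04-10T16:00:00"), "release-a77"),  # rollback
]


def release_at(timestamp: datetime) -> str:
    """Return the release that the production alias pointed to at a given time."""
    times = [activated for activated, _ in alias_history]
    index = bisect_right(times, timestamp) - 1
    if index < 0:
        raise LookupError("timestamp predates the first recorded promotion")
    return alias_history[index][1]


incident_time = datetime.fromisoformat("2026-04-10T12:00:00")
print(f"served_release={release_at(incident_time)}")
print(f"after_rollback={release_at(datetime.fromisoformat('2026-04-10T18:00:00'))}")
```

The lookup resolves the incident request to `release-a91` and a later request to the restored `release-a77`, which is the tuple to inspect before forming a hypothesis.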
```python
release_manifests = {
    "release-a91": {
        "prompt": "[email protected]",
        "model": "[email protected]",
        "feature_view": "releases@5",
        "git_commit": "9f3c2a1",
    }
}
trace = {"request_id": "req_8a2", "release_id": "release-a91", "region": "us-west"}
served = release_manifests[trace["release_id"]]
print(f"request={trace['request_id']} region={trace['region']}")
print(f"prompt={served['prompt']} feature_view={served['feature_view']} commit={served['git_commit']}")
```

```text
request=req_8a2 region=us-west
[email protected] feature_view=releases@5 commit=9f3c2a1
```

With these practices in place, a team can audit a bad answer, identify the served tuple, and restore a known-good release without guessing which hidden input changed.
[1] Continuous Delivery for Machine Learning. Sato, D., Wider, A., & Windheuser, C. · 2019
[2] Challenges in Deploying Machine Learning: a Survey of Case Studies. Paleyes, A., Urma, R. G., & Lawrence, N. D. · 2022 · ACM Computing Surveys
[3] Hidden Technical Debt in Machine Learning Systems. Sculley et al. · 2015
[4] What is LLMOps? LLM Operations Guide. MLflow · 2026
[5] OpenGitOps Principles. OpenGitOps Project · 2026
[6] Model Registry Workflows | MLflow AI Platform. MLflow · 2026
[7] Feast: Production Feature Store for Machine Learning. Feast Contributors · 2024
[8] Prompt Engineering Concepts. LangChain · 2026
[9] Argo Rollouts - Kubernetes Progressive Delivery Controller. Argo Project · 2026