LeetLLM
© 2026 LeetLLM. All rights reserved.

Advanced MLOps & DevOps for AI

Master advanced MLOps and DevOps patterns for LLM systems: GitOps for prompts and models, feature stores for embedding features, automated rollback on eval regression, shadow traffic, and model registries with rollback metadata.


Reasoning models can spend more inference compute on hard requests. That makes releases trickier: a prompt, router, judge, feature, or model alias change can alter cost and quality even when the base weights stay fixed. Advanced MLOps (Machine Learning Operations) and DevOps (Development and Operations) for AI are the release and recovery practices that keep those moving pieces auditable.

Your DeployBuddy release assistant has been stable for weeks on prompt version [email protected] and Low-Rank Adaptation (LoRA) adapter v1.9. A teammate adds a small improvement: the assistant now pulls recent failure count and average queue time from a feature store, a platform for versioned model inputs used during evaluation and serving. The extra values should improve SLA estimates during regional load spikes. The change passes the offline golden-set evaluation, a fixed collection of reviewed requests and expected behavior, plus the safety scanner and latency budget. It's promoted through staging, a 2% canary, and finally the production alias.

Forty-eight hours later, users in the western region start receiving SLA estimates that are off by a full day, even though the base model weights, system prompt, and LoRA adapter are unchanged. The answers degraded because another release input changed.

Silent training-serving skew in release_pattern_embedding caused the failure. The batch job that built the golden set chunked and normalized release-history text one way, but the online feature materialization path used a different preprocessing configuration. Although the model registry recorded the adapter and prompt version, it had no record of the exact feature computation graph or the timestamped feature values present during evaluation. Basic model versioning wasn't enough once features and retrieval logic entered the picture.
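
A minimal sketch of how this kind of skew can change behavior, assuming retrieval ranks by dot product and that only the offline path L2-normalizes vectors. The vectors and document names are made up for illustration and are not taken from the DeployBuddy system.

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def top_doc(query: list[float], docs: dict[str, list[float]]) -> str:
    # Rank by raw dot product, so vector magnitude matters unless normalized.
    return max(docs, key=lambda name: dot(query, docs[name]))

# Hypothetical raw embeddings: doc_b points the same way as the query,
# doc_a is less aligned but has a much larger magnitude.
query = [1.0, 0.0]
raw_docs = {"doc_a": [3.0, 4.0], "doc_b": [0.9, 0.1]}

offline_docs = {name: l2_normalize(vec) for name, vec in raw_docs.items()}
offline_top = top_doc(query, offline_docs)  # path used to build the golden set
online_top = top_doc(query, raw_docs)       # serving path that skipped normalization

print(f"offline_top={offline_top} online_top={online_top}")
print(f"skew={offline_top != online_top}")
```

The same model and the same documents produce a different top result, which is why the preprocessing configuration has to be part of the release record.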

Earlier in Model Versioning & Continuous Deployment, you learned about immutable artifacts, aliases, canary traffic, and automated evaluation gates [1] (Continuous Delivery for Machine Learning, https://martinfowler.com/articles/cd4ml.html) [2] (Challenges in Deploying Machine Learning: a Survey of Case Studies, https://arxiv.org/abs/2011.09926). Those ideas handle the model release. Advanced MLOps and DevOps for AI adds the data, prompt, feature, eval, and rollback layers around it, which is where much of the hidden technical debt in production ML systems accumulates [3] (Hidden Technical Debt in Machine Learning Systems, https://research.google/pubs/hidden-technical-debt-in-machine-learning-systems/).

Mature LLM teams handle that with GitOps-driven pipelines for prompts and models, feature stores that keep embeddings consistent, safe shadow and canary experiments for prompts and models, and observability wired directly into automated rollback policies.

Figure: GitOps release path where one reviewed change turns into a pinned tuple of prompt, feature view, model alias, and rollback policy, then moves through CI, frozen stores, and bounded delivery.
GitOps turns prompt, feature, model, and rollback state into one reviewed release tuple instead of runtime drift.

Why LLMOps extends MLOps

Classic MLOps already deals with data drift, feature stores, reproducibility, and probabilistic models. LLMOps extends that discipline because application behavior can also change through prompts, retrieval indexes, tool schemas, judges, routing, and generation settings, even when model weights stay fixed [4] (What is LLMOps? LLM Operations Guide, https://mlflow.org/llmops) [3] (Hidden Technical Debt in Machine Learning Systems, https://research.google/pubs/hidden-technical-debt-in-machine-learning-systems/).

| Dimension | Classic MLOps | LLMOps |
| --- | --- | --- |
| Core artifact | Model binary, feature store | Prompts, RAG configs, embeddings, eval sets, plus the model |
| Output behavior | Often scored against labeled targets | Open-ended or sampled output often needs rubric-based scoring |
| Tests | Unit tests plus task metrics | Those tests plus semantic eval, golden sets, and safety checks |
| CI cost | Data and model evaluation can be expensive | Generation-based evals add token, latency, and judge costs |
| Monitoring | Accuracy, data drift, service health | Groundedness, refusals, policy violations, cost, plus drift |

Three ideas drive the differences:

  • Open-ended and sampled output. The same prompt may produce different text on repeated calls, and even deterministic generation can have several acceptable responses. Exact equality remains useful for structured subcomponents; end-to-end quality usually needs scored evaluation.
  • Prompts (and RAG config, guardrails, eval sets) are first-class artifacts. A one-line prompt edit can materially shift refusal rate or grounding without changing weights, so prompts get versioned, reviewed, and rolled back like code, not buried in application strings.
  • Eval gates join the test suite. Many generated answers don't have one exact-string target. Promotion therefore needs scored evaluation against curated golden sets in addition to ordinary unit tests. A common eval-driven workflow writes the golden set before the prompt change.
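
To make the contrast between exact equality and scored evaluation concrete, here is a minimal sketch. The rubric (required facts plus forbidden phrases), the golden case, and the `rubric_score` helper are all hypothetical; a real eval would typically combine rules like these with a calibrated judge.

```python
def rubric_score(answer: str, required: list[str], forbidden: list[str]) -> float:
    """Fraction of required facts present, zeroed if any forbidden phrase appears."""
    text = answer.lower()
    if any(phrase.lower() in text for phrase in forbidden):
        return 0.0
    hits = sum(1 for fact in required if fact.lower() in text)
    return hits / len(required)

# Hypothetical golden-set entry for an SLA question.
golden_case = {
    "required": ["2 business days", "us-west"],
    "forbidden": ["guaranteed"],
    "reference": "Expect delivery in 2 business days for us-west.",
}
candidate_answer = "For us-west, the estimate is 2 business days."

exact_match = candidate_answer == golden_case["reference"]
score = rubric_score(candidate_answer, golden_case["required"], golden_case["forbidden"])
print(f"exact_match={exact_match} rubric_score={score:.2f}")
```

The answer fails string equality but satisfies every rubric item, which is the case an eval gate needs to handle.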

Mature teams operationalize those three ideas with GitOps to version every input, feature stores to stop silent skew, eval-gated CI/CD, progressive delivery, and observability wired into automated rollback.

GitOps for LLM systems

GitOps treats your prompts, feature definitions, model aliases, and deployment configuration like application code [5] (OpenGitOps Principles, https://opengitops.dev/). The desired state lives in a git repository and ordinary promotions flow through reviewed changes. A production design still needs an audited break-glass path for incidents, followed by reconciliation back into git.

A typical LLM GitOps repository looks like this:

```text
llm-ops-repo/
├── prompts/
│   ├── support/
│   │   ├── [email protected]
│   │   ├── [email protected]
│   │   └── tests/
│   │       └── support_golden.jsonl
│   └── policy/
│       └── [email protected]
├── features/
│   └── release_history/
│       ├── feature_view.py      # Feast-style definition
│       └── entity_keys.yaml
├── models/
│   └── deploy-assistant/
│       ├── [email protected]
│       ├── [email protected]   # points to registry URI + hash
│       └── promotion-policy.yaml
├── deployments/
│   └── production/
│       └── aliases.yaml         # production -> [email protected]
└── .github/workflows/
    └── llm-promote.yml
```

The pipeline on merge does the following:

  1. Parse changed files and determine what must be re-evaluated.
  2. Run fast checks (prompt linting, schema validation, cost estimation).
  3. Run the offline evaluation suite against the exact feature snapshot pinned in the PR.
  4. If gates pass, publish new immutable versions to the prompt registry and model registry.
  5. Update the alias file or deployment manifest.
  6. The GitOps operator (Argo CD or Flux) detects the change in the repo and applies it to the cluster.
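
Step 1 of the pipeline can be sketched as a lookup from changed paths to the gates they trigger. The gate names and the `GATES_BY_AREA` mapping are assumptions chosen to match the repository layout above, not part of any specific CI product.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level repo directory to the gates it triggers.
GATES_BY_AREA = {
    "prompts": {"prompt_lint", "golden_set_eval"},
    "features": {"schema_check", "parity_test", "golden_set_eval"},
    "models": {"golden_set_eval", "cost_estimate"},
    "deployments": {"manifest_validation"},
}

def required_gates(changed_files: list[str]) -> set[str]:
    gates: set[str] = set()
    for path in changed_files:
        area = PurePosixPath(path).parts[0]
        # Unknown areas fall back to the full evaluation to stay on the safe side.
        gates |= GATES_BY_AREA.get(area, {"golden_set_eval"})
    return gates

changed = [
    "features/release_history/feature_view.py",
    "deployments/production/aliases.yaml",
]
print(sorted(required_gates(changed)))
```

Defaulting unknown paths to the full evaluation is a deliberate choice here: a missed re-evaluation is usually more costly than an unnecessary one.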

Because desired state lives in git, an ordinary rollback is a revert commit or an alias-manifest change followed by the pipeline. An emergency controller may revert traffic before that commit lands, but it must emit an audit record and reconcile desired state so Git doesn't immediately re-apply the failed candidate.

Model registries such as MLflow support aliases that can be reassigned independently of production code. The runtime can load models:/deploy-assistant@production while the registry points that alias to a different immutable version [6] (Model Registry Workflows | MLflow AI Platform, https://mlflow.org/docs/latest/ml/model-registry/workflow/).
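
The alias idea can be illustrated with a toy in-memory registry. This is not the MLflow API; the `AliasRegistry` class and the artifact URIs are hypothetical and only show why write-once versions plus movable aliases make rollback cheap.

```python
class AliasRegistry:
    """Toy in-memory registry: versions are write-once, aliases are movable."""

    def __init__(self) -> None:
        self._versions: dict[int, str] = {}
        self._aliases: dict[str, int] = {}

    def publish(self, version: int, artifact_uri: str) -> None:
        if version in self._versions:
            raise ValueError(f"version {version} is immutable")
        self._versions[version] = artifact_uri

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self._aliases[alias] = version

    def resolve(self, alias: str) -> str:
        return self._versions[self._aliases[alias]]

registry = AliasRegistry()
registry.publish(1, "s3://example-bucket/adapters/v1")  # hypothetical URIs
registry.publish(2, "s3://example-bucket/adapters/v2")

registry.set_alias("production", 2)
promoted = registry.resolve("production")

registry.set_alias("production", 1)  # rollback: move the alias, keep both versions
rolled_back = registry.resolve("production")

print(f"promoted={promoted}")
print(f"rolled_back={rolled_back}")
```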

Putting prompt Markdown files in git is the easy part; the pull request must evaluate against the same feature definitions, golden sets, guardrails, and alias state that production will use.

release-tuple-fingerprint.py
```python
import hashlib
import json

def release_id(release: dict[str, str]) -> str:
    payload = json.dumps(release, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:10]

baseline = {"prompt": "[email protected]", "model": "[email protected]", "feature_view": "releases@4"}
candidate = baseline | {"feature_view": "releases@5"}
print(f"baseline={release_id(baseline)} candidate={release_id(candidate)}")
print(f"same_release={release_id(baseline) == release_id(candidate)}")
```
Output
```text
baseline=698df6d7ad candidate=f91be2a454
same_release=False
```

The changed feature view creates a new composed release even though prompt and model identifiers are unchanged.

break-glass-reconciliation.py
```python
git_desired_alias = "[email protected]"
runtime_alias = "[email protected]"
break_glass_event = {"reason": "latency breach", "actor": "rollout-controller"}

needs_reconciliation = git_desired_alias != runtime_alias
print(f"runtime_alias={runtime_alias} audited={bool(break_glass_event)}")
print(f"open_revert_commit={needs_reconciliation}")
```
Output
```text
[email protected] audited=True
open_revert_commit=True
```
Figure: GitOps promotion flow showing reviewed change, CI gates, immutable publish, shadow traffic, sticky canary, reversible alias move, and audited rollback.
Promotion stays reversible because reviewed desired state, immutable publish, shadow checks, sticky canary, and alias control all point at the same release record.

Feature stores for embeddings and LLM features

Embedding pipelines are a common source of training-serving skew in modern LLM systems. A RAG pipeline or a personalized prompt may depend on dozens of embedding features: user history summary vectors, document chunk embeddings, category affinity vectors, and time-decayed engagement signals.

A feature store (such as Feast, Tecton, or a custom layer around Redis, object storage, and a vector database) [7] (Feast: Production Feature Store for Machine Learning, https://feast.dev/) gives you mechanisms for three problems, provided the definitions and materialization paths are versioned and tested:

  • Point-in-time correctness: When you train or evaluate offline, you can request the exact feature values that were available at a historical timestamp. No future data leaks into the training set.
  • Consistency checks: A shared feature view and parity tests can detect whether batch and online paths produce compatible values. A feature store doesn't repair divergent transformation logic by itself.
  • Lineage: Every embedding vector should carry metadata about the exact model checkpoint, chunker version, and preprocessing hash that produced it.

Feature platforms such as Feast can provide offline stores, online stores, and point-in-time feature retrieval. Treat the feature platform as one part of the lineage system, not a substitute for versioning. If your RAG stack uses a separate vector database, the vector index version, corpus snapshot, embedding model, chunker, and normalization hash still need to be versioned beside the feature view.
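
One way to carry that lineage is to store a fingerprint of the build inputs next to each vector. The sketch below assumes a hypothetical `EmbeddingLineage` record; the field names and version strings are placeholders, not a feature-store schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EmbeddingLineage:
    embedding_model: str
    chunker: str
    normalization: str
    corpus_snapshot: str
    index_version: str

    def fingerprint(self) -> str:
        # Sorted keys keep the hash stable regardless of field order.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

# All identifiers below are hypothetical placeholders.
lineage = EmbeddingLineage(
    embedding_model="embed@2",
    chunker="sentence@3",
    normalization="l2",
    corpus_snapshot="2026-04-10",
    index_version="hnsw@7",
)
record = {"vector": [0.12, -0.40, 0.91], "lineage": lineage.fingerprint()}

# Rebuilding from the same inputs must reproduce the stored fingerprint.
rebuilt = EmbeddingLineage(**asdict(lineage))
print(f"lineage={record['lineage']}")
print(f"reproducible={rebuilt.fingerprint() == record['lineage']}")
```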

In practice, your feature view for release history might look like this conceptual Feast-style sketch:

feature-stores-for-embeddings-and-llm.py
```python
from feast import Entity, FeatureView, Field
from feast.types import Array, Float32, Int64
from datetime import timedelta

release_history = Entity(name="user_id", join_keys=["user_id"])

# Conceptual sketch: release_history_source is assumed to be a data source
# defined elsewhere in the feature repository.
release_history_view = FeatureView(
    name="release_history_embeddings",
    entities=[release_history],
    ttl=timedelta(days=90),
    schema=[
        Field(name="recent_failure_count", dtype=Int64),
        Field(name="avg_queue_minutes_30d", dtype=Float32),
        Field(name="release_pattern_embedding", dtype=Array(Float32)),
    ],
    source=release_history_source,
)
```

At serving time the LLM gateway calls the feature store with the current user_id and request timestamp. The store returns online feature values or stored vectors associated with that entity. Historical retrieval for evaluation should be point-in-time correct; online/offline parity still needs tests and logged fingerprints for any request-time transformation.

Without a declared feature layer and parity checks, teams easily copy embedding code between training notebooks and serving services. Over time the two copies can diverge and quality can degrade as in the DeployBuddy story.
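
A value-level parity check complements configuration fingerprints by comparing what the two paths actually return for sampled entities. This is a minimal sketch; the `parity_report` helper, the tolerance, and the sample vectors are assumptions for illustration.

```python
def max_abs_diff(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("dimension mismatch between offline and online vectors")
    return max(abs(x - y) for x, y in zip(a, b))

def parity_report(
    offline: dict[str, list[float]],
    online: dict[str, list[float]],
    tol: float = 1e-5,
) -> dict[str, float]:
    """Return entities whose online vector differs from the offline one beyond tol."""
    failures: dict[str, float] = {}
    for entity_id, expected in offline.items():
        diff = max_abs_diff(expected, online[entity_id])
        if diff > tol:
            failures[entity_id] = diff
    return failures

# Hypothetical sampled entities: user_2's online vector was never normalized.
offline_vectors = {"user_1": [0.6, 0.8], "user_2": [0.6, 0.8]}
online_vectors = {"user_1": [0.6, 0.8], "user_2": [3.0, 4.0]}

failures = parity_report(offline_vectors, online_vectors)
print(f"failing_entities={sorted(failures)}")
print(f"parity={not failures}")
```

Running a check like this in CI and on a schedule in production would surface divergence before it shows up as degraded answers.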

point-in-time-feature-join.py
```python
from datetime import datetime

event_time = datetime.fromisoformat("2026-04-10T12:00:00")
feature_history = [
    (datetime.fromisoformat("2026-04-09T09:00:00"), 4),
    (datetime.fromisoformat("2026-04-10T09:00:00"), 5),
    (datetime.fromisoformat("2026-04-11T09:00:00"), 8),
]
eligible = [row for row in feature_history if row[0] <= event_time]
timestamp, recent_failure_count = max(eligible)
print(f"event_date={event_time.date()} feature_date={timestamp.date()}")
print(f"recent_failure_count={recent_failure_count}")
```
Output
```text
event_date=2026-04-10 feature_date=2026-04-10
recent_failure_count=5
```
feature-parity-fingerprint.py
```python
import hashlib
import json

def fingerprint(config: dict[str, str]) -> str:
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]

offline = {"chunker": "sentence@3", "embedding": "embed@2", "normalize": "l2"}
online = {"chunker": "sentence@3", "embedding": "embed@2", "normalize": "none"}
print(f"offline={fingerprint(offline)} online={fingerprint(online)}")
print(f"parity={fingerprint(offline) == fingerprint(online)}")
```
Output
```text
offline=ce667443 online=660645c4
parity=False
```

Continuous Integration/Continuous Deployment (CI/CD) for prompts and models at scale

Prompts deserve the same release discipline as model weights. A one-line system-prompt change can materially shift refusal rates, tone, or factual grounding without changing weights. Prompts therefore live in their own registry with semantic versions, test suites, and promotion policies.

A production prompt release manifest can contain the following [8] (Prompt Engineering Concepts, https://docs.langchain.com/langsmith/prompt-engineering-concepts):

  • The exact template text (with variable placeholders)
  • Few-shot examples, rubric snippets, or tool schemas pinned by hash
  • Associated guardrail policy version
  • Evaluation results on multiple golden sets (helpfulness, groundedness, safety, task-specific accuracy)
  • Cost and latency characteristics under different generation settings
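
The manifest fields above could be modeled roughly as follows. `PromptManifest`, its field names, and the version strings are hypothetical; the placeholder scan assumes simple `{name}` variables only.

```python
import hashlib
import string
from dataclasses import dataclass, field

def sha256_short(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:12]

@dataclass(frozen=True)
class PromptManifest:
    name: str
    version: str
    template: str
    few_shot_hash: str
    guardrail_policy: str
    eval_scores: dict[str, float] = field(default_factory=dict)

    @property
    def template_hash(self) -> str:
        return sha256_short(self.template)

    def missing_variables(self, provided: set[str]) -> set[str]:
        # Collect {placeholder} names from the template and report any not supplied.
        needed = {name for _, name, _, _ in string.Formatter().parse(self.template) if name}
        return needed - provided

manifest = PromptManifest(
    name="support",
    version="2.4.0",  # placeholder version
    template="You are a release assistant. Region: {region}. Question: {question}",
    few_shot_hash=sha256_short("example-1\nexample-2"),
    guardrail_policy="policy@3",
    eval_scores={"groundedness": 0.87, "safety": 0.99},
)
print(f"template_hash={manifest.template_hash}")
print(f"missing={manifest.missing_variables({'region'})}")
```

A check like `missing_variables` fits the fast gate because it needs no model call.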

Because evals call the model, they are slower and more expensive than ordinary unit tests. Teams tier the gates so checks without model calls fail fast before any paid generation runs [4] (What is LLMOps? LLM Operations Guide, https://mlflow.org/llmops):

  1. Fast gate (no model calls, every PR): schema and format validation against cached responses, prompt linting, hash checks, static cost estimation. Runs in seconds and catches most broken changes.
  2. Eval gate (paid, on prompt or model change): render the prompt against the golden set using the exact model alias declared in the PR, run the LLM-as-judge and rule-based safety checks, and compute estimated token cost for the target traffic volume.

Only after both gates pass does CI publish [email protected] and update the candidate alias.
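
The tiering can be sketched as a runner that stops before the paid eval when any free check fails. The check functions here are stand-ins, and `paid_calls` simply records whether the expensive step ran.

```python
from typing import Callable

def run_tiered_gates(
    fast_checks: list[Callable[[], bool]],
    paid_eval: Callable[[], bool],
) -> dict[str, object]:
    """Run free checks first; only call the paid eval if all of them pass."""
    for check in fast_checks:
        if not check():
            return {"passed": False, "failed_at": check.__name__, "paid_eval_ran": False}
    passed = paid_eval()
    return {
        "passed": passed,
        "failed_at": None if passed else paid_eval.__name__,
        "paid_eval_ran": True,
    }

paid_calls: list[str] = []

def schema_valid() -> bool:
    return True

def prompt_lint_clean() -> bool:
    return False  # simulate a lint failure, e.g. an undeclared template variable

def golden_set_eval() -> bool:
    paid_calls.append("golden_set_eval")  # stands in for paid model and judge calls
    return True

result = run_tiered_gates([schema_valid, prompt_lint_clean], golden_set_eval)
print(result)
print(f"paid_calls={len(paid_calls)}")
```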

prompt-promotion-gates.py
```python
baseline = {"quality": 0.84, "tokens": 620, "p95_ms": 820}
candidate = {"quality": 0.87, "tokens": 790, "p95_ms": 970}
limits = {"max_token_increase": 0.15, "max_p95_ms": 1000}

token_increase = candidate["tokens"] / baseline["tokens"] - 1
quality_improved = candidate["quality"] > baseline["quality"]
passes = (
    quality_improved
    and token_increase <= limits["max_token_increase"]
    and candidate["p95_ms"] <= limits["max_p95_ms"]
)
print(f"quality_improved={quality_improved} token_increase={token_increase:.1%}")
print(f"promote={passes}")
```
Output
```text
quality_improved=True token_increase=27.4%
promote=False
```

The same pipeline supports shadow traffic for prompts. The gateway receives the live request, calls the production prompt version for the user, and in parallel (or on a sampled fraction) calls the candidate prompt version against the same model and features. Shadow responses are logged but never shown to users. After enough shadow requests for the predefined decision rule, the team compares judge scores, latency, and cost before deciding whether to start a small sticky canary.

Shadow traffic is especially useful for prompts because a bad shared template can still have a broad blast radius. The cost is still real, especially for reasoning models, but it's bounded and easier to justify than showing untested answers to customers.
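
A minimal sketch of the shadow path, assuming a hypothetical `handle_request` gateway function. It runs the candidate sequentially for simplicity, whereas a real gateway would usually run it asynchronously or on a sampled fraction so user latency is unaffected.

```python
from typing import Callable

shadow_log: list[dict[str, str]] = []

def handle_request(
    request: str,
    production: Callable[[str], str],
    candidate: Callable[[str], str],
    shadow_enabled: bool,
) -> str:
    """Serve the production answer; record the candidate answer without returning it."""
    served = production(request)
    if shadow_enabled:
        try:
            shadow_log.append({"request": request, "served": served, "shadow": candidate(request)})
        except Exception as exc:  # a failing candidate must never affect the user
            shadow_log.append({"request": request, "served": served, "shadow_error": repr(exc)})
    return served

# Stand-in callables; a real gateway would call the model with each prompt version.
def production_prompt(request: str) -> str:
    return f"[prod] {request}"

def candidate_prompt(request: str) -> str:
    raise TimeoutError("candidate timed out")

answer = handle_request("ETA for us-west?", production_prompt, candidate_prompt, shadow_enabled=True)
print(f"answer={answer}")
print(f"shadow_records={len(shadow_log)} error_logged={'shadow_error' in shadow_log[0]}")
```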

sticky-canary-routing.py
```python
import hashlib

def in_canary(subject_id: str, percentage: int) -> bool:
    bucket = int(hashlib.sha256(subject_id.encode()).hexdigest()[:8], 16) % 100
    return bucket < percentage

subject = "tenant-42:conversation-9"
routes = ["candidate" if in_canary(subject, 10) else "control" for _ in range(3)]
print(f"routes={routes}")
print(f"sticky={len(set(routes)) == 1}")
```
Output
```text
routes=['control', 'control', 'control']
sticky=True
```
shadow-quality-cost-delta.py
```python
control = {"judge_scores": [0.82, 0.86, 0.84], "tokens": [600, 640, 620]}
shadow = {"judge_scores": [0.85, 0.88, 0.85], "tokens": [700, 760, 720]}

quality_delta = sum(shadow["judge_scores"]) / 3 - sum(control["judge_scores"]) / 3
token_delta = sum(shadow["tokens"]) / 3 / (sum(control["tokens"]) / 3) - 1
print(f"quality_delta={quality_delta:+.3f} token_delta={token_delta:+.1%}")
print(f"needs_cost_review={token_delta > 0.10}")
```
Output
```text
quality_delta=+0.020 token_delta=+17.2%
needs_cost_review=True
```

Automated rollback on evaluation regression

The fastest way to lose user trust is to let a degraded version stay in production while engineers investigate. Once you have reliable online signals from the observability work earlier in the course, you can close the loop.

A production controller watches:

  • Infrastructure metrics (TTFT p95, 5xx rate, queue depth)
  • Business metrics (task completion rate, thumbs down rate)
  • Quality signals (LLM-as-judge groundedness, refusal rate, toxicity classifier score on a continuous sample)

Policy should distinguish objective breaches from noisy evidence. A sustained latency, error, cost, or safety breach can automatically shift traffic to the last known good version. A small sampled-judge or business-metric delta may instead pause promotion and request review. Updating an alias is fast control-plane work, but a runtime may still need to warm or load the restored artifact.

Figure: Canary rollback view showing infra and quality signals against promotion thresholds.
Objective breaches can revert automatically, while weaker sampled-quality signals can pause promotion for review.

This simplified rollback policy returns rollback for objective infrastructure and safety breaches; a quality-only breach returns pause_for_review. A real controller must also evaluate sustained windows and confidence intervals.

automated-rollback-on-evaluation-regression.py
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    max_value: float | None = None
    min_value: float | None = None
    window: str = "5m"
    action: str = "rollback"

class RollbackPolicy:
    def __init__(self):
        self.thresholds = {
            "p95_ttft_ms": Threshold(max_value=850, window="5m"),
            "groundedness_score": Threshold(
                min_value=0.82, window="15m", action="pause_for_review"
            ),
            "toxicity_rate": Threshold(max_value=0.004, window="10m"),
            "task_completion_rate": Threshold(min_value=0.91, window="30m"),
        }

    def violations(self, metrics: dict[str, float]) -> list[tuple[str, str]]:
        violations: list[tuple[str, str]] = []

        for metric, rule in self.thresholds.items():
            value = metrics.get(metric)
            if value is None:
                continue

            if rule.max_value is not None and value > rule.max_value:
                violations.append(
                    (rule.action, f"{metric}={value} exceeds max {rule.max_value} over {rule.window}")
                )
            if rule.min_value is not None and value < rule.min_value:
                violations.append(
                    (rule.action, f"{metric}={value} below min {rule.min_value} over {rule.window}")
                )

        return violations

    def decision(self, metrics: dict[str, float]) -> str:
        actions = {action for action, _ in self.violations(metrics)}
        if "rollback" in actions:
            return "rollback"
        if "pause_for_review" in actions:
            return "pause_for_review"
        return "continue"

policy = RollbackPolicy()
canary_metrics = {
    "p95_ttft_ms": 910,
    "groundedness_score": 0.79,
    "toxicity_rate": 0.001,
    "task_completion_rate": 0.93,
}

violations = policy.violations(canary_metrics)
print(f"decision={policy.decision(canary_metrics)}")
for action, reason in violations:
    print(f"{action}: {reason}")
```
Output
```text
decision=rollback
rollback: p95_ttft_ms=910 exceeds max 850 over 5m
pause_for_review: groundedness_score=0.79 below min 0.82 over 15m
```

The controller runs this check against aggregated metrics for the current alias. Progressive-delivery systems such as Argo Rollouts can query metrics and drive automated promotion or rollback under declared analysis policies, but the same pattern works in a custom LLM gateway if aliases and registry history are immutable [9] (Argo Rollouts - Kubernetes Progressive Delivery Controller, https://argoproj.github.io/argo-rollouts/).

Rolling back only on error rate or latency misses slow quality regressions. A groundedness drop may not trigger a 5xx spike, yet it can damage user trust. Sampled LLM-judge signals can flag this class of regression, but judge drift and sampling uncertainty mean they should be calibrated before driving automatic rollback.
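
Sampling uncertainty can be made explicit with an interval on the judge-score difference. The sketch below uses a normal approximation with independent samples, which is crude at this sample size and stated here only as an assumption; the scores are invented.

```python
import math
from statistics import mean, stdev

def delta_confidence_interval(
    control: list[float], candidate: list[float], z: float = 1.96
) -> tuple[float, float]:
    """Approximate 95% interval for mean(candidate) - mean(control)."""
    delta = mean(candidate) - mean(control)
    stderr = math.sqrt(
        stdev(control) ** 2 / len(control) + stdev(candidate) ** 2 / len(candidate)
    )
    return delta - z * stderr, delta + z * stderr

def judge_signal(control: list[float], candidate: list[float]) -> str:
    low, high = delta_confidence_interval(control, candidate)
    if high < 0:
        return "regression"    # whole interval below zero
    if low > 0:
        return "improvement"   # whole interval above zero
    return "inconclusive"      # interval straddles zero: do not act automatically

# Hypothetical sampled judge scores.
control = [0.84, 0.86, 0.82, 0.85, 0.83, 0.86, 0.84, 0.85]
noisy = [0.83, 0.86, 0.80, 0.86, 0.82, 0.85, 0.84, 0.84]
clear_drop = [0.74, 0.76, 0.72, 0.75, 0.73, 0.76, 0.74, 0.75]

print(f"noisy={judge_signal(control, noisy)}")
print(f"clear_drop={judge_signal(control, clear_drop)}")
```

An inconclusive result maps naturally to pausing promotion for review, while a clear regression can feed a stricter action once the judge itself has been calibrated.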

sustained-breach-window.py
```python
ttft_limit_ms = 850
windows = [790, 910, 930, 905]
required_consecutive_breaches = 3
consecutive = 0

for value in windows:
    consecutive = consecutive + 1 if value > ttft_limit_ms else 0

rollback = consecutive >= required_consecutive_breaches
print(f"last_consecutive_breaches={consecutive}")
print(f"rollback={rollback}")
```
Output
```text
last_consecutive_breaches=3
rollback=True
```

Putting it all together: lineage and provenance

The operational goal of advanced MLOps is sufficient provenance for diagnosis and rollback. Given any production response, an engineer should be able to answer:

  • Which exact prompt version and guardrail policy produced it?
  • Which model weights and adapter were active?
  • Which feature values (with exact timestamps and computation hashes) were fed into the prompt?
  • Which golden set version and judge prompt were used to approve this release?
  • What was the commit SHA that last touched any of the above?

This information can live in a composed release manifest linked from request traces, with model-registry versions, feature-store or index audit records, and the git commit that triggered promotion. When an incident occurs, start from an affected request timestamp and resolve the served release tuple before hypothesizing about the cause.

Figure: Response lineage lookup that starts from one request trace, resolves one immutable release tuple, then compares behavior, data approval, and serving artifacts against the last known good release.
Start from one request trace. Resolve one release tuple. Compare that exact tuple with last known good before guessing at root cause.

response-lineage-lookup.py
```python
release_manifests = {
    "release-a91": {
        "prompt": "[email protected]",
        "model": "[email protected]",
        "feature_view": "releases@5",
        "git_commit": "9f3c2a1",
    }
}
trace = {"request_id": "req_8a2", "release_id": "release-a91", "region": "us-west"}
served = release_manifests[trace["release_id"]]
print(f"request={trace['request_id']} region={trace['region']}")
print(f"prompt={served['prompt']} feature_view={served['feature_view']} commit={served['git_commit']}")
```
Output
```text
request=req_8a2 region=us-west
[email protected] feature_view=releases@5 commit=9f3c2a1
```

Follow-up questions

Evaluation rubric

  • Foundational: Explain why prompt, feature, eval, and alias changes can break a release even when the base model weights stay fixed.
  • Intermediate: Design a GitOps release path where reviewed merges trigger fast checks, eval gates, immutable publish, shadow traffic, and sticky canary promotion.
  • Intermediate: Explain how versioned feature views, point-in-time retrieval, and parity checks reduce training-serving skew for embedding-based retrieval and personalization features.
  • Advanced: Define rollback rules that combine latency, cost, safety, task success, and sampled LLM-judge quality signals without flapping on noise.
  • Advanced: Trace one production answer from request ID to prompt version, model adapter, feature snapshot, guardrail policy, serving image, eval gate, and commit SHA.
  • Advanced: Defend why dashboards alone aren't enough, and why observability must feed promotion and rollback policy.

Common pitfalls

  • Prompt stored only in app code or an environment variable. Symptom: nobody can prove which prompt served a bad answer. Fix: publish prompt versions with hashes, aliases, eval results, cost metadata, and rollback history.
  • Embedding model pinned, preprocessing unpinned. Symptom: offline recall looks strong, but live retrieval misses obvious documents. Fix: version the embedding checkpoint, chunker, normalization, metadata filters, and index build job together.
  • Canary routing isn't sticky. Symptom: a user sees the control answer on one page refresh and the candidate answer on the next. Fix: route by stable user, tenant, or conversation hash during the whole rollout.
  • Rollback watches only infrastructure metrics. Symptom: latency and 5xx stay healthy while groundedness or refusal behavior degrades. Fix: include sampled judge scores, safety classifiers, task completion, and support escalations.
  • Feature definitions live in notebooks and serving code. Symptom: batch eval and online serving diverge silently. Fix: declare feature views once, test both historical and online retrieval paths, and log feature hashes in traces.
  • GitOps means "Argo CD is installed." Symptom: deployment manifests are declarative, but eval datasets, golden sets, and promotion policies are mutable side inputs. Fix: version every release input that can change the answer.
  • Rollback flaps. Symptom: transient spikes bounce traffic between versions. Fix: require sustained windows, cooldowns, confidence checks, and human review for noisy quality signals.
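
The sticky-routing fix above can be sketched as a deterministic hash of a stable key. This is a minimal illustration under stated assumptions, not a specific router's API: the function names, the `salt` value, and the 100-bucket scheme are all hypothetical.

```python
import hashlib

def canary_bucket(key: str, salt: str = "rollout-1") -> int:
    """Map a stable key (user, tenant, or conversation ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def route(key: str, canary_percent: int, salt: str = "rollout-1") -> str:
    """Send the key to the candidate only if its bucket falls under the canary share."""
    return "candidate" if canary_bucket(key, salt) < canary_percent else "control"

# The same key gets the same arm on every request during the rollout.
first = route("tenant-17", canary_percent=10)
second = route("tenant-17", canary_percent=10)
print(first == second)
```

Because the bucket depends only on the salt and the key, raising `canary_percent` from 5 to 10 keeps every key that was already on the candidate where it was and only adds new keys. Changing the salt reshuffles assignments, so it would normally stay fixed for the duration of one rollout.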

Release controls that matter

  • Prompts, features, aliases, and promotion policies belong in reviewed desired state, with an audited and reconciled emergency path.
  • Embeddings and other LLM features need versioned definitions, point-in-time historical retrieval, parity checks, and freshness monitoring.
  • Shadow and canary apply to prompts as much as to models. Extra LLM calls have measurable cost; promotion policy should budget that cost against the risk of showing untested answers to customers.
  • Objective sustained breaches can trigger automatic rollback; noisy judge and business signals may require a promotion pause and human review.
  • Production responses should resolve to the prompt, model artifact, feature snapshot, and evaluation gate that shaped them.
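
One way to express the split between automatic rollback and pause-for-review is a small rule evaluator that acts only on sustained breaches. The sketch below is illustrative: the `Rule` fields, metric names, thresholds, and window counts are hypothetical and not taken from any particular rollout controller.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str
    limit: float
    direction: str          # "max": breach when value > limit; "min": breach when value < limit
    action: str             # "rollback" or "pause_for_review"
    sustained_windows: int  # consecutive breaching windows required (assumed >= 1)

def breached(rule, value):
    return value > rule.limit if rule.direction == "max" else value < rule.limit

def decide(rules, windows):
    """windows: list of {metric: value} dicts, oldest first. Returns the strongest triggered action."""
    severity = {"continue": 0, "pause_for_review": 1, "rollback": 2}
    decision = "continue"
    for rule in rules:
        recent = windows[-rule.sustained_windows:]
        if len(recent) < rule.sustained_windows:
            continue  # not enough evidence yet
        sustained = all(rule.metric in w and breached(rule, w[rule.metric]) for w in recent)
        if sustained and severity[rule.action] > severity[decision]:
            decision = rule.action
    return decision

# Hypothetical policy: objective latency breach rolls back, noisy judge score pauses.
rules = [
    Rule("p95_latency_ms", 1200, "max", "rollback", sustained_windows=3),
    Rule("judge_score", 0.80, "min", "pause_for_review", sustained_windows=3),
]

spike = [
    {"p95_latency_ms": 900, "judge_score": 0.90},
    {"p95_latency_ms": 1500, "judge_score": 0.90},
    {"p95_latency_ms": 950, "judge_score": 0.90},
]
print(decide(rules, spike))
```

A single-window spike leaves the decision at `continue`, which is the anti-flapping behavior the pitfalls describe. A production controller would also need cooldowns and minimum sample counts, which this sketch leaves out.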

With these practices in place, a team can audit a bad answer, identify the served tuple, and restore a known-good release without guessing which hidden input changed.

Mastery Check

1. An offline eval for an April 10 noon deployment event has feature history values: April 9 = 4, April 10 at 09:00 = 5, and April 11 = 8. Production later serves the April 9 value because an online batch job lagged. What should the release controls do?
2. A release uses the same embedding checkpoint online and offline. The offline path normalizes release-history embeddings with l2 normalization, while the online path uses no normalization. Their preprocessing fingerprints differ. What conclusion should the pipeline reach?
3. A prompt candidate improves quality from 0.84 to 0.87. Its average tokens rise from 620 to 790, and p95 latency is 970 ms against a 1000 ms limit. The promotion policy allows at most a 15% token increase. What should CI decide?
4. A candidate prompt runs in shadow on sampled live requests, while users still see the production prompt output. Before moving to a 10% canary, what rule should the team apply?
5. During an incident, Git still declares production -> [email protected], but an emergency controller has routed runtime traffic back to [email protected] and emitted an audit event. What must happen next in a GitOps workflow?
6. A canary's configured aggregation windows show sustained p95_ttft_ms = 910 with a max of 850 and rollback action, plus groundedness_score = 0.79 with a min of 0.82 and pause_for_review action. Toxicity and task completion remain within limits. What should the controller decide?
7. A sticky 5% canary shows groundedness below the review threshold, but it has only 40 judged samples and no latency, error, or safety breach. The rollout policy treats quality-only judge breaches as pause_for_review. What should happen before promotion?
8. A RAG team changes the corpus snapshot while keeping [email protected] and the generator model alias unchanged. Later an engineer must audit a bad answer from a request trace. Which release design supports diagnosis and rollback?
9. A pull request changes both a support prompt and a release-history feature view. Lint and schema checks pass, but the paid eval reads the latest feature values and whichever model production currently uses. What should the pipeline do before publishing?

Next Step
Continue to GPU Serving & Autoscaling

That lesson covers running the versioned models and prompts you now manage under real traffic: continuous batching, paged attention, KV-cache-aware autoscaling, cold-start mitigation, and cost-efficient multi-tenancy on GPU fleets. The release controls you just built make those serving systems safer to operate in production.
