Master advanced MLOps and DevOps patterns for LLM systems: GitOps for prompts and models, feature stores for embedding features, automated rollback on eval regression, shadow traffic, and production-grade model registries.
Your OrderBuddy package tracker chatbot has been stable for weeks on prompt version [email protected] and LoRA adapter v1.9. A teammate adds a small improvement: the bot now pulls the customer's recent order count and average delivery time from a feature store to give more accurate estimates during peak season. The change passes the offline golden set eval, the safety scanner, and the latency budget. It is promoted through staging, a 2% canary, and finally the production alias.
Forty-eight hours later, customers in the western fulfillment region start receiving delivery promises that are off by a full day. The base model weights never changed. The system prompt text never changed. The LoRA weights are identical. Yet the answers degraded.
The root cause was silent training-serving skew in the embedding features. The "recent_order_count" feature used one chunking strategy and normalization in the batch job that built the golden set, but a slightly different preprocessing path in the online feature store that served live traffic. The model registry recorded the adapter and the prompt version, but it had no record of the exact feature computation graph or the timestamped feature values that were present at training time. The basic versioning strategy from the previous lesson was not enough once features and retrieval logic entered the picture.
In Model Versioning & Continuous Deployment, you mastered immutable artifacts, aliases, canary traffic, and automated evaluation gates[1][2]. That foundation is still essential. Advanced MLOps and DevOps for AI systems adds the layers that make those practices scale across teams, prevent skew, and close the feedback loop from production signals back to automatic rollback.
This lesson teaches you how mature LLM teams run GitOps-driven pipelines for prompts and models, use feature stores to keep embeddings consistent, run safe shadow and canary experiments on both prompts and models, and wire observability directly into automated rollback policies.
GitOps treats your prompts, feature definitions, model aliases, and deployment configuration exactly like application code[3]. The single source of truth is a git repository. A merge to the main branch (after passing required checks) is the only way production state ever changes.
A typical LLM GitOps repository looks like this:
```text
llm-ops-repo/
├── prompts/
│   ├── support/
│   │   ├── [email protected]
│   │   ├── [email protected]
│   │   └── tests/
│   │       └── support_golden.jsonl
│   └── policy/
│       └── [email protected]
├── features/
│   └── order_history/
│       ├── feature_view.py        # Feast-style definition
│       └── entity_keys.yaml
├── models/
│   └── order-status-bot/
│       ├── [email protected]
│       ├── [email protected]       # points to registry URI + hash
│       └── promotion-policy.yaml
├── deployments/
│   └── production/
│       └── aliases.yaml           # production -> [email protected]
└── .github/workflows/
    └── llm-promote.yml
```
The pipeline on merge does the following:

- re-runs the golden-set evaluation for the affected prompt, model alias, and the feature definitions frozen in the same commit;
- enforces the gates declared in promotion-policy.yaml: safety scan, latency budget, and token/cost budget;
- registers the resulting artifacts and evaluation reports in the model and prompt registries;
- applies the alias changes declared in deployments/production/aliases.yaml so the live registry converges to the desired state in git.
Because the desired state lives in git, a rollback is simply a revert commit or a manual edit of the alias file followed by a pipeline run. No one logs into a dashboard and clicks "promote." The audit trail is the git history.
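To make that last step concrete, here is a minimal sketch of an alias-sync job the workflow could run once all checks pass. It assumes an MLflow-style registry and an illustrative aliases.yaml layout; the file path and layout are assumptions for this lesson, not a prescribed format.

```python
# Sketch: converge registry aliases to the desired state declared in git.
# Assumes an MLflow-style registry; the aliases.yaml layout shown is illustrative.
import yaml
from mlflow.tracking import MlflowClient


def sync_aliases(path: str = "deployments/production/aliases.yaml") -> None:
    """Point each registry alias at the version declared in git."""
    with open(path) as f:
        # Illustrative layout: {"order-status-bot": {"production": "14", "candidate": "15"}}
        desired = yaml.safe_load(f)

    client = MlflowClient()
    for model_name, aliases in desired.items():
        for alias, version in aliases.items():
            # Idempotent: re-pointing an alias at its current version is a no-op,
            # so a revert commit followed by a pipeline run is a complete rollback.
            client.set_registered_model_alias(model_name, alias, str(version))


if __name__ == "__main__":
    sync_aliases()
```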
💡 Key insight: The most important property of GitOps for LLMs is that evaluation datasets and feature definitions are also versioned in the same repository. When a prompt PR runs the golden set, the exact feature values that will be used in production are frozen at that moment.
Embeddings are the most dangerous source of training-serving skew in modern LLM systems. A RAG pipeline or a personalized prompt may depend on dozens of embedding features: user history summary vectors, document chunk embeddings, category affinity vectors, and time-decayed engagement signals.
A feature store (such as Feast, Tecton, or a custom implementation on top of Redis + S3)[4] solves three problems at once:

- a single, versioned definition of each feature transformation, shared by training, offline evaluation, and online serving, which removes the duplicated preprocessing that causes skew;
- point-in-time correct retrieval, so offline jobs see the feature values that were valid at each example's timestamp instead of today's values;
- low-latency online lookup of those same values at request time, keyed by the entity (here, user_id).
In practice, your feature view for order history might look like this (simplified Feast-style):
```python
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Array, Float32, Int64

# Entity: the join key used to look features up both online and offline.
order_history = Entity(name="user_id", join_keys=["user_id"])

order_history_view = FeatureView(
    name="order_history_embeddings",
    entities=[order_history],
    ttl=timedelta(days=90),
    schema=[
        Field(name="recent_order_count", dtype=Int64),
        Field(name="avg_delivery_days_30d", dtype=Float32),
        Field(name="order_pattern_embedding", dtype=Array(Float32, 768)),
    ],
    source=...,  # batch source definition omitted here
    online=True,
    offline=True,
)
```
At serving time the LLM gateway calls the feature store with the current user_id and a request timestamp. The store returns the vector that was valid at that exact moment, using the same transformation that was used when the golden set for the current model version was built.
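Below is a minimal sketch of that online lookup, assuming a Feast-style store; the repo path is illustrative, and the feature references mirror the view defined above.

```python
# Sketch: online feature lookup in the LLM gateway, assuming a Feast-style store.
from feast import FeatureStore

# Illustrative repo path; in the GitOps layout above it lives under features/.
store = FeatureStore(repo_path="features/order_history")


def fetch_order_features(user_id: str) -> dict:
    """Return the latest materialized feature values for this user."""
    response = store.get_online_features(
        features=[
            "order_history_embeddings:recent_order_count",
            "order_history_embeddings:avg_delivery_days_30d",
            "order_history_embeddings:order_pattern_embedding",
        ],
        entity_rows=[{"user_id": user_id}],
    )
    # Served through the same transformation graph as the offline golden-set build,
    # which is what prevents the OrderBuddy-style skew.
    return response.to_dict()
```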
Without a feature store, teams copy embedding code between training notebooks and the serving service. Over time the two copies diverge, and quality silently collapses exactly as in the OrderBuddy story.
Prompts deserve the same treatment as model weights. A one-line change in a system prompt can shift refusal rates, tone, or factual grounding more than a full fine-tune. Therefore, prompts live in their own registry with semantic versions, test suites, and promotion policies.
A prompt registry entry typically contains:

- the full prompt template text and its semantic version;
- the test suite (golden examples and judge rubrics) the version was evaluated with;
- the model, adapter, and feature view versions it is compatible with;
- the promotion policy that gates it, plus the owner responsible for sign-off.
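As an illustration, such an entry could be represented with a plain dataclass; the field names below are assumptions made for this lesson, not any particular vendor's schema.

```python
# Sketch: one prompt registry entry. Field names are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str                   # prompt family, e.g. "support"
    version: str                # semantic version of this prompt text
    template: str               # the full prompt text, immutable once registered
    golden_set_path: str        # eval suite frozen together with this version
    judge_rubric_path: str      # LLM-judge rubric used by the CI gates
    compatible_model: str       # model/adapter version it was evaluated against
    feature_view_version: str   # feature definitions the prompt expects
    promotion_policy_path: str  # gate configuration, e.g. promotion-policy.yaml
    owner: str                  # who signs off on promotions
```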
When a prompt PR merges, the CI pipeline:

1. runs the golden set against the candidate prompt with the frozen feature values and the current production model alias;
2. runs the safety scanner and checks refusal, toxicity, and groundedness scores against the promotion policy;
3. checks latency, token count, and cost budgets against the baseline version;
4. on success, registers the new version in the prompt registry and updates the candidate alias.

The same pipeline supports shadow traffic for prompts. The gateway receives the live request, calls the production prompt version for the user, and in parallel (or on a sampled fraction) calls the candidate prompt version against the same model and features. The shadow responses are logged but never shown to users. After a few thousand shadow requests, the team compares judge scores, latency, and cost. Only then is the candidate promoted to a small sticky canary.
Shadow traffic is especially powerful for prompts because the blast radius of a bad prompt is often higher than a bad model weight change, yet the computational cost of running the shadow is only the extra LLM call.
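The sketch below shows how a gateway might sample shadow calls; call_model, the alias names, and the 5% sampling rate are placeholders for your own client and policy, not a specific framework's API.

```python
# Sketch: prompt shadowing in the gateway. Helper names and rates are illustrative.
import logging
import random
from concurrent.futures import ThreadPoolExecutor

SHADOW_RATE = 0.05  # fraction of live requests that also get a shadow call
_executor = ThreadPoolExecutor(max_workers=4)
log = logging.getLogger("prompt_shadow")


def call_model(prompt_alias: str, features: dict, user_message: str) -> str:
    """Placeholder: render the versioned prompt and call the serving endpoint."""
    return f"[{prompt_alias}] dummy response"


def handle_request(user_message: str, features: dict) -> str:
    # The user always receives the production prompt's answer.
    production_answer = call_model("support@production", features, user_message)

    if random.random() < SHADOW_RATE:
        def shadow() -> None:
            # Same model, same features, candidate prompt; logged, never shown.
            candidate_answer = call_model("support@candidate", features, user_message)
            log.info(
                "shadow_pair",
                extra={"prod": production_answer, "cand": candidate_answer},
            )

        _executor.submit(shadow)  # runs off the hot path

    return production_answer
```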
The fastest way to lose user trust is to let a degraded version stay in production while engineers investigate. Once you have reliable online signals (from the observability system you built in the previous lesson), you can close the loop.
A production controller watches:

- p95 time-to-first-token and end-to-end latency for the current alias;
- groundedness scores from an LLM judge running on a sampled live stream;
- toxicity and safety-refusal rates;
- task completion rate and other business-level success signals.
When any signal crosses a policy threshold for a sustained window, the controller automatically moves the production alias back to the last known good version. The entire operation takes under 30 seconds because only the alias changes; no containers are restarted and no weights are reloaded.
Here is a simplified policy in pseudocode:
```python
from typing import Dict


class RollbackPolicy:
    def __init__(self):
        # Thresholds are evaluated over sliding windows of aggregated metrics.
        self.thresholds = {
            "p95_ttft_ms": {"max": 850, "window": "5m"},
            "groundedness_score": {"min": 0.82, "window": "15m"},
            "toxicity_rate": {"max": 0.004, "window": "10m"},
            "task_completion_rate": {"min": 0.91, "window": "30m"},
        }

    def should_rollback(self, metrics: Dict[str, float], current_alias: str) -> bool:
        for metric, rule in self.thresholds.items():
            value = metrics.get(metric)
            if value is None:
                continue  # signal not available yet for this window
            if "max" in rule and value > rule["max"]:
                return True
            if "min" in rule and value < rule["min"]:
                return True
        return False
```
The controller runs this check every minute against the aggregated metrics for the current alias. Because the registry preserves full history, the rollback is always safe and fully reproducible[5].
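A sketch of that controller loop is shown below; it assumes an MLflow-style registry for the alias flip, and fetch_window_metrics and last_known_good are illustrative stubs standing in for your observability store and promotion history.

```python
# Sketch: the rollback controller. Stubs and the registry client are assumptions.
import time

from mlflow.tracking import MlflowClient


def fetch_window_metrics(alias: str) -> dict:
    """Placeholder: pull windowed, per-alias aggregates from the observability store."""
    return {}


def last_known_good(model_name: str) -> str:
    """Placeholder: newest version whose live metrics never breached policy."""
    return "13"


def run_controller(policy: RollbackPolicy, model_name: str = "order-status-bot") -> None:
    client = MlflowClient()
    while True:
        metrics = fetch_window_metrics("production")
        if policy.should_rollback(metrics, current_alias="production"):
            # Alias flip only: no container restart, no weight reload.
            client.set_registered_model_alias(
                model_name, "production", last_known_good(model_name)
            )
        time.sleep(60)  # the check runs every minute
```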
⚠️ Common mistake: Rolling back only on error rate or latency while ignoring slow quality regressions. A 3% drop in groundedness over two days will not trigger a 5xx spike, yet it destroys user confidence. LLM judges running on a sampled live stream are one of the best early warning signals for this class of regression.
The ultimate goal of advanced MLOps is complete provenance. Given any production response, an engineer must be able to answer in seconds:

- which model weights and LoRA adapter produced it;
- which prompt version was rendered, and from which git commit it was promoted;
- which feature view version and which exact, timestamped feature values were injected;
- which evaluation run and promotion decision put that combination into production.
All of this information is stored in the model registry entry and cross-linked to the feature store audit log and the git commit that triggered the promotion. When an incident occurs, the first step is never "guess which prompt changed." It is "look up the exact registry entry for the request timestamp and read the immutable lineage."
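One way to picture the cross-linking is a single lineage record per request; the dataclass below is an illustrative sketch of the fields involved, not any registry's actual schema.

```python
# Sketch: a per-request lineage record. Field names are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    request_id: str
    model_version: str         # registry URI + content hash of the base weights
    adapter_version: str       # LoRA adapter, e.g. "v1.9"
    prompt_version: str        # semantic version from the prompt registry
    feature_view_version: str  # feature computation graph in effect
    feature_snapshot_ts: str   # timestamp of the feature values actually served
    git_commit: str            # commit that triggered the promotion
    eval_run_id: str           # offline/canary evaluation run that gated it
```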
Before moving on, make sure you can answer these questions about the OrderBuddy scenario.
Question 1: The new "holiday shipping" feature used a fresh embedding for average delivery days. The offline eval used a snapshot from April 10. Production traffic on April 12 saw different values because the batch feature job had not yet run for the new orders. Which component should have prevented this mismatch?
Answer: The feature store with point-in-time correctness. The offline eval job should have requested features as of the exact timestamp of the golden set examples, and the online path should have used the same view. The registry should have recorded the feature view version and the snapshot timestamp alongside the model and prompt versions.
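For concreteness, a point-in-time correct offline read with a Feast-style store looks roughly like this; the repo path, feature reference, and timestamps are illustrative.

```python
# Sketch: point-in-time correct feature retrieval for the offline eval.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="features/order_history")

# Each golden-set example carries the timestamp it was captured at.
entity_df = pd.DataFrame(
    {
        "user_id": ["u_123", "u_456"],
        "event_timestamp": pd.to_datetime(
            ["2024-04-10T12:00:00Z", "2024-04-10T12:05:00Z"]
        ),
    }
)

# Features are joined as of each row's event_timestamp, not "latest",
# so the eval sees exactly the values production served at that moment.
eval_features = store.get_historical_features(
    entity_df=entity_df,
    features=["order_history_embeddings:avg_delivery_days_30d"],
).to_df()
```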
Question 2: A teammate pushes a prompt change that improves helpfulness on the golden set but increases average token count by 18%. The canary shows stable latency because the traffic is small. After full promotion, costs spike. What should have happened earlier?
Answer: The CI pipeline and the promotion policy must include cost and token budget checks as first-class gates. Cost regression is as serious as quality regression for production LLM systems. The policy should have blocked promotion or required explicit sign-off from the cost owner.
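A cost gate can be as simple as comparing candidate token usage against the baseline; the 10% ceiling and function name below are illustrative, not a prescribed threshold.

```python
# Sketch: a token/cost budget gate for the CI pipeline. Threshold is illustrative.
def within_token_budget(
    baseline_avg_tokens: float,
    candidate_avg_tokens: float,
    max_relative_increase: float = 0.10,
) -> bool:
    """Return True if the candidate's average token count stays within budget."""
    if baseline_avg_tokens <= 0:
        return True  # no baseline yet, nothing to compare against
    increase = (candidate_avg_tokens - baseline_avg_tokens) / baseline_avg_tokens
    return increase <= max_relative_increase


# The OrderBuddy case: an 18% increase fails a 10% budget and blocks promotion.
assert not within_token_budget(420.0, 420.0 * 1.18)
```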
Question 3: During a canary, the LLM judge score for groundedness drops 4% on the canary variant, but the drop is within the statistical noise of the current sample size. The policy does not trigger automatic rollback. What is the correct next action?
Answer: Increase the canary sample size or extend the observation window before deciding. Never promote on marginal signals. At the same time, the team should investigate the specific failure cases the judge flagged to understand whether the regression is real or an artifact of the judge prompt itself.
With these practices in place, your LLM systems become as reliable and auditable as the best traditional software services, while still moving at the speed that modern AI teams require.
In the next lesson, you will learn how to run the versioned models and prompts you now manage with advanced MLOps at massive scale: continuous batching, paged attention, KV-cache-aware autoscaling, cold-start mitigation, and cost-efficient multi-tenancy on GPU fleets. The MLOps practices you just mastered are what make those serving systems safe to operate in production.
Continuous Delivery for Machine Learning. Sato, D., Wider, A., & Windheuser, C. · 2019
Challenges in Deploying Machine Learning: A Survey of Case Studies. Paleyes, A., Urma, R. G., & Lawrence, N. D. · 2022 · ACM Computing Surveys
Hidden Technical Debt in Machine Learning Systems. Sculley, D., et al. · 2015
Model Registry Workflows | MLflow AI Platform. MLflow · 2026
Feast: Production Feature Store for Machine Learning. Feast Contributors · 2024
GitOps for Modern MLOps: Declarative Deployments with ArgoCD and Flux. CNCF MLOps Working Group · 2025
Prompt Registries and Versioned PromptOps for Production LLMs. LangSmith, PromptLayer & Helicone Teams · 2026
Automated Rollback Strategies for ML Systems Using Online Evaluation. Sculley et al. & Production ML Case Studies · 2024