Design AI lab systems with clear goals, scale math, APIs, data models, overload behavior, permissions, eval gates, and operational debugging paths.
AI lab system design rounds test whether you can turn ambiguous model-product requirements into a reliable backend. The winning answer is rarely the most complex architecture. It is the design that names the product goal, sizes the hard constraint, then adds queues, caches, model routing, evals, permissions, or human review only where requirements force them.
Use ecommerce, delivery, and customer-support examples when you need a concrete neutral domain: order lookup, shipment tracking, returns support, catalog search, and internal agent workflows have real latency, privacy, and support-debug constraints.
Open with:
I will keep the first design simple, size the constraints early, then add queues, caches, sharding, model routing, or eval gates only when the requirement forces them.
Then follow this order:
| Prompt | Core design pressure | Common miss |
|---|---|---|
| Scalable web crawler | frontier, politeness, dedupe, retry policy | ignoring per-host backpressure |
| Model API gateway | keys, workspaces, rate limits, request IDs, model routing | no support/debug path |
| Inference scheduler | queueing, batching, latency, fairness, overload | optimizing throughput before latency SLO |
| Long-running coding agents | durable tasks, tools, checkpoints, permissions | unclear recovery and cancellation |
| Permission-aware retrieval | ACLs, freshness, deletion, tenant isolation | retrieving first and filtering later |
| Evaluation/safety monitor | offline evals, incidents, red teams, launch gates | treating eval as a dashboard only |
The gateway is the front door. It authenticates API keys, resolves workspace and organization limits, estimates request cost, routes to a model or queue, and emits a request ID that support can follow.
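A minimal sketch of the admission step described above, assuming a per-workspace token bucket measured in model tokens. The names `TokenBucket`, `admit`, the `req_` prefix, and the worst-case cost estimate are illustrative assumptions, not part of any specific gateway. Every decision, including a rejection, carries a request ID so support can trace it.

```python
import uuid
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-workspace budget measured in model tokens (hypothetical unit)."""
    capacity: float           # maximum burst, in tokens
    refill_per_second: float  # sustained rate, in tokens per second
    level: float              # tokens currently available
    updated_at: float         # time of the last refill, in seconds

    def try_spend(self, cost: float, now: float) -> bool:
        # Refill for the elapsed time, capped at capacity, then spend if possible.
        elapsed = max(0.0, now - self.updated_at)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.updated_at = now
        if cost > self.level:
            return False
        self.level -= cost
        return True


def admit(bucket: TokenBucket, input_tokens: int, max_output_tokens: int, now: float) -> dict:
    """Return a gateway decision; rejected requests also get a request ID."""
    request_id = f"req_{uuid.uuid4().hex}"
    estimated_cost = input_tokens + max_output_tokens  # assumed worst-case estimate
    status = 200 if bucket.try_spend(estimated_cost, now) else 429
    return {"request_id": request_id, "status": status, "estimated_cost": estimated_cost}


bucket = TokenBucket(capacity=3_000, refill_per_second=1_000, level=3_000, updated_at=0.0)
first = admit(bucket, input_tokens=1_200, max_output_tokens=450, now=0.0)
second = admit(bucket, input_tokens=1_200, max_output_tokens=450, now=0.0)
third = admit(bucket, input_tokens=1_200, max_output_tokens=450, now=2.0)
print(first["status"], second["status"], third["status"])  # 200 429 200
```

The second request is rejected immediately rather than queued, and the third succeeds once the bucket has refilled. In an interview, the point to make is that the rejection is explicit and traceable.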
Design checklist:
429 or degraded fallback, not silent queue growth.

Use Python to sanity-check rate math before drawing capacity boxes:
```python
# Assumed traffic profile for the sizing exercise.
requests_per_minute = 4_000
avg_input_tokens = 1_200
avg_output_tokens = 450

# Total tokens (input plus output) the system must handle.
tokens_per_minute = requests_per_minute * (avg_input_tokens + avg_output_tokens)
tokens_per_second = tokens_per_minute / 60

print("tokens_per_minute:", tokens_per_minute)
print("tokens_per_second:", round(tokens_per_second))
```

```text
tokens_per_minute: 6600000
tokens_per_second: 110000
```

For LLM serving, the scheduler is where latency, cost, and fairness meet. Production serving systems use request scheduling and in-flight batching to balance throughput and latency under GPU memory constraints.[1] Mention these metrics:
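As a rough illustration of how scheduler metrics might be derived from request logs, the sketch below computes queue wait, time to first token, and output throughput. The choice of these three figures, the `RequestTrace` fields, and the sample timestamps are all assumptions made for the example. It uses a nearest-rank percentile, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    # All timestamps are seconds since an arbitrary start (hypothetical log fields).
    arrived_at: float
    scheduled_at: float    # when the request joined a running batch
    first_token_at: float
    finished_at: float
    output_tokens: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up traces: the last two requests waited about a second in the queue.
traces = [
    RequestTrace(0.0, 0.1, 0.4, 2.4, 100),
    RequestTrace(0.2, 0.3, 0.7, 3.7, 150),
    RequestTrace(0.5, 1.5, 2.0, 4.0, 80),
    RequestTrace(0.6, 1.6, 2.2, 6.2, 200),
]

queue_wait = [t.scheduled_at - t.arrived_at for t in traces]
time_to_first_token = [t.first_token_at - t.arrived_at for t in traces]
total_output = sum(t.output_tokens for t in traces)
window = max(t.finished_at for t in traces) - min(t.arrived_at for t in traces)

print("p50 queue wait (s):", round(percentile(queue_wait, 50), 2))
print("p95 time to first token (s):", round(percentile(time_to_first_token, 95), 2))
print("output tokens/s:", round(total_output / window, 1))
```

Reporting tail latency alongside throughput makes the trade-off visible: a larger batch can raise tokens per second while pushing the p95 time to first token past the SLO.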
For enterprise retrieval, the critical rule is: do not retrieve private data and filter it after generation. Permission constraints must be part of candidate selection, ranking, and auditing.
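The rule can be sketched as a retrieval function that applies tenant, deletion, and ACL checks before any scoring happens. Everything here is hypothetical: the `Chunk` fields, the group-based ACL model, and the toy keyword-overlap ranker stand in for whatever index and ranker a real system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str
    allowed_groups: frozenset[str]
    text: str
    deleted: bool = False


def retrieve(query: str, tenant_id: str, user_groups: set[str],
             index: list[Chunk], top_k: int = 3) -> tuple[list[Chunk], dict]:
    # 1. Candidate selection: only chunks the caller may read are ever scored.
    candidates = [
        c for c in index
        if c.tenant_id == tenant_id
        and not c.deleted
        and c.allowed_groups & user_groups
    ]
    # 2. Ranking: toy keyword-overlap score standing in for a real ranker.
    terms = set(query.lower().split())
    ranked = sorted(
        candidates,
        key=lambda c: len(terms & set(c.text.lower().split())),
        reverse=True,
    )[:top_k]
    # 3. Auditing: record who asked and which documents were returned.
    audit = {"tenant_id": tenant_id, "groups": sorted(user_groups),
             "returned": [c.doc_id for c in ranked]}
    return ranked, audit


index = [
    Chunk("d1", "acme", frozenset({"support"}), "return policy for damaged orders"),
    Chunk("d2", "acme", frozenset({"finance"}), "return fraud investigation notes"),
    Chunk("d3", "globex", frozenset({"support"}), "return policy for orders"),
    Chunk("d4", "acme", frozenset({"support"}), "old return policy", deleted=True),
]
results, audit = retrieve("return policy", "acme", {"support"}, index)
print(audit["returned"])  # ['d1']
```

Because the finance-only, cross-tenant, and deleted chunks never reach the ranker, they cannot leak into a prompt or a generated answer.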
Architecture pieces:
Long-running agent infrastructure has to persist intent, tool calls, artifacts, checkpoints, logs, and permissions. The main design risk is not just failed execution. It is uncontrolled execution.
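A minimal sketch of controlled execution, assuming a task record that is checkpointed after every step. The `AgentTask` fields, the in-memory `store` dict, and the tool names are illustrative stand-ins for a durable database and a real tool registry. The loop checks for cancellation and tool permissions before each step, so a task stops rather than running an unapproved tool.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AgentTask:
    task_id: str
    intent: str
    allowed_tools: list[str]
    steps: list[dict]            # planned tool calls
    next_step: int = 0           # checkpoint: index of the next step to run
    log: list[str] = field(default_factory=list)
    status: str = "running"      # running | cancelled | blocked | done


def run(task: AgentTask, tools: dict, cancelled: set[str], store: dict) -> AgentTask:
    while task.status == "running" and task.next_step < len(task.steps):
        if task.task_id in cancelled:        # check cancellation before every step
            task.status = "cancelled"
            break
        step = task.steps[task.next_step]
        if step["tool"] not in task.allowed_tools:
            task.status = "blocked"          # stop instead of running an unapproved tool
            task.log.append(f"denied {step['tool']}")
            break
        result = tools[step["tool"]](step["arg"])
        task.log.append(f"{step['tool']} -> {result}")
        task.next_step += 1
        store[task.task_id] = json.dumps(asdict(task))  # checkpoint after each step
    if task.status == "running":
        task.status = "done"
    store[task.task_id] = json.dumps(asdict(task))      # persist the final state
    return task


tools = {"read_file": lambda path: f"read {path}", "shell": lambda cmd: f"ran {cmd}"}
store: dict = {}
task = AgentTask(
    task_id="t1",
    intent="fix failing test",
    allowed_tools=["read_file"],
    steps=[{"tool": "read_file", "arg": "test_orders.py"},
           {"tool": "shell", "arg": "rm -rf build"}],
)
run(task, tools, cancelled=set(), store=store)
print(task.status, task.next_step)  # blocked 1

# Recovery: a new worker rebuilds the task from the last checkpoint.
restored = AgentTask(**json.loads(store["t1"]))
print(restored.log)
```

The restored task shows exactly which step completed and which was denied, which is the information a recovery or support path needs.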
Cover: