AI Lab System Design Interview

Design AI lab systems with clear goals, scale math, APIs, data models, overload behavior, permissions, eval gates, and operational debugging paths.

A system-design round asks you to turn ambiguous model-product requirements into a reliable backend. A strong answer is rarely the most complex architecture. It names the product goal, sizes the hard constraint, then adds queues, caches, model routing, eval gates, permissions, or human review only where requirements force them.

When you need a concrete, neutral domain, draw examples from codebases, model serving, RAG, support automation, and deployment. Code search, incident summarization, policy-gated billing credits, eval dashboards, and internal agent workflows all have real latency, privacy, and debugging constraints.

Figure: AI lab system design answer map. Strong design answers move left to right: goal, requirements, sizing, API, data model, request path, reliability, permissions, observability, rollout, and tradeoffs.

The design round script

Open with:

I will keep the first design simple, size the constraints early, then add queues, caches, sharding, model routing, or eval gates only when the requirement forces them.

Then follow this order:

Goal: who uses it and what success means.
Requirements: functional and non-functional.
Scale: QPS, tokens, tenants, documents, latency, retention.
API: external contract and important internal interfaces.
Data model: entities, indexes, isolation, retention.
Architecture: simplest request path first.
Reliability: retries, idempotency, backpressure, overload, failover.
Safety/security: permissions, audit, abuse controls, rollback.
Observability: metrics, logs, traces, support views.
Rollout: beta gates, canaries, eval gates, kill switches.

Prompt bank

Prompt	Core design pressure	Common miss
Scalable web crawler	frontier, politeness, dedupe, retry policy	ignoring per-host backpressure
Model API gateway	keys, workspaces, rate limits, request IDs, model routing	no support/debug path
Inference scheduler	queueing, batching, latency, fairness, overload	optimizing throughput before latency SLO
Long-running coding agents	durable tasks, tools, checkpoints, permissions	unclear recovery and cancellation
Permission-aware retrieval	ACLs, freshness, deletion, tenant isolation	retrieving first and filtering later
Evaluation/safety monitor	offline evals, incidents, red teams, launch gates	treating eval as a dashboard only

Design pattern taxonomy

Most frontier AI/backend prompts are combinations of these patterns. Identify the dominant pressure before drawing boxes.

Pattern	Prompt signal	Architecture moves	Follow-up pressure
Model gateway	multi-provider, teams, API keys, quotas	auth, entitlements, route policy, quota buckets, request log	streaming, fallback semantics, budget caps, support replay
Inference scheduler	latency, batching, GPUs, overload	admission queue, batcher, worker pool, KV/cache accounting, fairness	tail latency, starvation, preemption, 429 vs 503 policy
Permission-aware retrieval	enterprise docs, ACLs, deletion, citations	source connectors, ACL snapshots, filtered retrieval, audit trail	revocation SLO, fail closed, hybrid search, stale index
Agent execution platform	long-running tasks, tools, repo access	task state machine, sandbox, tool policy, event log, artifacts	cancellation, retries, secret handling, human review
Eval and rollout gate	quality launch, regressions, red team	offline evals, golden sets, canary gates, rollback triggers	slice failures, noisy judges, metric ownership
Data ingestion platform	connectors, freshness, normalization	ingestion jobs, versioned records, dead-letter queue, backfill	schema drift, reprocessing, dedupe, deletion
Observability and support	request IDs, incidents, "why did this happen?"	traces, decision records, support views, replayable metadata	privacy-safe debugging, retention, sampling
Abuse and safety control	policy, misuse, irreversible actions	policy engine, rate limits, review queues, kill switches	false positives, bypass attempts, emergency disable

Use this table to avoid architecture soup. A model gateway prompt doesn't need a vector database unless the product asks for retrieval. A retrieval prompt doesn't need autonomous agent planning unless the user asks the system to take actions.

Follow-up response bank

Interviewers often stress the first design with a new constraint. Answer by naming the boundary you'll change.

Follow-up	Good move	Bad move
"Traffic spikes 10x"	admission control, queue SLO, tier fairness, explicit `429` or `503` cause	unlimited queues
"Permissions change quickly"	ACL freshness SLO, tombstones, fail-closed sensitive sources	retrieve first, filter after generation
"Provider is down"	policy-approved fallback, circuit breaker, surfaced degraded mode	silently change model behavior
"Users need cancellation"	durable cancel flag, cooperative checks, sandbox termination	best-effort UI button only
"Support asks why"	request ID, policy version, route decision, retrieved IDs, trace	raw logs with no decision record
"Eval passes but users complain"	slice analysis, online canary metrics, incident cases into regression suite	argue offline eval is enough
"Costs doubled"	token accounting, cache hit tracking, model route policy, budget alerts	vague autoscaling

Whiteboard scorecard

Before the last five minutes, your design should contain:

One explicit API.
One durable data model.
One request path.
One overload behavior.
One security or permission boundary.
One observability story.
One rollout or eval gate.
One tradeoff with a reversal signal.

If any item is missing, add it before adding another component.

Prompt expansion matrix

Use this matrix to avoid over-practicing one design shape. A strong prep set covers control planes, data planes, eval loops, and human operations.

Prompt family	Clarify first	Core architecture	Stress follow-ups
Multi-model gateway	streaming, spend policy, beta access	auth -> quota -> route policy -> provider adapter -> audit	provider outage, 10x spike, fallback correctness
Inference scheduler	latency SLO, model size, GPU pool	admission queue -> batcher -> workers -> KV accounting	fairness, preemption, hot tenant, overload
Enterprise retrieval	revocation SLO, source types, citations	ingestion -> ACL snapshots -> filtered retrieval -> answer trace	deletion, stale ACLs, hybrid search, audit
Coding-agent service	tool permissions, write access, job length	task API -> sandbox -> event log -> artifacts -> review	cancellation, retries, secrets, stuck tools
Eval platform	launch criteria, judges, slices	dataset registry -> runners -> scorer -> report -> gate	noisy labels, drift, red-team cases, ownership
Data ingestion platform	freshness, schema drift, backfills	connectors -> normalized records -> version store -> index	dedupe, reprocessing, tombstones, source outage
Safety policy layer	allowed actions, review threshold	request context -> policy engine -> tool gate -> audit	false positives, emergency block, bypass attempts
Observability platform	support questions, retention, privacy	trace IDs -> logs/metrics/events -> support view -> replay	PII, sampling, high cardinality, incident workflow
Realtime voice or chat	latency budget, turn-taking, fallback	session gateway -> streaming model -> tool loop -> handoff	interruption, partial output, moderation, cost
Experiment system	unit of assignment, ramp plan, metrics	assignment service -> config -> logging -> analysis -> rollback	interference, sample ratio mismatch, guardrails

Scale math checklist

Every design should include one small calculation. It doesn't need perfect precision; it needs to expose the bottleneck.

System	Minimum math
Gateway	requests/minute, tokens/minute, worst-case output cap
Retrieval	documents, chunks/document, embedding storage, update rate
Scheduler	arrival rate, average service time, queue wait, GPU memory
Agent service	concurrent jobs, sandbox time, log/artifact storage, retry budget
Eval platform	examples per suite, runs per release, judge/model cost
Voice/chat	p95 latency budget split across network, model, tools, synthesis
Ingestion	source QPS, backfill duration, dedupe key cardinality

Use this phrasing:

I will size the constraint that most affects the design. If that assumption changes, the architecture boundary I would revisit is X.
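
As one instance of the retrieval row in the checklist above, a rough embedding-storage estimate might look like the sketch below. Every input (document count, chunks per document, embedding dimension, float32 storage) is an illustrative assumption, not a figure from a real system.

```python
# All inputs are illustrative assumptions for interview-style sizing.
documents = 2_000_000
chunks_per_document = 12
embedding_dimensions = 1_024
bytes_per_dimension = 4  # assumes float32 vectors with no compression

chunks = documents * chunks_per_document
embedding_bytes = chunks * embedding_dimensions * bytes_per_dimension
embedding_gib = embedding_bytes / 1024**3

print("chunks:", chunks)
print("embedding_gib:", round(embedding_gib, 1))
```

Under these assumptions the raw vectors come to roughly 92 GiB. If the document count grew 10x, index sharding is the boundary to revisit.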

Decision log habit

When a design has many possible components, keep a visible decision log:

Decision	Chosen	Rejected	Why	Reversal signal
Queue placement	before provider call	inside every adapter	one overload policy	adapter-specific SLO needed
Retrieval filter	before rerank/generation	post-generation filter	privacy fails closed	none for sensitive docs
Fallback model	policy-gated	automatic on any error	behavior may change	explicit customer opt-in
Agent writes	human-reviewed	direct writes	irreversible action risk	narrow, reversible tool scope

This keeps the conversation inspectable. Interviewers can disagree with a choice and still see that the choice was deliberate.

Clarification prompt patterns

Ask questions that change the design. Avoid long discovery interviews. Pick the three questions most likely to affect API, state, scale, or safety.

Prompt family	High-signal clarifying questions
Gateway	Are responses streamed? Which limits are hard stops? Which route changes require user opt-in?
Scheduler	What latency SLO matters: time to first token, full completion, or queue wait? Are tenants isolated by tier?
Retrieval	What revocation SLO is required? Should unknown ACL freshness fail closed? Are citations mandatory?
Agent platform	Which actions are irreversible? Can jobs push branches, or only produce patches? What must cancellation guarantee?
Eval platform	Which metric blocks launch? Who owns bad labels? How are incident cases promoted into regression suites?
Ingestion	What freshness target matters? Are deletions hard deletes, tombstones, or retention-window deletes?
Voice/chat	What is the p95 latency target? Is barge-in required? What happens when a tool call is slow?
Observability	Who asks "why did this happen?" Support, safety, billing, or engineering? What data must be redacted?

Use this opener:

I will ask three questions because they change the design: latency or scale, permission or safety, and debug or rollout.

45-minute board plan

Practice with a visible clock. A strong design round leaves time for follow-ups instead of spending 30 minutes drawing boxes.

Time	Output
0-4 min	goal, users, success metric, top risks
4-9 min	functional and non-functional requirements
9-14 min	one scale calculation that exposes the bottleneck
14-20 min	API and durable data model
20-28 min	request path with the smallest architecture that works
28-35 min	reliability, overload, permissions, and support/debug flow
35-40 min	rollout, eval gate, and rollback path
40-45 min	tradeoffs, reversal signals, and interviewer follow-ups

If the interviewer interrupts early, jump to the dominant constraint:

The part that most changes the design is [the dominant constraint]. I will size that first, then show the request path it forces.

Design debrief rubric

After each mock design, score the answer on evidence, not on whether the diagram looked impressive.

Signal	0	1	2
Goal	user and success metric unclear	goal named	goal tied to product risk or operator need
Requirements	list is generic	key functions named	tradeoffs and non-goals named
Scale math	no calculation	one rough estimate	estimate changes an architecture choice
API	hand-wavy endpoints	basic contract	contract includes errors, IDs, and auth context
Data model	boxes only	core entities	retention, versioning, and isolation considered
Request path	too broad	happy path clear	failure and support path included
Overload	queues forever	generic rejection	admission policy, fairness, degraded mode, and correct `429` vs `503` ownership
Permissions	mentioned late	auth boundary named	fail-closed behavior and audit path included
Observability	dashboard language	metrics listed	request ID, traces, decision records, and owner actions
Rollout	"canary" only	staged launch	eval gates, rollback trigger, and incident feedback loop
Tradeoff	one obvious choice	rejected option named	downside, mitigation, and reversal signal named

Target 18 or more out of a possible 22 (11 signals, 2 points each) before treating a prompt family as ready. A lower score means you should repeat the same prompt with a different product surface.

Mock design prompts

Treat each prompt like a 45-minute design round. First write requirements, scale math, API, data model, request path, failure modes, and rollout plan. Then open the solution guide.

Prompt 1: model gateway for enterprise teams

Design an API gateway for teams calling multiple LLM providers through one company platform.

Prompt details:

Each organization has workspaces, users, API keys, and model access rules.
The gateway must enforce requests/minute, tokens/minute, and monthly spend limits.
Support needs a request ID that can explain which route, model, policy decision, and overload state happened.
Some models are beta-only and must be gated.
Traffic can spike 10x during customer-support incidents.

Clarifying questions to ask:

Are clients streaming responses, batch jobs, or both?
Should spend limits be hard stops, soft alerts, or tier-dependent?
Which decision must support explain first: auth failure, quota failure, route choice, or provider failure?

Solution guide

Strong answer shape:

Goal: reliable multi-model access with debuggable controls.
API: POST /v1/responses, GET /v1/requests/{id}, admin endpoints for keys and limits.
Data model: organization, workspace, key, route policy, model entitlement, usage bucket, request log.
Request path: gateway -> auth -> quota estimate -> route policy -> admission queue -> provider adapter -> stream.
Overload: return 429 with Retry-After when this tenant, key, or client exceeds its limit. Return 503 with a retry hint when healthy callers can't be admitted because fleet or provider capacity is exhausted. Route to an approved fallback only when product policy allows the behavior change.
Observability: request ID, model, token estimate, queue wait, provider latency, error class, policy version.
Rollout: canary new routes, kill switch beta models, regression tests for auth and quota bypass.

Common misses: no support path, no versioned policy record, no overload rejection, and no separation between authentication and authorization.
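
The fallback rule in the overload bullet can be sketched as a small routing function. This is a minimal sketch: `RoutePolicy`, its fields, and the decision strings are hypothetical names chosen for illustration, not part of any real gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePolicy:
    # Hypothetical policy record; field names are assumptions.
    primary_model: str
    approved_fallbacks: tuple[str, ...]
    customer_opted_in: bool


def choose_route(policy: RoutePolicy, healthy_models: set[str]) -> tuple[str, str]:
    """Return (decision, model). Decision is 'primary', 'fallback', or 'reject_503'."""
    if policy.primary_model in healthy_models:
        return ("primary", policy.primary_model)
    # Fall back only when policy allows the behavior change.
    if policy.customer_opted_in:
        for model in policy.approved_fallbacks:
            if model in healthy_models:
                return ("fallback", model)
    # No silent model swap: surface the provider outage instead.
    return ("reject_503", "")


opted_in = RoutePolicy("model-a", ("model-b",), customer_opted_in=True)
opted_out = RoutePolicy("model-a", ("model-b",), customer_opted_in=False)

print(choose_route(opted_in, {"model-a", "model-b"}))
print(choose_route(opted_in, {"model-b"}))
print(choose_route(opted_out, {"model-b"}))
```

The decision string is worth logging next to the request ID so support can tell a degraded-mode answer from a normal one.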

Follow-up guide

If asked about streaming, reserve quota from an estimate before admission, cap output length, then reconcile actual tokens at the end of the stream. If asked about fairness during a 10x spike, split queues by organization or tier so one incident can't starve everyone else. If asked about beta models, answer with a versioned entitlement check and a kill switch:

The gateway should log policy_version, entitlement_id, route_id, and quota_bucket_id on every request so support can explain both accepted and rejected traffic.
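
The fairness answer above (split queues by organization or tier) can be illustrated with a round-robin queue. The sketch below is in-memory only and assumes one FIFO queue per tenant; `TenantFairQueue` and its methods are hypothetical names.

```python
from collections import deque
from typing import Optional


class TenantFairQueue:
    """Round-robin across per-tenant FIFO queues (illustrative, in-memory only)."""

    def __init__(self, max_per_tenant: int) -> None:
        self.max_per_tenant = max_per_tenant
        self.queues: dict[str, deque] = {}
        self.order: deque = deque()  # tenants that currently have pending work

    def enqueue(self, tenant: str, request_id: str) -> bool:
        queue = self.queues.setdefault(tenant, deque())
        if len(queue) >= self.max_per_tenant:
            return False  # the caller would map this to a tenant-level 429
        if not queue:
            self.order.append(tenant)
        queue.append(request_id)
        return True

    def dequeue(self) -> Optional[str]:
        if not self.order:
            return None
        tenant = self.order.popleft()
        queue = self.queues[tenant]
        request_id = queue.popleft()
        if queue:
            self.order.append(tenant)  # rotate the tenant to the back
        return request_id


fair_queue = TenantFairQueue(max_per_tenant=3)
for request_id in ["a1", "a2", "a3"]:
    fair_queue.enqueue("tenant-a", request_id)
fair_queue.enqueue("tenant-b", "b1")

served = [fair_queue.dequeue() for _ in range(4)]
print(served)
```

Even though tenant-a arrived first with three requests, tenant-b is served second. A tiered design could weight the rotation instead of treating tenants equally.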

Prompt 2: permission-aware enterprise retrieval

Design retrieval for an internal assistant that answers employee questions from company documents.

Prompt details:

Documents come from multiple systems with different ACL formats.
Permission changes and deletions must take effect quickly.
Answers must cite sources.
The assistant must not retrieve and then filter private documents after generation.
Admins need auditability for "why did this answer use this document?"

Clarifying questions to ask:

What is the revocation target: seconds, minutes, or hours?
Can indexes be physically separated by tenant, or must filters enforce isolation?
Should the assistant fail closed when ACL freshness is unknown?

Solution guide

Strong answer shape:

Ingestion: connector workers pull content plus ACL snapshots and write immutable document versions.
Indexing: tenant or workspace isolation plus ACL metadata filters; deleted docs move to a tombstone state.
Query path: auth context -> eligible corpus filter -> hybrid retrieval -> rerank -> answer with citations.
Freshness: connector lag metric, ACL refresh jobs, deletion queue, and emergency purge path.
Audit: query ID, user identity, eligible filters, retrieved doc IDs, citation IDs, policy version.
Evals: recall slices by source, permission-denied tests, deletion tests, faithfulness checks.
Failure mode: if ACL state is stale or unknown, fail closed for sensitive sources.

Common misses: filtering after generation, no deletion story, no citation IDs, and no way to explain eligibility.

Follow-up guide

If asked about deletion, describe a fast tombstone path first, then slower compaction of embeddings and chunks. If asked about stale permissions, define a freshness SLO per source and fail closed for sensitive sources when the ACL snapshot is too old.
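
The two-speed deletion path can be sketched with plain dictionaries. This is a minimal sketch: `chunk_index` and `tombstoned_doc_ids` are hypothetical in-memory stand-ins for a real chunk index and a tombstone table.

```python
# Hypothetical in-memory stand-ins for a chunk index and a tombstone table.
chunk_index = {
    "handbook#0": {"doc_id": "handbook", "score": 0.91},
    "handbook#1": {"doc_id": "handbook", "score": 0.72},
    "roadmap#0": {"doc_id": "roadmap", "score": 0.88},
}
tombstoned_doc_ids: set[str] = set()


def delete_document(doc_id: str) -> None:
    # Fast path: one small write makes the document ineligible immediately.
    tombstoned_doc_ids.add(doc_id)


def eligible_chunk_ids() -> list[str]:
    # Queries check tombstones before ranking, even while chunks still exist.
    return [
        chunk_id
        for chunk_id, chunk in chunk_index.items()
        if chunk["doc_id"] not in tombstoned_doc_ids
    ]


def compact() -> int:
    # Slow path: physically remove chunks that belong to tombstoned documents.
    doomed = [
        chunk_id
        for chunk_id, chunk in chunk_index.items()
        if chunk["doc_id"] in tombstoned_doc_ids
    ]
    for chunk_id in doomed:
        del chunk_index[chunk_id]
    return len(doomed)


delete_document("handbook")
print(eligible_chunk_ids())  # tombstoned chunks are already hidden
removed = compact()
print("chunks removed by compaction:", removed)
```

The revocation SLO is met by the tombstone write; compaction only reclaims storage and can run on a slower schedule.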

For auditability, store enough to replay the eligibility decision: user identity, groups, source ACL version, query filters, retrieved chunks, citations shown, and model response ID. That lets an admin answer "why this document?" without exposing unrelated private documents.
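
One way to picture that replayable record is a small immutable structure. The sketch below is an assumption about shape only: `EligibilityRecord` and its field names are hypothetical, derived from the list in the paragraph above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EligibilityRecord:
    # Field names are illustrative assumptions, not a real schema.
    query_id: str
    user_id: str
    user_groups: tuple[str, ...]
    source_acl_versions: dict[str, str]  # source name -> ACL snapshot version
    query_filters: dict[str, str]
    retrieved_chunk_ids: tuple[str, ...]
    cited_chunk_ids: tuple[str, ...]
    response_id: str

    def was_cited(self, chunk_id: str) -> bool:
        return chunk_id in self.cited_chunk_ids


record = EligibilityRecord(
    query_id="q-123",
    user_id="u-42",
    user_groups=("employees",),
    source_acl_versions={"wiki": "acl-v17"},
    query_filters={"tenant": "acme", "groups": "employees"},
    retrieved_chunk_ids=("wiki/policy#2", "wiki/policy#5"),
    cited_chunk_ids=("wiki/policy#2",),
    response_id="resp-9",
)

# Serialize for an append-only audit store; IDs only, no document text.
serialized = json.dumps(asdict(record), sort_keys=True)

print(record.was_cited("wiki/policy#2"))
print(record.was_cited("wiki/policy#5"))
```

Storing identifiers and versions rather than document text keeps the audit trail itself from becoming a second copy of private content.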

Prompt 3: long-running coding agent service

Design a service that accepts repository tasks and runs a coding agent asynchronously.

Prompt details:

Users submit a repo, branch, task prompt, and tool permissions.
Jobs can run for 30 minutes and may need retries or human review.
The agent can create commits, run tests, and leave artifacts.
Users need live progress, cancellation, logs, and final diff review.
Secrets and production systems must be protected.

Clarifying questions to ask:

Is the service allowed to push branches, or should it only produce a patch artifact?
Which tools need network access, and which should run offline?
What happens when a user cancels during a tool call or test run?

Solution guide

Strong answer shape:

API: create task, get task status, stream events, cancel task, list artifacts.
State machine: queued, provisioning, running, blocked, review, failed, canceled, complete.
Execution boundary: sandbox with scoped repo checkout, network policy, secret redaction, time and disk limits.
Persistence: task row, run attempts, tool calls, logs, artifacts, branch/commit refs, cancellation flag.
Recovery: checkpoint workspace state, retry transient infra failures, never replay unsafe writes without idempotency.
Observability: per-tool latency, test results, token use, queue time, sandbox exits, reviewer actions.
Rollout and safety: permission presets, allowlisted tools, audit log, kill switch by tool or model route.

Common misses: no cancellation semantics, no sandbox boundary, no artifact model, and no distinction between retrying a read and replaying a write.

Follow-up guide

If asked about retries, separate infrastructure retries from agent-action retries. Retrying a sandbox provision is safe; replaying a commit, comment, or external API write needs idempotency or human review.
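
The idempotency requirement for replayed writes can be sketched as follows. This is a minimal sketch: `applied_writes` is a hypothetical in-memory stand-in for a durable table, and the key format is an assumption.

```python
import hashlib
from typing import Callable

# Hypothetical stand-in for a durable table: idempotency key -> stored result.
applied_writes: dict[str, str] = {}


def idempotency_key(task_id: str, step: int, action: str, payload: str) -> str:
    # Assumed key format: the same task, step, action, and payload map to one key.
    raw = f"{task_id}:{step}:{action}:{payload}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


def apply_write(key: str, perform: Callable[[], str]) -> tuple[str, bool]:
    """Return (result, replayed). A replay returns the stored result without re-running."""
    if key in applied_writes:
        return applied_writes[key], True
    result = perform()
    applied_writes[key] = result
    return result, False


calls = []


def create_commit() -> str:
    calls.append("commit")  # counts how many times the write really ran
    return "commit-abc123"


key = idempotency_key("task-7", step=3, action="create_commit", payload="fix flaky test")
first = apply_write(key, create_commit)
second = apply_write(key, create_commit)

print(first)
print(second)
print("real writes:", len(calls))
```

A retry after a crash reuses the same key, so the commit is created once. A production version would also need the check and the write to be atomic, which this sketch does not attempt.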

If asked about cancellation, make it cooperative and durable: set a cancellation flag, stop scheduling new tool calls, terminate the sandbox after a grace window, persist partial logs, and mark artifacts as incomplete. If asked about secrets, say the sandbox receives scoped, short-lived credentials and the event log redacts values before storage.
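
The cooperative part of cancellation can be sketched as a loop that checks a flag before each tool call. In this sketch `threading.Event` stands in for a durable cancellation flag on the task row, and `run_task` and `execute` are hypothetical names. Sandbox termination after the grace window is not shown.

```python
import threading
from typing import Callable


def run_task(
    tool_calls: list[str],
    cancel_flag: threading.Event,
    execute: Callable[[str], None],
) -> dict:
    completed = []
    for call in tool_calls:
        # Cooperative check: never schedule a new tool call after cancellation.
        if cancel_flag.is_set():
            return {"state": "canceled", "completed": completed, "artifacts_complete": False}
        execute(call)
        completed.append(call)
    return {"state": "review", "completed": completed, "artifacts_complete": True}


cancel_flag = threading.Event()


def execute(call: str) -> None:
    # Simulate a user cancel arriving while the test run is in progress.
    if call == "run_tests":
        cancel_flag.set()


result = run_task(["checkout", "run_tests", "create_commit"], cancel_flag, execute)
print(result)
```

The in-flight test run finishes, but the commit is never attempted, and the partial result is marked incomplete rather than discarded.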

Drill 1: API gateway and rate-control plane

The gateway is the front door. It authenticates API keys, resolves workspace and organization limits, estimates request cost, routes to a model or queue, and emits a request ID that support can follow.

Design checklist:

API keys map to workspace and organization.
Limits apply across requests/minute, tokens/minute, model, and tier.
Request IDs appear in responses and logs.
Tenant or client limits return 429; fleet or provider overload returns 503. Both carry a request ID and a bounded retry hint instead of silent queue growth.
Beta features are gated by explicit version or feature flags.
Rollout has canaries, kill switches, and regression checks.

Use Python to sanity-check rate math before drawing capacity boxes:

gateway-token-budget.py

requests_per_minute = 4_000
avg_input_tokens = 1_200
avg_output_tokens = 450
tokens_per_minute = requests_per_minute * (avg_input_tokens + avg_output_tokens)
tokens_per_second = tokens_per_minute / 60

print("tokens_per_minute:", tokens_per_minute)
print("tokens_per_second:", round(tokens_per_second))

Output

tokens_per_minute: 6600000
tokens_per_second: 110000

The estimate is an admission-time reservation, not the final bill. A streaming response needs a maximum output budget, then a reconciliation step when the stream ends:

streaming-quota-reconciliation.py

remaining_before = 10_000
reserved_tokens = 1_800
actual_tokens = 1_520

remaining_after_admission = remaining_before - reserved_tokens
remaining_after_reconcile = remaining_after_admission + (reserved_tokens - actual_tokens)

print("after admission:", remaining_after_admission)
print("after reconcile:", remaining_after_reconcile)

Output

after admission: 8200
after reconcile: 8480
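
The reserve-then-reconcile arithmetic above can be wrapped in a token bucket so the budget also refills over time. This is a minimal sketch: `TokenBucket`, its methods, and the refill rate are assumptions for illustration, and the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Illustrative tokens/minute-style limiter with explicit timestamps."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = 0.0

    def try_reserve(self, amount: float, now: float) -> bool:
        # Refill for elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if amount > self.tokens:
            return False  # the gateway would answer 429 with Retry-After
        self.tokens -= amount
        return True

    def refund(self, amount: float) -> None:
        # Reconciliation: return the unused part of a reservation.
        self.tokens = min(self.capacity, self.tokens + amount)


bucket = TokenBucket(capacity=10_000, refill_per_second=100)

admitted = bucket.try_reserve(1_800, now=0.0)  # reserve the worst-case estimate
bucket.refund(1_800 - 1_520)                   # stream ended with 1,520 tokens used
after_reconcile = bucket.tokens

too_soon = bucket.try_reserve(9_000, now=1.0)
later = bucket.try_reserve(9_000, now=10.0)

print(admitted, after_reconcile, too_soon, later)
```

The same numbers as the listing above produce the same 8,480 remaining tokens; the bucket only adds the refill behavior that a per-minute limit needs.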

Drill 2: inference scheduler

For LLM serving, the scheduler is where latency, cost, and fairness meet. NVIDIA documents in-flight batching as a way to interleave context and generation work so GPUs are used more efficiently while latency stays under control [1] (https://nvidia.github.io/TensorRT-LLM/features/paged-attention-ifb-scheduler.html). Mention these metrics:

Time to first token.
Tokens per second.
Queue wait time.
p95 and p99 end-to-end latency.
GPU utilization.
Error rate and overload rejections.
Cost per successful request.

Start capacity planning with a measured workload-specific fleet capacity, not a generic GPU estimate:

scheduler-headroom.py

measured_capacity_tokens_per_second = 125_000
expected_demand_tokens_per_second = 110_000

headroom = measured_capacity_tokens_per_second - expected_demand_tokens_per_second
headroom_percent = headroom / measured_capacity_tokens_per_second * 100

print("headroom_tokens_per_second:", headroom)
print("headroom_percent:", round(headroom_percent, 1))

Output

headroom_tokens_per_second: 15000
headroom_percent: 12.0

For an interview-sized overload policy, estimate queue wait before admission. This approximation deliberately stays simple. A real scheduler also tracks prefill, decode work, GPU memory, and measured tail latency.

Keep rejection ownership visible. 429 Too Many Requests says the caller exceeded a tenant, key, or client policy and could succeed after its quota window resets. 503 Service Unavailable says the service fleet or an upstream provider lacks capacity for an otherwise eligible request. A full tenant bucket can produce 429 even when GPUs are idle; a saturated fleet can produce 503 for a tenant that is under quota.

scheduler-admission.py

def admit(queued_tokens: int, service_tokens_per_second: int, max_queue_wait_seconds: float) -> bool:
    estimated_wait = queued_tokens / service_tokens_per_second
    return estimated_wait <= max_queue_wait_seconds

print(admit(queued_tokens=20_000, service_tokens_per_second=125_000, max_queue_wait_seconds=0.25))
print(admit(queued_tokens=50_000, service_tokens_per_second=125_000, max_queue_wait_seconds=0.25))

Output

True
False
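
The rejection-ownership rule can be combined with the queue-wait estimate in one decision. The sketch below is a simplification under stated assumptions: `admission_decision` is a hypothetical helper, the tenant check runs first, and the wait estimate reuses the same formula as `admit`.

```python
def admission_decision(
    tenant_tokens_remaining: int,
    request_tokens: int,
    queued_tokens: int,
    service_tokens_per_second: int,
    max_queue_wait_seconds: float,
) -> int:
    """Return an HTTP status: 200 admit, 429 tenant limit, 503 fleet overload."""
    # The tenant owns this rejection: its own quota bucket is exhausted.
    if request_tokens > tenant_tokens_remaining:
        return 429
    # The service owns this rejection: the fleet cannot meet the queue SLO.
    estimated_wait = queued_tokens / service_tokens_per_second
    if estimated_wait > max_queue_wait_seconds:
        return 503
    return 200


# Full tenant bucket, idle fleet.
print(admission_decision(100, 1_800, 0, 125_000, 0.25))
# Tenant under quota, saturated fleet.
print(admission_decision(10_000, 1_800, 50_000, 125_000, 0.25))
# Tenant under quota, healthy fleet.
print(admission_decision(10_000, 1_800, 20_000, 125_000, 0.25))
```

Checking the tenant limit first means a caller over quota gets a 429 even during a fleet incident, which keeps the retry guidance accurate for that caller.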

Drill 3: permission-aware retrieval

For enterprise retrieval, the critical rule is: don't retrieve private data and filter it after generation. Permission constraints must be part of candidate selection, ranking, and auditing.

Architecture pieces:

Connector ingestion workers with backpressure.
Per-document ACLs or delegated auth checks.
Tenant-isolated indexes or strict metadata filters.
Hybrid retrieval plus reranking.
Citation output with source IDs.
Deletion and retention jobs.
Offline evals for recall and answer faithfulness.
Support traces that show which documents were eligible.

The ACL snapshot and hybrid index both feed candidate selection. That connection matters: unauthorized chunks shouldn't reach the reranker or model context.

For sensitive sources, encode the freshness decision as a fail-closed policy:

acl-freshness-policy.py

def source_is_eligible(snapshot_age_seconds: int, freshness_slo_seconds: int, sensitive: bool) -> bool:
    if sensitive and snapshot_age_seconds > freshness_slo_seconds:
        return False
    return True

print(source_is_eligible(snapshot_age_seconds=20, freshness_slo_seconds=60, sensitive=True))
print(source_is_eligible(snapshot_age_seconds=90, freshness_slo_seconds=60, sensitive=True))
print(source_is_eligible(snapshot_age_seconds=90, freshness_slo_seconds=60, sensitive=False))

Output

True
False
True

Filter tombstones and ACLs before ranking. The tiny fixture below makes the order visible:

acl-filter-before-rank.py

documents = [
    {"id": "public-access-policy", "groups": {"employees"}, "deleted": False, "score": 0.82},
    {"id": "finance-plan", "groups": {"finance"}, "deleted": False, "score": 0.99},
    {"id": "old-handbook", "groups": {"employees"}, "deleted": True, "score": 0.95},
]
user_groups = {"employees"}

eligible = [
    document
    for document in documents
    if not document["deleted"] and document["groups"] & user_groups
]
ranked_ids = [document["id"] for document in sorted(eligible, key=lambda item: item["score"], reverse=True)]

print(ranked_ids)

Output

['public-access-policy']

Drill 4: long-running coding agents

Long-running agent infrastructure has to persist intent, tool calls, artifacts, checkpoints, logs, and permissions. The main design risk isn't just failed execution. It's uncontrolled execution.

Cover:

Task states: queued, provisioning, running, blocked, review, failed, canceled, complete.
Checkpoints for resumability.
Tool permission scopes and audit logs.
Secret redaction.
Git branch and conflict handling.
Streaming progress.
Cancellation and deadlines.
Evals for task success and regression.

Write legal transitions down before discussing workers. That prevents a cancellation or retry from jumping into an impossible state:

agent-task-state-machine.py

ALLOWED_TRANSITIONS = {
    "queued": {"provisioning", "canceled"},
    "provisioning": {"running", "failed", "canceled"},
    "running": {"blocked", "review", "failed", "canceled"},
    "blocked": {"running", "review", "failed", "canceled"},
    "review": {"running", "complete", "canceled"},
    "failed": set(),
    "canceled": set(),
    "complete": set(),
}

def can_transition(current: str, target: str) -> bool:
    return target in ALLOWED_TRANSITIONS[current]

print(can_transition("running", "review"))
print(can_transition("complete", "running"))

Output

True
False

Retries need a policy boundary too. Reads and sandbox provisioning can retry automatically. External writes need an idempotency key or review:

agent-retry-policy.py

def retry_mode(action: str, has_idempotency_key: bool = False) -> str:
    if action in {"repo_read", "sandbox_provision"}:
        return "automatic"
    if has_idempotency_key:
        return "automatic-with-idempotency"
    return "human-review"

print(retry_mode("repo_read"))
print(retry_mode("create_commit"))
print(retry_mode("post_comment", has_idempotency_key=True))

Output

automatic
human-review
automatic-with-idempotency

Drill 5: eval gates and staged rollout

An evaluation monitor isn't just a dashboard. It turns fixed expectations and incident discoveries into a regression suite, then blocks launch when a critical slice fails.

eval-launch-gate.py

checks = {
    "retrieval_recall": (0.94, 0.92),
    "citation_faithfulness": (0.93, 0.95),
}
permission_leaks = 0

failures = [
    name
    for name, (observed, minimum) in checks.items()
    if observed < minimum
]
if permission_leaks > 0:
    failures.append("permission_leaks")

print("launch_allowed:", not failures)
print("failures:", failures)

Output

launch_allowed: False
failures: ['citation_faithfulness']
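
Aggregate thresholds can hide a failing slice. The sketch below shows a per-slice variant of the gate; the slice names, counts, and the 0.90 threshold are illustrative assumptions.

```python
# Hypothetical per-slice results: slice name -> (passed, total).
slice_results = {
    "wiki": (96, 100),
    "tickets": (47, 50),
    "finance_docs": (16, 20),
}
minimum_pass_rate = 0.90
critical_slices = {"finance_docs"}  # assumed to block launch on failure

total_passed = sum(passed for passed, _ in slice_results.values())
total_cases = sum(total for _, total in slice_results.values())
aggregate_pass_rate = total_passed / total_cases

failing_slices = sorted(
    name
    for name, (passed, total) in slice_results.items()
    if passed / total < minimum_pass_rate
)
launch_allowed = not (set(failing_slices) & critical_slices)

print("aggregate_pass_rate:", round(aggregate_pass_rate, 3))
print("failing_slices:", failing_slices)
print("launch_allowed:", launch_allowed)
```

The aggregate clears the threshold while the critical slice does not, which is the case a dashboard-only eval tends to miss.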

Keep online canaries reversible. Offline evals can pass while latency, provider errors, or permission failures regress under real traffic:

canary-rollout-decision.py

def rollout_decision(error_rate: float, p95_latency_ms: int, permission_failures: int) -> str:
    if permission_failures > 0:
        return "rollback"
    if error_rate > 0.01 or p95_latency_ms > 900:
        return "hold"
    return "expand"

print(rollout_decision(error_rate=0.004, p95_latency_ms=720, permission_failures=0))
print(rollout_decision(error_rate=0.004, p95_latency_ms=720, permission_failures=1))

Output

expand
rollback

Failure modes to avoid

Starting with implementation technology before naming user value.
Caching without invalidation, privacy, or freshness.
Monitoring without exact metrics.
Ignoring overload and support/debug needs.
Treating safety as a slogan instead of evals, permissions, staged rollout, and rollback.
Forgetting that agent systems need reversible actions and audit trails.

Mastery checklist

Design an API gateway with request IDs, workspaces, rate limits, beta gates, and overload behavior.
Design an inference scheduler with queue metrics, batching, fairness, tenant 429 behavior, and fleet 503 behavior.
Design a permission-aware retrieval system with ACLs, deletion, citations, and evals.
Design long-running coding-agent infrastructure with durable state, permissions, logs, and cancellation.
Design an evaluation monitor that turns incidents into regression tests and launch gates.

References

[1] NVIDIA. Paged Attention, IFB, and Request Scheduling. 2026. https://nvidia.github.io/TensorRT-LLM/features/paged-attention-ifb-scheduler.html

"Provider is down"	policy-approved fallback, circuit breaker, surfaced degraded mode	silently change model behavior
"Users need cancellation"	durable cancel flag, cooperative checks, sandbox termination	best-effort UI button only
"Support asks why"	request ID, policy version, route decision, retrieved IDs, trace	raw logs with no decision record
"Eval passes but users complain"	slice analysis, online canary metrics, incident cases into regression suite	argue offline eval is enough
"Costs doubled"	token accounting, cache hit tracking, model route policy, budget alerts	vague autoscaling

Whiteboard scorecard

Before the last five minutes, your design should contain:

One explicit API.
One durable data model.
One request path.
One overload behavior.
One security or permission boundary.
One observability story.
One rollout or eval gate.
One tradeoff with a reversal signal.

If any item is missing, add it before adding another component.

Prompt expansion matrix

Use this matrix to avoid over-practicing one design shape. A strong prep set covers control planes, data planes, eval loops, and human operations.

Prompt family	Clarify first	Core architecture	Stress follow-ups
Multi-model gateway	streaming, spend policy, beta access	auth -> quota -> route policy -> provider adapter -> audit	provider outage, 10x spike, fallback correctness
Inference scheduler	latency SLO, model size, GPU pool	admission queue -> batcher -> workers -> KV accounting	fairness, preemption, hot tenant, overload
Enterprise retrieval	revocation SLO, source types, citations	ingestion -> ACL snapshots -> filtered retrieval -> answer trace	deletion, stale ACLs, hybrid search, audit
Coding-agent service	tool permissions, write access, job length	task API -> sandbox -> event log -> artifacts -> review	cancellation, retries, secrets, stuck tools
Eval platform	launch criteria, judges, slices	dataset registry -> runners -> scorer -> report -> gate	noisy labels, drift, red-team cases, ownership
Data ingestion platform	freshness, schema drift, backfills	connectors -> normalized records -> version store -> index	dedupe, reprocessing, tombstones, source outage
Safety policy layer	allowed actions, review threshold	request context -> policy engine -> tool gate -> audit	false positives, emergency block, bypass attempts
Observability platform	support questions, retention, privacy	trace IDs -> logs/metrics/events -> support view -> replay	PII, sampling, high cardinality, incident workflow
Realtime voice or chat	latency budget, turn-taking, fallback	session gateway -> streaming model -> tool loop -> handoff	interruption, partial output, moderation, cost
Experiment system	unit of assignment, ramp plan, metrics	assignment service -> config -> logging -> analysis -> rollback	interference, sample ratio mismatch, guardrails

Scale math checklist

Every design should include one small calculation. It doesn't need perfect precision; it needs to expose the bottleneck.

System	Minimum math
Gateway	requests/minute, tokens/minute, worst-case output cap
Retrieval	documents, chunks/document, embedding storage, update rate
Scheduler	arrival rate, average service time, queue wait, GPU memory
Agent service	concurrent jobs, sandbox time, log/artifact storage, retry budget
Eval platform	examples per suite, runs per release, judge/model cost
Voice/chat	p95 latency budget split across network, model, tools, synthesis
Ingestion	source QPS, backfill duration, dedupe key cardinality

Use this phrasing:

I will size the constraint that most affects the design. If that assumption changes, the architecture boundary I would revisit is X.
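
As one instance of the Retrieval row, a rough embedding-storage estimate might look like the sketch below. Every number is an assumption chosen for illustration (corpus size, chunks per document, vector width, float32 values), and the estimate ignores index overhead, replicas, and metadata.

retrieval-embedding-storage.py

```python
documents = 2_000_000         # assumed corpus size
chunks_per_document = 12      # assumed average after chunking
embedding_dimensions = 1_024  # assumed vector width
bytes_per_value = 4           # float32

chunks = documents * chunks_per_document
raw_bytes = chunks * embedding_dimensions * bytes_per_value
raw_gib = raw_bytes / 2**30

print("chunks:", chunks)
print("raw_embedding_gib:", round(raw_gib, 1))
```

Under these assumptions the raw vectors come to about 91.6 GiB for 24 million chunks. The design question this exposes is whether the index fits on one node or needs sharding.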

Decision log habit

When a design has many possible components, keep a visible decision log:

Decision	Chosen	Rejected	Why	Reversal signal
Queue placement	before provider call	inside every adapter	one overload policy	adapter-specific SLO needed
Retrieval filter	before rerank/generation	post-generation filter	privacy fails closed	none for sensitive docs
Fallback model	policy-gated	automatic on any error	behavior may change	explicit customer opt-in
Agent writes	human-reviewed	direct writes	irreversible action risk	narrow, reversible tool scope

This keeps the conversation inspectable. Interviewers can disagree with a choice and still see that the choice was deliberate.
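
The "Fallback model" row can be made concrete with a small circuit breaker in front of a provider adapter. This is a sketch under stated assumptions, not a production breaker: `CircuitBreaker`, `choose_route`, the threshold of 3 failures, the 30-second cooldown, and the `fallback_allowed` flag are hypothetical names and values, and time is passed in explicitly so the behavior is easy to check.

provider-circuit-breaker.py

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    """Opens after consecutive failures, then allows a probe after a cooldown."""

    failure_threshold: int = 3      # assumed value
    cooldown_seconds: float = 30.0  # assumed value
    consecutive_failures: int = 0
    opened_at: Optional[float] = None

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = now

    def allows_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through.
        return now - self.opened_at >= self.cooldown_seconds


def choose_route(breaker: CircuitBreaker, now: float, fallback_allowed: bool) -> str:
    if breaker.allows_request(now):
        return "primary"
    if fallback_allowed:
        # Policy-gated: the caller opted in, and the degraded mode is surfaced.
        return "fallback-degraded"
    return "reject-503"


breaker = CircuitBreaker()
for second in (0.0, 1.0, 2.0):
    breaker.record_failure(now=second)

print(choose_route(breaker, now=5.0, fallback_allowed=False))
print(choose_route(breaker, now=5.0, fallback_allowed=True))
print(choose_route(breaker, now=40.0, fallback_allowed=False))
```

This prints `reject-503`, `fallback-degraded`, then `primary`. The fallback branch is reachable only when policy allows the behavior change, which is the reversal signal in the log.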

Clarification prompt patterns

Ask questions that change the design. Avoid long discovery interviews. Pick the three questions most likely to affect API, state, scale, or safety.

Prompt family	High-signal clarifying questions
Gateway	Are responses streamed? Which limits are hard stops? Which route changes require user opt-in?
Scheduler	What latency SLO matters: time to first token, full completion, or queue wait? Are tenants isolated by tier?
Retrieval	What revocation SLO is required? Should unknown ACL freshness fail closed? Are citations mandatory?
Agent platform	Which actions are irreversible? Can jobs push branches, or only produce patches? What must cancellation guarantee?
Eval platform	Which metric blocks launch? Who owns bad labels? How are incident cases promoted into regression suites?
Ingestion	What freshness target matters? Are deletions hard deletes, tombstones, or retention-window deletes?
Voice/chat	What is the p95 latency target? Is barge-in required? What happens when a tool call is slow?
Observability	Who asks "why did this happen?" Support, safety, billing, or engineering? What data must be redacted?

Use this opener:

I will ask three questions because they change the design: latency or scale, permission or safety, and debug or rollout.

45-minute board plan

Practice with a visible clock. A strong design round leaves time for follow-ups instead of spending 30 minutes drawing boxes.

Time	Output
0-4 min	goal, users, success metric, top risks
4-9 min	functional and non-functional requirements
9-14 min	one scale calculation that exposes the bottleneck
14-20 min	API and durable data model
20-28 min	request path with the smallest architecture that works
28-35 min	reliability, overload, permissions, and support/debug flow
35-40 min	rollout, eval gate, and rollback path
40-45 min	tradeoffs, reversal signals, and interviewer follow-ups

If the interviewer interrupts early, jump to the dominant constraint:

The part that most changes the design is [the dominant constraint]. I will size that first, then show the request path it forces.

Design debrief rubric

After each mock design, score the answer on evidence, not on whether the diagram looked impressive.

Signal	0	1	2
Goal	user and success metric unclear	goal named	goal tied to product risk or operator need
Requirements	list is generic	key functions named	tradeoffs and non-goals named
Scale math	no calculation	one rough estimate	estimate changes an architecture choice
API	hand-wavy endpoints	basic contract	contract includes errors, IDs, and auth context
Data model	boxes only	core entities	retention, versioning, and isolation considered
Request path	too broad	happy path clear	failure and support path included
Overload	queues forever	generic rejection	admission policy, fairness, degraded mode, and correct `429` vs `503` ownership
Permissions	mentioned late	auth boundary named	fail-closed behavior and audit path included
Observability	dashboard language	metrics listed	request ID, traces, decision records, and owner actions
Rollout	"canary" only	staged launch	eval gates, rollback trigger, and incident feedback loop
Tradeoff	one obvious choice	rejected option named	downside, mitigation, and reversal signal named

Target 18+ before treating a prompt family as ready. A lower score means repeat the same prompt with a different product surface.
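
The rubric has 11 signals scored 0 to 2, so the maximum is 22. A minimal scoring sketch follows; the signal keys and the `debrief` helper are hypothetical names for illustration, and only the 18-point target comes from the rubric.

debrief-score.py

```python
SIGNALS = [
    "goal", "requirements", "scale_math", "api", "data_model", "request_path",
    "overload", "permissions", "observability", "rollout", "tradeoff",
]
MAX_PER_SIGNAL = 2
READY_TOTAL = 18  # target from the rubric


def debrief(scores: dict) -> tuple:
    """Return (total, ready, signals scored below the top mark)."""
    for signal in SIGNALS:
        if scores.get(signal) not in (0, 1, 2):
            raise ValueError(f"score for {signal!r} must be 0, 1, or 2")
    total = sum(scores[signal] for signal in SIGNALS)
    below_top = [signal for signal in SIGNALS if scores[signal] < MAX_PER_SIGNAL]
    return total, total >= READY_TOTAL, below_top


scores = dict.fromkeys(SIGNALS, 2)
scores["overload"] = 1
scores["rollout"] = 0
print(debrief(scores))
```

This prints `(19, True, ['overload', 'rollout'])`. The list of weak signals tells you what to rehearse on the next repetition.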

Mock design prompts

Treat each prompt like a 45-minute design round. First write requirements, scale math, API, data model, request path, failure modes, and rollout plan. Then open the solution guide.

Prompt 1: model gateway for enterprise teams

Design an API gateway for teams calling multiple LLM providers through one company platform.

Prompt details:

Each organization has workspaces, users, API keys, and model access rules.
The gateway must enforce requests/minute, tokens/minute, and monthly spend limits.
Support needs a request ID that can explain which route, model, policy decision, and overload state happened.
Some models are beta-only and must be gated.
Traffic can spike 10x during customer-support incidents.

Clarifying questions to ask:

Are clients streaming responses, batch jobs, or both?
Should spend limits be hard stops, soft alerts, or tier-dependent?
Which decision must support explain first: auth failure, quota failure, route choice, or provider failure?

Solution guide

Strong answer shape:

Goal: reliable multi-model access with debuggable controls.
API: POST /v1/responses, GET /v1/requests/{id}, admin endpoints for keys and limits.
Data model: organization, workspace, key, route policy, model entitlement, usage bucket, request log.
Request path: gateway -> auth -> quota estimate -> route policy -> admission queue -> provider adapter -> stream.
Overload: return 429 with Retry-After when this tenant, key, or client exceeds its limit. Return 503 with a retry hint when healthy callers can't be admitted because fleet or provider capacity is exhausted. Route to an approved fallback only when product policy allows the behavior change.
Observability: request ID, model, token estimate, queue wait, provider latency, error class, policy version.
Rollout: canary new routes, kill switch beta models, regression tests for auth and quota bypass.

Common misses: no support path, no versioned policy record, no overload rejection, and no separation between authentication and authorization.
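
The overload rule in the answer shape can be sketched as one decision function. The function name, the `cause` strings, and the returned dictionary shape are assumptions for illustration. The status codes and the `Retry-After` header follow the rule above.

overload-status.py

```python
def overload_decision(tenant_over_limit: bool, capacity_available: bool, retry_after_seconds: int) -> dict:
    """Pick the rejection that names who owns the overload."""
    retry_header = {"Retry-After": str(retry_after_seconds)}
    if tenant_over_limit:
        # The caller exceeded its own limit, even if the fleet is also busy.
        return {"status": 429, "cause": "tenant_limit", "headers": retry_header}
    if not capacity_available:
        # A healthy caller cannot be admitted: fleet or provider capacity is exhausted.
        return {"status": 503, "cause": "capacity_exhausted", "headers": retry_header}
    return {"status": None, "cause": "admitted", "headers": {}}


print(overload_decision(tenant_over_limit=True, capacity_available=True, retry_after_seconds=10)["status"])
print(overload_decision(tenant_over_limit=False, capacity_available=False, retry_after_seconds=5)["status"])
print(overload_decision(tenant_over_limit=False, capacity_available=True, retry_after_seconds=0)["cause"])
```

This prints `429`, `503`, then `admitted`. Checking the tenant limit first keeps a noisy tenant from being told the platform is down.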

Follow-up guide

The gateway should log policy_version, entitlement_id, route_id, and quota_bucket_id on every request so support can explain both accepted and rejected traffic.
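
A decision record with those four fields might be emitted as one JSON line per request. In this sketch the `decision_record` helper, the `outcome` field, and all ID values are hypothetical. `route_id` is allowed to be empty because a request rejected at the quota step never reaches routing.

gateway-decision-record.py

```python
import json
from typing import Optional


def decision_record(
    *,
    request_id: str,
    outcome: str,
    policy_version: str,
    entitlement_id: str,
    quota_bucket_id: str,
    route_id: Optional[str] = None,
) -> str:
    """Serialize the fields support needs for accepted and rejected traffic."""
    record = {
        "request_id": request_id,
        "outcome": outcome,
        "policy_version": policy_version,
        "entitlement_id": entitlement_id,
        "quota_bucket_id": quota_bucket_id,
        "route_id": route_id,
    }
    return json.dumps(record, sort_keys=True)


accepted = decision_record(
    request_id="req-001", outcome="admitted", policy_version="policy-v7",
    entitlement_id="ent-42", quota_bucket_id="bucket-ws-9", route_id="route-primary",
)
rejected = decision_record(
    request_id="req-002", outcome="rejected_429", policy_version="policy-v7",
    entitlement_id="ent-42", quota_bucket_id="bucket-ws-9",
)
print(accepted)
print(rejected)
```

Writing the same record shape for both outcomes means a support view needs only the request ID to explain either case.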

Prompt 2: permission-aware enterprise retrieval

Design retrieval for an internal assistant that answers employee questions from company documents.

Prompt details:

Documents come from multiple systems with different ACL formats.
Permission changes and deletions must take effect quickly.
Answers must cite sources.
The assistant must not retrieve and then filter private documents after generation.
Admins need auditability for "why did this answer use this document?"

Clarifying questions to ask:

What is the revocation target: seconds, minutes, or hours?
Can indexes be physically separated by tenant, or must filters enforce isolation?
Should the assistant fail closed when ACL freshness is unknown?

Solution guide

Strong answer shape:

Ingestion: connector workers pull content plus ACL snapshots and write immutable document versions.
Indexing: tenant or workspace isolation plus ACL metadata filters; deleted docs move to a tombstone state.
Query path: auth context -> eligible corpus filter -> hybrid retrieval -> rerank -> answer with citations.
Freshness: connector lag metric, ACL refresh jobs, deletion queue, and emergency purge path.
Audit: query ID, user identity, eligible filters, retrieved doc IDs, citation IDs, policy version.
Evals: recall slices by source, permission-denied tests, deletion tests, faithfulness checks.
Failure mode: if ACL state is stale or unknown, fail closed for sensitive sources.

Common misses: filtering after generation, no deletion story, no citation IDs, and no way to explain eligibility.
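
The audit fields above imply two invariants that are cheap to check on every query: everything retrieved was eligible, and everything cited was retrieved. A minimal sketch, assuming a hypothetical `audit_violations` helper and made-up document IDs:

retrieval-audit-invariants.py

```python
def audit_violations(eligible_ids: set, retrieved_ids: list, citation_ids: list) -> list:
    """Return human-readable violations; an empty list means the trace is consistent."""
    violations = []
    for doc_id in retrieved_ids:
        if doc_id not in eligible_ids:
            violations.append(f"retrieved-not-eligible:{doc_id}")
    retrieved = set(retrieved_ids)
    for doc_id in citation_ids:
        if doc_id not in retrieved:
            violations.append(f"cited-not-retrieved:{doc_id}")
    return violations


print(audit_violations(
    eligible_ids={"handbook-v3", "travel-policy-v2"},
    retrieved_ids=["handbook-v3", "finance-plan-v1"],
    citation_ids=["handbook-v3", "travel-policy-v2"],
))
```

This prints `['retrieved-not-eligible:finance-plan-v1', 'cited-not-retrieved:travel-policy-v2']`. The first violation is a permission leak and the second is a citation the trace cannot explain. Either one is a candidate for a permission-denied or faithfulness regression test.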

Follow-up guide

Prompt 3: long-running coding agent service

Design a service that accepts repository tasks and runs a coding agent asynchronously.

Prompt details:

Users submit a repo, branch, task prompt, and tool permissions.
Jobs can run for 30 minutes and may need retries or human review.
The agent can create commits, run tests, and leave artifacts.
Users need live progress, cancellation, logs, and final diff review.
Secrets and production systems must be protected.

Clarifying questions to ask:

Is the service allowed to push branches, or should it only produce a patch artifact?
Which tools need network access, and which should run offline?
What happens when a user cancels during a tool call or test run?

Solution guide

Strong answer shape:

API: create task, get task status, stream events, cancel task, list artifacts.
State machine: queued, provisioning, running, blocked, review, failed, canceled, complete.
Execution boundary: sandbox with scoped repo checkout, network policy, secret redaction, time and disk limits.
Persistence: task row, run attempts, tool calls, logs, artifacts, branch/commit refs, cancellation flag.
Recovery: checkpoint workspace state, retry transient infra failures, never replay unsafe writes without idempotency.
Observability: per-tool latency, test results, token use, queue time, sandbox exits, reviewer actions.
Rollout and safety: permission presets, allowlisted tools, audit log, kill switch by tool or model route.

Common misses: no cancellation semantics, no sandbox boundary, no artifact model, and no distinction between retrying a read and replaying a write.
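
Cancellation semantics are easier to defend with a concrete loop. The sketch below assumes a hypothetical `run_task` worker and uses a plain dictionary as a stand-in for the durable cancellation flag on the task row. It shows cooperative checks between steps only; stopping a step that is already running would still need sandbox termination.

agent-cooperative-cancel.py

```python
def run_task(task_id: str, steps: list, cancel_flags: dict) -> tuple:
    """Run steps in order, checking the cancel flag before each one."""
    completed = []
    for name, action in steps:
        if cancel_flags.get(task_id, False):
            return "canceled", completed
        action()
        completed.append(name)
    # Finished work goes to review, not straight to complete.
    return "review", completed


cancel_flags = {"task-1": False}  # stand-in for a durable store


def user_cancels_during_edit() -> None:
    cancel_flags["task-1"] = True


steps = [
    ("checkout", lambda: None),
    ("edit", user_cancels_during_edit),
    ("run_tests", lambda: None),
]
print(run_task("task-1", steps, cancel_flags))
```

This prints `('canceled', ['checkout', 'edit'])`. The step in flight finishes, the next one never starts, and the completed list tells the user which artifacts exist.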

Follow-up guide

Drill 1: API gateway and rate-control plane

Design checklist:

API keys map to workspace and organization.
Limits apply across requests/minute, tokens/minute, model, and tier.
Request IDs appear in responses and logs.
Tenant or client limits return 429; fleet or provider overload returns 503. Both carry a request ID and a bounded retry hint instead of silent queue growth.
Beta features are gated by explicit version or feature flags.
Rollout has canaries, kill switches, and regression checks.

Use Python to sanity-check rate math before drawing capacity boxes:

gateway-token-budget.py

requests_per_minute = 4_000
avg_input_tokens = 1_200
avg_output_tokens = 450
tokens_per_minute = requests_per_minute * (avg_input_tokens + avg_output_tokens)
tokens_per_second = tokens_per_minute / 60

print("tokens_per_minute:", tokens_per_minute)
print("tokens_per_second:", round(tokens_per_second))

Output

tokens_per_minute: 6600000
tokens_per_second: 110000

The estimate is an admission-time reservation, not the final bill. A streaming response needs a maximum output budget, then a reconciliation step when the stream ends:

streaming-quota-reconciliation.py

remaining_before = 10_000
reserved_tokens = 1_800
actual_tokens = 1_520

remaining_after_admission = remaining_before - reserved_tokens
remaining_after_reconcile = remaining_after_admission + (reserved_tokens - actual_tokens)

print("after admission:", remaining_after_admission)
print("after reconcile:", remaining_after_reconcile)

Output

after admission: 8200
after reconcile: 8480

Drill 2: inference scheduler

Name the metrics the scheduler is judged on:

Time to first token.
Tokens per second.
Queue wait time.
p95 and p99 end-to-end latency.
GPU utilization.
Error rate and overload rejections.
Cost per successful request.

Start capacity planning with a measured workload-specific fleet capacity, not a generic GPU estimate:

scheduler-headroom.py

measured_capacity_tokens_per_second = 125_000
expected_demand_tokens_per_second = 110_000

headroom = measured_capacity_tokens_per_second - expected_demand_tokens_per_second
headroom_percent = headroom / measured_capacity_tokens_per_second * 100

print("headroom_tokens_per_second:", headroom)
print("headroom_percent:", round(headroom_percent, 1))

Output

headroom_tokens_per_second: 15000
headroom_percent: 12.0

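Then turn the queue-wait SLO into an admission check: estimate the wait from the tokens already queued, and reject new work once the estimate exceeds the budget.

scheduler-admission.py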

def admit(queued_tokens: int, service_tokens_per_second: int, max_queue_wait_seconds: float) -> bool:
    estimated_wait = queued_tokens / service_tokens_per_second
    return estimated_wait <= max_queue_wait_seconds

print(admit(queued_tokens=20_000, service_tokens_per_second=125_000, max_queue_wait_seconds=0.25))
print(admit(queued_tokens=50_000, service_tokens_per_second=125_000, max_queue_wait_seconds=0.25))

Output

True
False
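
A second quick check uses Little's law (average in-flight work equals arrival rate times average time in system) to turn arrival rate and service time into concurrency. The 6-second time in system and the limit of 32 concurrent sequences per replica are assumptions for illustration; measure both for the real workload.

scheduler-concurrency.py

```python
import math

requests_per_minute = 4_000                # assumed arrival rate
avg_seconds_in_system = 6.0                # assumed queue wait plus generation time
max_concurrent_sequences_per_replica = 32  # assumed limit set by GPU memory

# Little's law: in-flight requests = arrival rate x time in system.
in_flight_requests = requests_per_minute * avg_seconds_in_system / 60
replicas_needed = math.ceil(in_flight_requests / max_concurrent_sequences_per_replica)

print("in_flight_requests:", round(in_flight_requests))
print("replicas_needed:", replicas_needed)
```

Under these assumptions about 400 requests are in flight, which needs 13 replicas before any headroom is added. If time in system doubles during an incident, the in-flight count doubles with it, which is why queue wait belongs in the admission policy.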

Drill 3: permission-aware retrieval

For enterprise retrieval, the critical rule is: don't retrieve private data and filter it after generation. Permission constraints must be part of candidate selection, ranking, and auditing.

Architecture pieces:

Connector ingestion workers with backpressure.
Per-document ACLs or delegated auth checks.
Tenant-isolated indexes or strict metadata filters.
Hybrid retrieval plus reranking.
Citation output with source IDs.
Deletion and retention jobs.
Offline evals for recall and answer faithfulness.
Support traces that show which documents were eligible.

The ACL snapshot and hybrid index both feed candidate selection. That connection matters: unauthorized chunks shouldn't reach the reranker or model context.

For sensitive sources, encode the freshness decision as a fail-closed policy:

acl-freshness-policy.py

def source_is_eligible(snapshot_age_seconds: int, freshness_slo_seconds: int, sensitive: bool) -> bool:
    if sensitive and snapshot_age_seconds > freshness_slo_seconds:
        return False
    return True

print(source_is_eligible(snapshot_age_seconds=20, freshness_slo_seconds=60, sensitive=True))
print(source_is_eligible(snapshot_age_seconds=90, freshness_slo_seconds=60, sensitive=True))
print(source_is_eligible(snapshot_age_seconds=90, freshness_slo_seconds=60, sensitive=False))

Output

True
False
True

Filter tombstones and ACLs before ranking. The tiny fixture below makes the order visible:

acl-filter-before-rank.py

documents = [
    {"id": "public-access-policy", "groups": {"employees"}, "deleted": False, "score": 0.82},
    {"id": "finance-plan", "groups": {"finance"}, "deleted": False, "score": 0.99},
    {"id": "old-handbook", "groups": {"employees"}, "deleted": True, "score": 0.95},
]
user_groups = {"employees"}

eligible = [
    document
    for document in documents
    if not document["deleted"] and document["groups"] & user_groups
]
ranked_ids = [document["id"] for document in sorted(eligible, key=lambda item: item["score"], reverse=True)]

print(ranked_ids)

Output

['public-access-policy']

Drill 4: long-running coding agents

Long-running agent infrastructure has to persist intent, tool calls, artifacts, checkpoints, logs, and permissions. The main design risk isn't just failed execution. It's uncontrolled execution.

Cover:

Task states: queued, provisioning, running, blocked, review, failed, canceled, complete.
Checkpoints for resumability.
Tool permission scopes and audit logs.
Secret redaction.
Git branch and conflict handling.
Streaming progress.
Cancellation and deadlines.
Evals for task success and regression.

Write legal transitions down before discussing workers. That prevents a cancellation or retry from jumping into an impossible state:

agent-task-state-machine.py

ALLOWED_TRANSITIONS = {
    "queued": {"provisioning", "canceled"},
    "provisioning": {"running", "failed", "canceled"},
    "running": {"blocked", "review", "failed", "canceled"},
    "blocked": {"running", "review", "failed", "canceled"},
    "review": {"running", "complete", "canceled"},
    "failed": set(),
    "canceled": set(),
    "complete": set(),
}

def can_transition(current: str, target: str) -> bool:
    return target in ALLOWED_TRANSITIONS[current]

print(can_transition("running", "review"))
print(can_transition("complete", "running"))

Output

True
False

Retries need a policy boundary too. Reads and sandbox provisioning can retry automatically. External writes need an idempotency key or review:

agent-retry-policy.py

def retry_mode(action: str, has_idempotency_key: bool = False) -> str:
    if action in {"repo_read", "sandbox_provision"}:
        return "automatic"
    if has_idempotency_key:
        return "automatic-with-idempotency"
    return "human-review"

print(retry_mode("repo_read"))
print(retry_mode("create_commit"))
print(retry_mode("post_comment", has_idempotency_key=True))

Output

automatic
human-review
automatic-with-idempotency
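
The idempotency key only helps if something remembers it. A minimal in-memory sketch follows, assuming a hypothetical `WriteLedger` class and key format. A real service would store the key and the result durably and make the check and the write atomic.

agent-idempotent-write.py

```python
class WriteLedger:
    """Remembers the result of each write so a retry does not repeat it."""

    def __init__(self) -> None:
        self._results = {}

    def apply(self, idempotency_key: str, write):
        if idempotency_key in self._results:
            # Retry of a write that already happened: return the stored result.
            return self._results[idempotency_key]
        result = write()
        self._results[idempotency_key] = result
        return result


posted_comments = []


def post_comment() -> int:
    posted_comments.append("tests passed")
    return len(posted_comments)


ledger = WriteLedger()
first = ledger.apply("task-1:post_comment:1", post_comment)
retry = ledger.apply("task-1:post_comment:1", post_comment)
print(first, retry, len(posted_comments))
```

This prints `1 1 1`: the retry returns the stored result and the comment is posted once.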

Drill 5: eval gates and staged rollout

An evaluation monitor isn't just a dashboard. It turns fixed expectations and incident discoveries into a regression suite, then blocks launch when a critical slice fails.

eval-launch-gate.py

checks = {
    "retrieval_recall": (0.94, 0.92),
    "citation_faithfulness": (0.93, 0.95),
}
permission_leaks = 0

failures = [
    name
    for name, (observed, minimum) in checks.items()
    if observed < minimum
]
if permission_leaks > 0:
    failures.append("permission_leaks")

print("launch_allowed:", not failures)
print("failures:", failures)

Output

launch_allowed: False
failures: ['citation_faithfulness']

Keep online canaries reversible. Offline evals can pass while latency, provider errors, or permission failures regress under real traffic:

canary-rollout-decision.py

def rollout_decision(error_rate: float, p95_latency_ms: int, permission_failures: int) -> str:
    if permission_failures > 0:
        return "rollback"
    if error_rate > 0.01 or p95_latency_ms > 900:
        return "hold"
    return "expand"

print(rollout_decision(error_rate=0.004, p95_latency_ms=720, permission_failures=0))
print(rollout_decision(error_rate=0.004, p95_latency_ms=720, permission_failures=1))

Output

expand
rollback
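
The three decisions then drive a traffic ramp. The stage percentages and the `next_traffic_percent` helper below are assumptions for illustration; the point is that every decision maps to a traffic level that can be undone.

canary-ramp.py

```python
STAGES = [1, 5, 25, 100]  # assumed traffic percentages


def next_traffic_percent(current_percent: int, decision: str) -> int:
    if decision == "rollback":
        return 0
    if decision == "hold":
        return current_percent
    if decision == "expand":
        later_stages = [stage for stage in STAGES if stage > current_percent]
        return later_stages[0] if later_stages else current_percent
    raise ValueError(f"unknown decision: {decision!r}")


print(next_traffic_percent(5, "expand"))
print(next_traffic_percent(25, "hold"))
print(next_traffic_percent(25, "rollback"))
```

This prints `25`, `25`, then `0`.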

Failure modes to avoid

Starting with implementation technology before naming user value.
Caching without invalidation, privacy, or freshness.
Saying "monitoring" without naming the exact metrics.
Ignoring overload and support/debug needs.
Treating safety as a slogan instead of evals, permissions, staged rollout, and rollback.
Forgetting that agent systems need reversible actions and audit trails.

Mastery checklist

Design an API gateway with request IDs, workspaces, rate limits, beta gates, and overload behavior.
Design an inference scheduler with queue metrics, batching, fairness, tenant 429 behavior, and fleet 503 behavior.
Design a permission-aware retrieval system with ACLs, deletion, citations, and evals.
Design long-running coding-agent infrastructure with durable state, permissions, logs, and cancellation.
Design an evaluation monitor that turns incidents into regression tests and launch gates.

Next Step

Continue to AI Lab Behavioral Interview

You'll translate production engineering evidence into values, judgment, disagreement, incident, and mission-fit answers without sounding rehearsed.
