LeetLLM
© 2026 LeetLLM. All rights reserved.

🏗️ Hard · System Design

Code Completion System

Design a real-time code completion path with context construction, measured serving latency, privacy controls, and stale-result suppression.

42 min read
Learning path
Step 144 of 155 in the full curriculum

Content moderation designed a low-latency safety pipeline that decides whether content can enter the product. Code completion uses many of the same production muscles, but the pressure point changes: every keystroke can create a fresh inference request, and stale answers are worse than no answer.

A code completion system predicts useful edits under a latency budget tight enough to run while a developer is still typing. This design chapter covers context collection, ranking, serving, privacy, and feedback loops for developer tools.

Imagine editing warehouse-routing code where the next conditional, test assertion, or API call appears before you move to another line. Code completion tools can provide dimmed inline suggestions as a developer types, and can also offer next-edit suggestions.[1] Building that experience means balancing speed, context, and data handling. In this chapter's design scenario, the inline path has a 200 ms service objective; a real product must set and validate its own objective from user research and telemetry.

[Figure: Code completion architecture showing editor snapshot, context engine, gateway, inference, and ghost text response, plus separate local and LLM serving lanes and an inline latency budget chart.]
Code completion is a keystroke-latency system: local semantic completion handles exact cases, while remote generation spends work only when context and product value justify it.

Code completion isn't just a popup list anymore. It has evolved into an intelligent developer assistant that uses context to predict and generate entire logic blocks, reducing cognitive load and automating repetitive tasks.

Modern coding assistants now span inline completion (suggestions at the cursor), next edit suggestions (updating nearby code without moving the cursor), and full coding agents (systems that can plan and execute multi-file changes from a higher-level request).

By this point in the curriculum, you've already seen how Transformers predict the next token and how the KV cache avoids recomputing shared work. This article puts those ideas into a product: a code completion system that may consider each edit event, but only sends qualified requests and only displays fresh results.

The evolution of code completion

To understand how modern completion works, it helps to see how we got here.

| Era | Technology | Key Feature |
| --- | --- | --- |
| 1970s-90s | Lexical completion & symbol tables | Local identifier and keyword completion |
| 1996+ (IntelliSense, Microsoft's contextual code-completion service) | Static analysis & ASTs | Type-aware method suggestions |
| 2010s | N-gram models & basic ML | Statistical patterns from GitHub |
| 2021+ (Copilot) | LLMs & transformers | Entire-function generation |
| 2025+ (Agentic) | Retrieval + tool use | Multi-file editing, tests, terminal actions |

Static and rule-based

Early completion engines were mostly lexical. They used token scanners, prefix indexes, symbol tables, and simple scope rules. That was enough to complete local variables and imported symbols, but not enough to understand intent.

Semantic completion

IntelliSense popularized semantic completion in mainstream IDEs. Under the hood, modern semantic engines combine parsing, symbol tables, type inference, and compiler services. They know order.status is a string field on a concrete type, not just another token that happens to follow a dot.

The transformer revolution

Starting in 2021, tools like GitHub Copilot shifted the approach. Instead of enumerating symbols from a parser, these tools use Large Language Models trained on source code to predict entire lines or blocks from the surrounding text.

Agentic era

By 2026, many coding products have started to blend inline completion with broader agent workflows such as retrieval, terminal commands, and repository-wide edits.[2] That's a layer above plain autocomplete, but it still depends on the same foundations: fast context gathering, good candidate generation, and tight latency control.

System requirements

An inline suggestion competes with the user's next keystroke. If it arrives too late, the user has already written past it. If it is irrelevant, it becomes a distraction rather than an aid.

Designing for this environment requires balancing competing priorities: providing the model with enough context to be accurate while executing the inference quickly enough to be useful. The key constraints include:

  • Latency: For this scenario, keep single-line inline completion under 200 ms at p95; permit a larger separately measured budget for multi-line suggestions.
  • Context: Must use open files, imports, function signatures, and project structure, not just the current file.
  • Quality: Healthy acceptance rates plus strong accepted-and-retained character metrics. The exact target varies by language, editor UX, and how aggressively the client decides to show suggestions.
  • Scale: Large global developer fleets with strong diurnal traffic spikes, requiring efficient GPU utilization.
  • Privacy: Define what code may leave the client, what is retained, whether training use is disabled by default, and how tenants are isolated.

Meeting these requirements requires more than exposing a chat endpoint. The full pipeline, from the client-side edit listener to the inference engine, needs measurement for time-to-first-token (TTFT). TTFT measures the delay from request submission until the first response token arrives.
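A minimal sketch of that measurement, assuming the response arrives as a token iterator. The function name `measure_ttft_ms` and the injectable `clock` parameter are illustrative choices, not part of any serving library; the fake clock only makes the demo deterministic.

```python
import time
from typing import Callable, Iterable, Optional


def measure_ttft_ms(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[Optional[float], list[str]]:
    """Return (milliseconds until the first token, all tokens received)."""
    start = clock()  # taken just before the stream is consumed
    ttft_ms: Optional[float] = None
    tokens: list[str] = []
    for token in stream:
        if ttft_ms is None:
            # Only the first token stops the TTFT timer.
            ttft_ms = (clock() - start) * 1000.0
        tokens.append(token)
    return ttft_ms, tokens


# Deterministic demo: a fake clock that reads 10.000 s, then 10.078 s.
fake_clock = iter([10.000, 10.078]).__next__
ttft_ms, tokens = measure_ttft_ms(iter(["is_valid", " = ", "True"]), clock=fake_clock)
print(round(ttft_ms, 1), tokens)
```

In a real client the start timestamp would be taken when the request is submitted, so that network time is included in the figure.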

One practical way to reason about the budget is to split it across stages: editor + network overhead, context assembly, first-token inference, and UI rendering. The exact numbers vary by region and model size, but the important point is that every stage is on the clock.

| Stage | Typical budget | Notes |
| --- | --- | --- |
| Client event handling + local parse | 5-15 ms | Capture the keystroke, cursor position, and lightweight syntax state. |
| Network round trip | 20-70 ms | Depends heavily on region and whether the request stays close to the user. |
| Context assembly | 10-40 ms | Build the prompt, gather nearby symbols, and fetch a few related files. |
| First-token inference | 40-90 ms | Usually the hardest budget to hit because prefill dominates. |
| UI render | 5-15 ms | Paint ghost text and avoid jank in the editor. |

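The upper ends of those ranges sum to 230 ms, so the stages cannot all run at their ceilings. The scenario below picks one value inside each range, and the total fits under a 200 ms p95 objective. Multi-line suggestions can have a different objective, but they still need cancellation and freshness checks.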

Use an executable budget check instead of treating a latency target as a promise. This small calculation fails the candidate path when any stage pushes total latency over the scenario objective:

latency-budget-check.py
```python
STAGE_BUDGET_MS = {
    "client_parse": 12,
    "network": 55,
    "context": 28,
    "time_to_first_token": 78,
    "paint": 10,
}
INLINE_P95_OBJECTIVE_MS = 200

total_ms = sum(STAGE_BUDGET_MS.values())
headroom_ms = INLINE_P95_OBJECTIVE_MS - total_ms

assert total_ms <= INLINE_P95_OBJECTIVE_MS
print("scenario_p95_budget_ms:", total_ms)
print("headroom_ms:", headroom_ms)
```
Output
```text
scenario_p95_budget_ms: 183
headroom_ms: 17
```


Architecture

The system consists of three main components: the IDE (Integrated Development Environment) extension (client), the API gateway (orchestration), and the inference engine (LLM).

The IDE Extension captures keystrokes and manages the local state (open tabs, cursor position). The Context Engine runs locally or on the gateway to select the most relevant code snippets to fit in the prompt window. The Language Server Protocol (LSP) is the standard interface between an editor and a language server, so the same semantic engine can power completion, go-to-definition, and diagnostics across multiple editors.[3]

In practice, deterministic semantic completion and LLM completion often run side by side. The language server produces symbol-aware candidates with exact type information, while the LLM produces longer infill candidates. The client can merge, rank, or gate these results depending on cursor position and confidence.

That hybrid design matters. For exact member completion after a dot, semantic candidates from the language server are often both faster and more reliable than free-form generation. The LLM earns its keep on longer spans, comments-to-code, and cases where the user intent isn't fully captured by the type system.
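The gating rule described above can be sketched as a small client-side chooser. Everything here is hypothetical: the `Candidate` shape, the `confidence` field, and the 0.5 threshold are assumptions for illustration, and a real client would tune or replace them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    text: str
    source: str        # "lsp" (language server) or "llm"
    confidence: float  # 0.0 to 1.0, as reported by the producer (assumed field)


def choose_candidate(
    text_before_cursor: str,
    candidates: list[Candidate],
    min_llm_confidence: float = 0.5,
) -> Optional[Candidate]:
    """Prefer exact semantic candidates after a dot, otherwise a confident LLM span."""
    lsp = [c for c in candidates if c.source == "lsp"]
    llm = [c for c in candidates
           if c.source == "llm" and c.confidence >= min_llm_confidence]
    if text_before_cursor.endswith(".") and lsp:
        return max(lsp, key=lambda c: c.confidence)
    pool = llm or lsp  # fall back to semantic candidates if the LLM abstains
    return max(pool, key=lambda c: c.confidence) if pool else None


candidates = [
    Candidate("status", "lsp", 0.9),
    Candidate("status == 'late':", "llm", 0.7),
]
print(choose_candidate("order.", candidates).source)
print(choose_candidate("if order", candidates).source)
```

The first call lands on the language-server candidate because the cursor follows a dot; the second picks the longer LLM span.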

The Inference Server hosts the LLM and handles the heavy compute, using techniques like continuous batching to serve thousands of users simultaneously.

Local fallback lane

Not every keystroke should go through the full remote generation path. A production client usually keeps a deterministic lane for the cheap, high-confidence cases:

  • Local symbols and imported APIs: Exact member completion from the parser or language server.
  • Prefix indexes or tries: Fast keyword and snippet lookup with predictable latency.
  • Fuzzy matching: Edit-distance recovery for small typos like pritn instead of print.

That local lane does two jobs. It gives you sub-50ms suggestions for exact matches, and it provides a graceful fallback when the network is slow, the model abstains, or the user is working in a restricted environment.
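A minimal sketch of that deterministic lane, using only the Python standard library. The vocabulary is a made-up stand-in for symbols the parser would supply, a sorted list with `bisect` plays the role of the prefix index, and `difflib.get_close_matches` stands in for an edit-distance matcher (it ranks by a similarity ratio rather than true edit distance).

```python
import bisect
import difflib

# Hypothetical symbol list; a real client would fill this from the parser.
VOCABULARY = sorted(["print", "property", "raise", "range", "return"])


def prefix_matches(prefix: str, limit: int = 5) -> list[str]:
    """Binary-search the sorted vocabulary, then scan while the prefix holds."""
    start = bisect.bisect_left(VOCABULARY, prefix)
    matches: list[str] = []
    for word in VOCABULARY[start:]:
        if not word.startswith(prefix) or len(matches) == limit:
            break
        matches.append(word)
    return matches


def local_complete(typed: str) -> list[str]:
    """Exact prefix hits first; fuzzy recovery only when there are none."""
    exact = prefix_matches(typed)
    if exact:
        return exact
    return difflib.get_close_matches(typed, VOCABULARY, n=1, cutoff=0.6)


print(local_complete("pr"))     # ['print', 'property']
print(local_complete("pritn"))  # ['print']
```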

By splitting the responsibilities across these layers, the architecture minimizes the volume of data sent over the network. The client handles lightweight heuristics and syntax parsing, ensuring that only highly qualified prompts reach the backend, where expensive GPU resources are dedicated strictly to token generation. The architecture visual above shows this flow from the client to the inference server and back.


Context gathering and management

The model needs enough context to make useful suggestions, but the live prompt budget is deliberately kept small because long prefills destroy latency. Even if the base model advertises a much larger context window, you can't simply stuff the whole repository into the keystroke path.

Context priority strategy

We layer context sources based on their immediate relevance to the cursor position. Since we can't fit an entire codebase into the model's context window, we must rank information strictly by its likelihood of influencing the next few tokens. The context engine builds the prompt dynamically, filling the available budget (e.g., 8k tokens) from the top priority down until space is exhausted:

| Priority | Source | Method | Rationale |
| --- | --- | --- | --- |
| 1 (Highest) | Code before cursor | Direct prefix | Immediate grammatical context. |
| 2 | Code after cursor (suffix) | FIM Suffix | Needed to close brackets, match types. |
| 3 | Imports & Definitions | Static Analysis | Types and functions used in the file. |
| 4 | Recently Edited Files | Temporal locality | Code you just touched is likely relevant. |
| 5 | Neighboring Files | Jaccard Similarity | Files that share imports with current file. |
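The priority fill can be sketched as a greedy loop. This is an illustration under stated assumptions: `count_tokens` is a whitespace stand-in for a real tokenizer, the source names are invented, and skipping a source that does not fit (rather than truncating it) is one possible policy among several.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def fill_context(
    sources: list[tuple[int, str, str]], budget_tokens: int
) -> tuple[list[str], int]:
    """Pick sources in priority order (1 = highest) while they fit the budget."""
    selected: list[str] = []
    remaining = budget_tokens
    for _priority, name, text in sorted(sources):
        cost = count_tokens(text)
        if cost <= remaining:
            selected.append(name)
            remaining -= cost
    return selected, remaining


# Synthetic sources as (priority, name, text) with known token counts.
sources = [
    (4, "recent_file", "tok " * 10),
    (1, "prefix", "tok " * 6),
    (3, "imports", "tok " * 4),
    (2, "suffix", "tok " * 2),
    (5, "neighbor_file", "tok " * 3),
]
selected, remaining = fill_context(sources, budget_tokens=14)
print(selected, remaining)  # ['prefix', 'suffix', 'imports'] 2
```

With a 14-token budget the prefix, suffix, and imports fit, and the two lower-priority files are left out.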

Fill-in-the-middle (FIM)

Standard causal language models predict the next token based only on the past (left-to-right). In coding, you often insert code in the middle of a file. If the model ignores the suffix (the code after the cursor), it might generate valid code that conflicts with the closing braces or logic below.

Analogy: Think of standard causal modeling like typing on a typewriter: you can only add to the end of the page. FIM (Fill-in-the-Middle) is like a modern word processor: you can insert text in the middle of a sentence, and the system uses the surrounding context (both left and right) to ensure the insertion fits correctly.

FIM reorders the prompt so the model sees the suffix before generating the middle. Marker strings differ by tokenizer and model; the notation below names their roles rather than defining a universal API:

$$P_{\text{FIM}} = \text{<PRE>} \cdot x_{\text{prefix}} \cdot \text{<SUF>} \cdot x_{\text{suffix}} \cdot \text{<MID>}$$

[Figure: Fill-in-the-middle prompt construction where code before cursor becomes prefix, code after cursor becomes suffix, and model generates middle.]
FIM prompt construction gives the model both sides of the edit so generated code fits the surrounding function, tests, and closing syntax.

Building the FIM prompt

Build the input in FIM order: prefix marker + code before cursor + suffix marker + code after cursor + middle marker. Then the model generates the missing middle section.

Here is a concrete example. Imagine you're editing a warehouse routing module and your cursor sits inside an empty function body:

Prefix (code before cursor)

prefix-code-before-cursor.py
```python
from warehouse.routing import RoutePlanner

def validate_delivery_address(order):
    """Check if the shipping address is inside our delivery zone."""
```

Suffix (code after cursor)

suffix-code-after-cursor.py
```python
    return is_valid
```

FIM prompt sent to the model

```text
<PRE>from warehouse.routing import RoutePlanner

def validate_delivery_address(order):
    """Check if the shipping address is inside our delivery zone."""
<SUF>    return is_valid
<MID>
```

The model now generates the middle section, using both the docstring above and the return is_valid below to infer that it should write a delivery-zone check, not a payment validator. Without the suffix, it could generate code that never produces is_valid, leaving the following line broken.

Before wiring up a real tokenizer, test the transformation itself. A serving adapter would replace these readable markers with the exact sentinel tokens required by its selected FIM-capable model.

format-fim-request.py
```python
def format_fim(prefix: str, suffix: str) -> str:
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

prefix = "def delivery_zone(order):\n    is_valid = "
suffix = "\n    return is_valid\n"
prompt = format_fim(prefix, suffix)

assert prompt.endswith("<MID>")
assert prompt.index("<SUF>") < prompt.index("return is_valid")
print(prompt.replace("\n", "\\n"))
```
Output
```text
<PRE>def delivery_zone(order):\n    is_valid = <SUF>\n    return is_valid\n<MID>
```

FIM gives a decoder-only model suffix information without changing the left-to-right decoding mechanism. Bavarian et al. found that transformed training data can add infilling ability while preserving ordinary generation performance in their experiments.[4]

Repository-level context

To handle large repos, we can't rely on simple file buffering. We use two lightweight retrieval methods that run fast enough to stay inside the keystroke budget.

  1. Jaccard Similarity: Divide the number of unique tokens (variable names, imports) shared by the current file and another open file by the number of unique tokens across both. High overlap means high relevance. For example, if the active file route_planner.py imports RoutePlanner, AddressValidator, and DeliveryWindow, and address_utils.py shares two of those three names and uses one other name, the intersection has 2 names and the union has 4, so its Jaccard score is 2/4 = 0.5 (high enough to pull in a few symbol definitions from it).
  2. BM25 / Sparse Retrieval: A lightweight keyword search over the local repo index to find defining files for classes used in the current buffer.[5]

Dense vector retrieval is often kept off the hottest keystroke path unless it is cached or precomputed. Once you include lookup, reranking, and prompt assembly, sparse search or heuristic graph traversal can be easier to keep inside a small budget. The key is to pre-build a small symbol graph at editor startup rather than scanning a repository on every edit.
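As a sketch of the pre-built index idea, the simplest form of a symbol graph is a dictionary from symbol name to defining file, built once and then queried per keystroke. The example below uses Python's `ast` module on in-memory sources; the file names and contents are invented, and only top-level classes and functions are indexed.

```python
import ast

# Hypothetical repository contents, keyed by path.
REPO_FILES = {
    "address_utils.py": (
        "class AddressValidator:\n    pass\n\n"
        "def normalize_postcode(raw):\n    return raw.strip()\n"
    ),
    "planner.py": "class RoutePlanner:\n    def plan(self, order):\n        return []\n",
}


def build_symbol_index(files: dict[str, str]) -> dict[str, str]:
    """Map each top-level class or function name to the file that defines it."""
    index: dict[str, str] = {}
    for path, source in files.items():
        for node in ast.parse(source).body:  # top-level statements only
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                index[node.name] = path
    return index


# Built once (for example at editor startup), then reused on every edit.
SYMBOL_INDEX = build_symbol_index(REPO_FILES)


def defining_files(used_symbols: set[str]) -> list[str]:
    """Constant-time lookups per symbol; unknown names are ignored."""
    return sorted({SYMBOL_INDEX[s] for s in used_symbols if s in SYMBOL_INDEX})


print(defining_files({"RoutePlanner", "AddressValidator", "Unknown"}))
```

The per-edit cost is a handful of dictionary lookups, which is why this kind of index is easier to keep inside a small budget than a repository scan.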

This tiny selector makes the overlap heuristic concrete. It keeps only files whose symbols overlap the active symbols by a Jaccard score of at least 0.25; unrelated code is excluded from the live prompt:

select-nearby-context.py
```python
active_symbols = {"RoutePlanner", "AddressValidator", "DeliveryWindow"}
candidate_files = {
    "address_utils.py": {"AddressValidator", "DeliveryWindow", "PostalCode"},
    "billing.py": {"Invoice", "Payment"},
    "planner.py": {"RoutePlanner", "Truck"},
}

def jaccard(left: set[str], right: set[str]) -> float:
    return len(left & right) / len(left | right)

ranked = sorted(
    ((jaccard(active_symbols, symbols), path) for path, symbols in candidate_files.items()),
    reverse=True,
)
selected = [path for score, path in ranked if score >= 0.25]

assert selected == ["address_utils.py", "planner.py"]
print("selected_context:", selected)
```
Output
```text
selected_context: ['address_utils.py', 'planner.py']
```


Model serving for low latency

Serving code models requires optimizing for Time-To-First-Token (TTFT).

Speculative decoding

Speculative decoding reduces latency by using a smaller, faster "draft" model to predict the next $K$ tokens, which are then verified in parallel by a larger, more accurate "target" model.[6] Modern serving stacks expose this as an operational feature with workload-specific caveats, so it still needs acceptance-rate measurement before rollout.[7]

Think of speculative decoding as a senior engineer supervising a fast draft. The draft model proposes a few tokens, and the target model verifies them in parallel. If the target agrees, the client can stream several tokens after one target pass. If the draft misses, the target supplies the correction.

[Figure: Speculative decoding flow where a small draft model proposes tokens, a larger target model verifies them, accepted tokens stream as ghost text, and rejected tokens fall back to the target output.]
Speculative decoding helps when code boilerplate is predictable enough for the draft model to earn back its overhead.

The process works in three stages:

Draft model generation

The smaller draft model cheaply predicts candidate tokens. For example, it might generate def calculate_delivery_eta(order): based on the surrounding class and import context.

Target model verification

The larger target model runs one forward pass on the sequence and verifies the draft tokens.

Acceptance or correction

If the target model agrees with a prefix of the draft, the server emits that verified prefix without separate target decode steps for each accepted token. If there's a mismatch, generation proceeds from the target model's corrected token.
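The three stages can be sketched with plain lists. This is the greedy-match variant, assumed here for simplicity: a draft token is accepted only when it equals the target's own choice at that position. Sampling-based implementations use a probabilistic acceptance test instead. `target_choices` is a hypothetical stand-in for what one target forward pass would return.

```python
def verify_draft(draft: list[str], target_choices: list[str]) -> list[str]:
    """Return the tokens emitted by one verification step.

    target_choices[i] is the token the target model would pick at position i,
    given that draft[:i] was accepted. It has len(draft) + 1 entries because
    the same pass also scores the position after the last draft token.
    """
    emitted: list[str] = []
    for i, token in enumerate(draft):
        if token != target_choices[i]:
            # First mismatch: keep the verified prefix, add the correction.
            return emitted + [target_choices[i]]
        emitted.append(token)
    # Every draft token matched; the pass yields one extra target token.
    return emitted + [target_choices[len(draft)]]


draft = ["def", "route", "(", "order", ")"]
target = ["def", "route", "(", "self", ")", ":"]
print(verify_draft(draft, target))  # ['def', 'route', '(', 'self']
```

Here three draft tokens are accepted and the fourth is replaced by the target's choice, so generation resumes after `self`.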

It's particularly effective for code because code structure is highly repetitive (e.g., standard imports, boilerplate loops), making it easy for small models to guess correctly.

A concrete latency example

Suppose the draft model is 5x faster than the target model because it has fewer layers and smaller weights. The draft model guesses 5 tokens ahead. The target model verifies all 5 in a single parallel pass:

  • If 4 out of 5 tokens match: You accept 4 draft tokens, and the same verification pass also yields the target model's own token at the first mismatch, so the step emits 5 tokens. The cost is one target verification pass plus the cheap draft work (5 draft steps cost about one target step at 5x), or about 2 target steps in total. That is up to roughly a 2.5x speedup under these assumptions, and the exact gain depends on draft cost and acceptance length.
  • If only 1 out of 5 matches: You accept that single token, take the target's correction, and discard the rest: 2 tokens for about 2 target steps. Net speedup is roughly zero or slightly negative once extra overhead is counted, but you still make progress.
  • Worst case: The draft model misses the very first token. The target model outputs its own token, so the step still advances by one token, but it paid for the wasted draft work on top of the target pass and is slower than ordinary decoding for that step.
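The arithmetic for these cases can be checked with a small cost model. It is a rough sketch under simplifying assumptions: cost is measured in target-model steps, the draft is exactly 5x cheaper per step, and each verification pass emits the accepted draft tokens plus one token from the target itself (the correction at the first mismatch), so its figures are slightly higher than a count of accepted draft tokens alone. Real gains also depend on batching and memory effects that this ignores.

```python
def step_speedup(accepted: int, draft_len: int = 5, draft_speedup: float = 5.0) -> float:
    """Tokens emitted per unit of target-step cost for one speculative step."""
    tokens_emitted = accepted + 1  # accepted draft tokens + one target token
    cost_in_target_steps = draft_len / draft_speedup + 1.0  # draft work + one verify pass
    return tokens_emitted / cost_in_target_steps


# Ordinary decoding emits 1 token per target step, so 1.0 means break-even.
for accepted in (4, 1, 0):
    print("accepted:", accepted, "speedup:", round(step_speedup(accepted), 2))
```

Under this model the three cases come out at 2.5x, 1.0x, and 0.5x respectively.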

Repetitive code such as import blocks or standard exception handling may give a draft model useful acceptance rates. Treat that as a rollout hypothesis, not a property of all code workloads: measure accepted draft length and end-to-end latency by language and suggestion type before enabling it broadly.[6]

The rollout gate can be expressed as a small policy. Here, Python boilerplate improves latency enough to enable speculation, while a low-acceptance configuration remains on ordinary decoding:

gate-speculative-decoding.py
```python
measurements = {
    "python_imports": {"mean_accepted_tokens": 3.8, "baseline_p95_ms": 178, "spec_p95_ms": 136},
    "sql_queries": {"mean_accepted_tokens": 1.1, "baseline_p95_ms": 169, "spec_p95_ms": 176},
}

def enable_speculation(sample: dict[str, float]) -> bool:
    latency_gain_ms = sample["baseline_p95_ms"] - sample["spec_p95_ms"]
    return sample["mean_accepted_tokens"] >= 2.0 and latency_gain_ms >= 10

enabled = [name for name, sample in measurements.items() if enable_speculation(sample)]
assert enabled == ["python_imports"]
print("speculative_decode_enabled_for:", enabled)
```
Output
```text
speculative_decode_enabled_for: ['python_imports']
```

KV cache reuse (prefix caching)

Developers often type, pause, and type again in the same file. The file's header (imports, class definitions, and previously written functions) remains constant across these rapid sequential interactions.

Instead of prefilling the same stable file header on every request, the server can cache Key-Value (KV) blocks for a shared prefix. When the next prompt begins with the same cacheable blocks under the same tenant and model policy, it reuses those blocks and prefills only the uncached delta. Automatic prefix caching does not make generation of new output tokens cheaper; it removes duplicate prefill work for reused input context.[8]

Why this matters in practice

| Next request | Reusable prefix work | Remaining work |
| --- | --- | --- |
| Same stable header, a few characters appended | Skip prefill for matching cached blocks | Prefill new input delta, then decode suggestion tokens |
| Edit near the top of the file | Only blocks before the changed point can match | Prefill from first changed block onward, then decode |
| Different tenant, model, tokenizer, or cache policy | No permitted reuse | Full prompt prefill, then decode |

The savings therefore depend on stable prompt prefixes, block matching, routing affinity, and isolation rules. A long stable header with a small cursor-adjacent delta is a good candidate. A request that changed early context or crosses a tenant boundary is a miss by design.
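Block matching can be illustrated with integer token ids. This is a simplified sketch, not how any particular serving engine stores its cache: real systems typically hash blocks rather than compare them directly, and the block size of 4 is chosen only to keep the example small.

```python
def reusable_blocks(cached: list[int], new: list[int], block_size: int) -> int:
    """Count leading full blocks that are identical in both token sequences."""
    full_blocks = min(len(cached), len(new)) // block_size
    matched = 0
    for b in range(full_blocks):
        start = b * block_size
        if cached[start:start + block_size] != new[start:start + block_size]:
            break  # everything after the first changed block must be prefilled
        matched += 1
    return matched


BLOCK = 4
cached_prompt = list(range(10))            # token ids of the previous request
appended = cached_prompt + [99, 100]       # same header, a few tokens appended
edited_top = [0, 77] + cached_prompt[2:]   # edit inside the first block

for name, prompt in (("appended", appended), ("edited_top", edited_top)):
    hit = reusable_blocks(cached_prompt, prompt, BLOCK)
    print(name, "reused_blocks:", hit, "tokens_to_prefill:", len(prompt) - hit * BLOCK)
```

The appended request reuses two blocks and prefills 4 tokens, while the edit in the first block reuses nothing and prefills all 10.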

[Figure: Prefix cache reuse for code completion where consecutive requests in the same file share imports and function definitions, route to the same GPU shard, and reuse cached KV blocks.]
Prefix-aware routing keeps rapid edits on shards that can reuse stable file headers instead of repeating prefill work.

Use a cache key that enforces isolation as well as affinity. The example reuses a stable header for the same tenant and model, but never treats another tenant's identical text as a hit:

prefix-cache-contract.py
```python
def cache_key(tenant: str, model: str, tokenizer: str, prefix: str) -> tuple[str, str, str, str]:
    return tenant, model, tokenizer, prefix

stable_prefix = "from warehouse.routing import RoutePlanner\n"
cached = {
    cache_key("shopflow", "code-fim-v3", "tok-v3", stable_prefix): "kv-block-91",
}

same_scope = cache_key("shopflow", "code-fim-v3", "tok-v3", stable_prefix)
other_tenant = cache_key("marketplace-b", "code-fim-v3", "tok-v3", stable_prefix)

assert cached.get(same_scope) == "kv-block-91"
assert cached.get(other_tenant) is None
print("same_scope_hit:", same_scope in cached)
print("cross_tenant_hit:", other_tenant in cached)
```
Output
```text
same_scope_hit: True
cross_tenant_hit: False
```

This can eliminate repeated prefill work on sequential edits when the prefix stays stable and the reuse boundary is valid.

Quantization

Code models often tolerate post-training quantization, but you still have to measure acceptance and retained-edit metrics after compression. Running model weights in INT8 or even INT4 precision instead of a 16-bit format can reduce weight-memory bandwidth demands by roughly 2-4x, which materially helps completion workloads.[9][10]

This reduction is critical for deployment economics. Completion traffic is usually memory-bandwidth-bound rather than pure-FLOP-bound, so quantization can improve both throughput and TTFT. In practice, it can make the difference between serving a mid-sized code model on one 24 GB to 48 GB GPU versus needing multi-GPU model parallelism, depending on the KV-cache budget and context length.
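A back-of-envelope calculation shows where the single-GPU boundary falls. The 34-billion-parameter size is a hypothetical example, and the estimate covers weights only: it ignores the KV cache, activations, and the scale and zero-point metadata that quantized formats carry, so real footprints are larger.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


PARAMS_BILLIONS = 34  # hypothetical mid-sized code model
for label, bits in (("16-bit", 16), ("INT8", 8), ("INT4", 4)):
    gb = weight_memory_gb(PARAMS_BILLIONS, bits)
    fits = [f"{gpu} GB" for gpu in (24, 48) if gb < gpu]
    print(f"{label}: {gb:.1f} GB of weights, fits on: {fits or 'neither'}")
```

Under these assumptions the 16-bit weights (68 GB) exceed both card sizes, INT8 (34 GB) fits the 48 GB card, and INT4 (17 GB) fits the 24 GB card before any KV-cache budget is reserved.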

Techniques like GPTQ, a post-training quantization method for generative pre-trained transformers, apply this compression after training. This means you can take off-the-shelf open weights and compress them for your specific hardware limits using a small calibration dataset, without needing to run an expensive fine-tuning pass.


Request lifecycle and user experience

The client (IDE) needs request control. Sending remote work for every edit event increases cancelled work and can exhaust the service's latency capacity during bursts. Instead, the client manages requesting, waiting, and cancelling based on the current editor state.

Debouncing and cancellation

We implement a dynamic debounce strategy:

  • 0ms delay on "trigger characters" (e.g., ., (, \n).
  • 150ms delay on standard typing.

The client-side RequestManager serves as the gatekeeper, deciding when to hit the server and when to wait. It takes individual keystrokes as input and either triggers an immediate API request or schedules a delayed one based on the character type. The output is a controlled stream of API requests to the server, preventing network congestion. The implementation below demonstrates this debouncing logic. The on_type method is called on every keystroke: if the character is a trigger character (., (, \n), it fires a request immediately. For ordinary characters, it cancels any pending request and schedules a new one with a 150ms delay. It also tags each request with a monotonically increasing ID so late responses can be ignored safely:

debouncing-and-cancellation.py
1import asyncio 2 3class RequestManager: 4 """Manages debounce, cancellation, and stale-response suppression.""" 5 6 def __init__(self): 7 self.pending_task: asyncio.Task[None] | None = None 8 self.latest_request_id = 0 9 self.trigger_chars = {'.', '(', '\n'} 10 self.sent_requests: list[int] = [] 11 self.shown_suggestions: list[str] = [] 12 13 async def on_type(self, char: str) -> None: 14 if self.pending_task and not self.pending_task.done(): 15 self.pending_task.cancel() 16 17 delay_sec = 0.0 if char in self.trigger_chars else 0.15 18 self.latest_request_id += 1 19 request_id = self.latest_request_id 20 self.pending_task = asyncio.create_task( 21 self.debounce_fetch(delay_sec, request_id) 22 ) 23 24 async def debounce_fetch(self, delay_sec: float, request_id: int) -> None: 25 try: 26 await asyncio.sleep(delay_sec) 27 suggestion = await self.fetch_completion(request_id) 28 self.handle_response(request_id, suggestion) 29 except asyncio.CancelledError: 30 pass 31 32 async def fetch_completion(self, request_id: int) -> str: 33 # Real clients also attach request_id to an abortable HTTP request. 34 self.sent_requests.append(request_id) 35 return f"completion-{request_id}" 36 37 def handle_response(self, request_id: int, suggestion: str) -> None: 38 if request_id == self.latest_request_id: 39 self.show_ghost_text(suggestion) 40 41 def show_ghost_text(self, suggestion: str) -> None: 42 self.shown_suggestions.append(suggestion) 43 44async def demo() -> None: 45 manager = RequestManager() 46 await manager.on_type('r') 47 await asyncio.sleep(0.05) 48 await manager.on_type('o') 49 await asyncio.sleep(0.05) 50 await manager.on_type('.') 51 52 if manager.pending_task: 53 await manager.pending_task 54 55 manager.handle_response(2, "late-completion-2") 56 print("sent_requests:", manager.sent_requests) 57 print("shown_suggestions:", manager.shown_suggestions) 58 59asyncio.run(demo())
Output
sent_requests: [3]
shown_suggestions: ['completion-3']

In a production editor, you usually combine both layers: abort the HTTP request when possible, and still guard UI updates with request IDs in case the server races or ignores the cancellation.

Key features of modern systems

Modern coding systems expose features that go far beyond predicting the next word:

  • Ghost text: Inline suggestions that appear in gray directly in the editor as you type. If the user types a character matching the suggestion prefix, the client can keep the remaining text visible. An accept action such as pressing Tab inserts the remaining suggestion.[11]
  • Natural language to code: Writing a comment like // validate address against delivery zones and having the code appear automatically.
  • Repo-wide reasoning: The ability to ask "Where is the authentication logic handled?" and get an answer, or a code suggestion, grounded in that repository's architecture.
  • Unit test generation: Automatically writing tests for the code it just helped you create.
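
The ghost-text behavior in the first bullet can be sketched as a small client-side state update. This is a minimal illustration, not an editor API: `advance_ghost_text` and `accept` are hypothetical helper names, and a real client would also handle deletions, cursor moves, and multi-line suggestions.

```python
from typing import Optional

def advance_ghost_text(remaining: str, typed: str) -> Optional[str]:
    """Return the suggestion text still worth showing, or None to dismiss it.

    If the typed text matches the start of the suggestion, the user is
    "typing through" it, so the unmatched tail stays visible.
    """
    if remaining.startswith(typed) and len(typed) < len(remaining):
        return remaining[len(typed):]
    return None

def accept(buffer: str, remaining: str) -> str:
    """Model an accept action (for example Tab): insert the visible tail."""
    return buffer + remaining

buffer = "re"
ghost = "turn total"                      # shown in gray after "re"

ghost_after_t = advance_ghost_text(ghost, "t")   # user types "t"
dismissed = advance_ghost_text(ghost_after_t, "x")  # user diverges

print(repr(ghost_after_t))
print(dismissed)
print(accept(buffer + "t", ghost_after_t))
```

Keeping the tail visible on a matching keystroke avoids a new server request for text the client already has.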

Evaluation metrics

How do we know if the system is good? Unlike typical conversational LLMs where quality is subjective and hard to measure automatically, code completion provides immediate, objective feedback: the user either accepts the suggestion or they don't. However, relying solely on simple acceptance isn't enough to capture the true value provided by the system.

A useful evaluation framework measures both the frequency of helpful suggestions and the work they retain, while checking latency objectives. Tracking these metrics across languages, suggestion classes, and traffic periods can reveal model regressions or infrastructure bottlenecks hidden by aggregates.

Figure: Code completion quality dashboard showing shown rate, acceptance rate, retained character rate, measured inline latency, and stale request cancellation.
Useful completion metrics reward retained work and freshness, not only raw acceptance clicks.
| Metric | Definition | Why it matters |
| --- | --- | --- |
| Acceptance Rate | % of shown suggestions inserted by the user. | GitHub's published study reported a 27% acceptance rate in its sample and found acceptance rate best predicted perceived productivity among its usage measurements. It is still gameable with tiny safe suggestions, so pair it with value metrics.[12] |
| Accepted-and-retained characters | Characters from accepted suggestions that still remain after a chosen observation window. | Captures value that survives editing, not just the click. The GitHub study measured unchanged and mostly unchanged completion persistence at several time windows, reinforcing why retention complements acceptance.[12] |
| Completion Shown Rate | % of eligible requests that actually surface a suggestion. | A model that abstains too often won't feel helpful even if the few suggestions it shows are accurate. |
| Latency P99 | 99th percentile (P99) response time. | Slow suggestions break flow. The tail matters as much as the median. |
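
The last two rows of the table can be computed from request logs with a few lines. The sketch below uses made-up latency samples and a hypothetical `percentile` helper based on the nearest-rank definition; production dashboards often use streaming estimators instead.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical telemetry: 250 eligible requests, 100 of which surfaced a suggestion.
eligible_requests = 250
latencies_ms = [80.0] * 97 + [150.0, 240.0, 900.0]

shown_rate = len(latencies_ms) / eligible_requests
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print("shown_rate:", f"{shown_rate:.0%}")
print("p50_ms:", p50)
print("p99_ms:", p99)
```

In this sample the median looks healthy at 80 ms while the P99 is three times higher, which is the gap the table warns about.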

Offline evaluation on benchmarks like HumanEval[13] is useful for catching model regressions, but it doesn't measure editor timing, shown-rate policy, or accepted edits that users later undo. An online experiment is needed to determine whether a model or context heuristic improves the product experience.

Optimizing solely for acceptance rate can lead to a model that only suggests short, obvious tokens like closing parens because they are safe. You need value metrics such as retained characters, not just click-through.

The metric computation should retain that distinction. In this example, three of the four shown suggestions are accepted, but only 39 of the 64 accepted characters survive later editing, and one accepted suggestion is discarded entirely:

measure-retained-completions.py
suggestions = [
    {"shown": True, "accepted_chars": 18, "retained_chars": 0},
    {"shown": True, "accepted_chars": 4, "retained_chars": 4},
    {"shown": True, "accepted_chars": 42, "retained_chars": 35},
    {"shown": True, "accepted_chars": 0, "retained_chars": 0},
]

shown = len(suggestions)
accepted = sum(item["accepted_chars"] > 0 for item in suggestions)
accepted_chars = sum(item["accepted_chars"] for item in suggestions)
retained_chars = sum(item["retained_chars"] for item in suggestions)

acceptance_rate = accepted / shown
retained_char_rate = retained_chars / accepted_chars
assert acceptance_rate == 0.75
assert round(retained_char_rate, 3) == 0.609
print("acceptance_rate:", f"{acceptance_rate:.0%}")
print("retained_char_rate:", f"{retained_char_rate:.1%}")
Output
acceptance_rate: 75%
retained_char_rate: 60.9%


Model architecture choices

Choosing the right model for code completion involves a trade-off between quality and latency. A more capable model may produce stronger multi-line candidates but arrive after the user has moved on. A fast model that emits incorrect syntax or APIs also fails the product objective.

The usual production pattern is to keep the keystroke path on a smaller, specialized model and reserve slower, more capable models for bigger edits or chat-style tasks.

Small vs. large models

A practical design tests smaller, completion-focused models on the keystroke path and reserves higher-latency model routes for broader edits. In practice, those roles often separate: inline-completion candidates tuned for FIM and low TTFT, and more capable general code models for harder multi-file work. Evaluate the split rather than assuming model size alone predicts usefulness. Useful techniques include:

  • Code-specific pre-training and fine-tuning: Models such as Qwen2.5-Coder[14] show how code-focused pre-training plus completion-specific tuning can make smaller models punch above their size on code-generation benchmarks.
  • Speculative decoding: Pair a small draft model with a larger target model so the target verifies several tokens at once.
  • Distillation: Training small models using outputs from larger models as training signal, transferring capability at lower inference cost.
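
To make the speculative decoding bullet concrete, here is a toy greedy variant. Everything in it is an assumption for illustration: the "models" are plain functions over token lists, `TARGET_TEXT` is invented, and verification is written as a loop, whereas a real server scores all drafted positions in one batched target forward pass. The original method in [6] uses a sampling-based acceptance rule; this sketch only covers the simpler greedy case.

```python
from typing import Callable

Model = Callable[[list[str]], str]  # maps a token context to the next greedy token

TARGET_TEXT = "return sum ( items ) / len ( items )".split()

def target_model(context: list[str]) -> str:
    """Stand-in for the large model: always continues TARGET_TEXT."""
    return TARGET_TEXT[len(context)] if len(context) < len(TARGET_TEXT) else "<eos>"

def draft_model(context: list[str]) -> str:
    """Stand-in for the small model: agrees with the target except at position 6."""
    return "max" if len(context) == 6 else target_model(context)

def speculative_step(context: list[str], draft: Model, target: Model, k: int) -> list[str]:
    # 1. The draft model proposes k tokens autoregressively.
    proposed: list[str] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target checks each position; keep the longest agreeing prefix.
    accepted: list[str] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction replaces the bad draft token
            return accepted
        accepted.append(token)
    # 3. All k drafts matched, so the target's next token comes for free.
    accepted.append(target(context + accepted))
    return accepted

def generate(draft: Model, target: Model, k: int) -> tuple[list[str], int]:
    context: list[str] = []
    target_passes = 0
    while "<eos>" not in context:
        context += speculative_step(context, draft, target, k)
        target_passes += 1
    return context[: context.index("<eos>")], target_passes

tokens, passes = generate(draft_model, target_model, k=4)
print(" ".join(tokens))
print("target verification passes:", passes)
```

The output is identical to what the target would produce on its own, but it takes 3 verification passes instead of one pass per token. The one drafted mistake ("max") costs only the tokens drafted after it.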

Fill-in-the-middle training details

FIM-capable models require a specific training objective. During pre-training, a subset of training examples undergoes a transformation:

  1. Split the sequence at two random points into (prefix, middle, suffix).
  2. Reorder to (prefix, suffix, middle) with special sentinel tokens between segments.
  3. Train the model to predict the middle segment given (prefix, suffix).

The two dominant FIM formats are:

  • PSM (Prefix-Suffix-Middle): <PRE>prefix<SUF>suffix<MID>middle. This is the most common format.
  • SPM (Suffix-Prefix-Middle): <SUF>suffix<PRE>prefix<MID>middle. Here the prefix and the generated middle form one contiguous span, which makes continuation slightly more natural and tends to score marginally higher on infilling benchmarks.

Bavarian et al. found that jointly training both formats transfers positively, and that a high FIM rate (they tested up to 90%) costs little or nothing on ordinary left-to-right generation, the "FIM-for-free" result.[4] In practice, teams still tune the mix empirically, because too much infilling-specific data can shift behavior on plain continuation. Bavarian et al. also recommend applying the FIM split at the character level so the model stays robust when the cursor lands in the middle of a token.
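
The transformation and the two formats described above can be sketched as a data-preparation step. This is an illustrative sketch only: the sentinel strings follow the notation used in this section, real tokenizers use their own dedicated sentinel tokens, and the `fim_rate` and `spm_share` defaults are placeholder values, not recommendations.

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def split_fim(text: str, rng: random.Random) -> tuple[str, str, str]:
    """Pick two character-level cut points and return (prefix, middle, suffix)."""
    a, b = sorted(rng.randint(0, len(text)) for _ in range(2))
    return text[:a], text[a:b], text[b:]

def to_psm(prefix: str, middle: str, suffix: str) -> str:
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def to_spm(prefix: str, middle: str, suffix: str) -> str:
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

def maybe_fim(text: str, rng: random.Random,
              fim_rate: float = 0.5, spm_share: float = 0.5) -> str:
    """Leave some documents as plain left-to-right text, transform the rest."""
    if rng.random() >= fim_rate:
        return text
    prefix, middle, suffix = split_fim(text, rng)
    formatter = to_spm if rng.random() < spm_share else to_psm
    return formatter(prefix, middle, suffix)

example = to_psm("def add(a, b):\n    return ", "a + b", "\n")
print(repr(example))
```

Because the cut points are character offsets, a split can land inside an identifier, which is the mid-token case the character-level recommendation targets.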

Multi-line vs. single-line suggestions

The system should dynamically decide how much code to suggest based on the current context and the likelihood of generating a complete logical block. Generating too much code when the user only wants a single word is distracting and wastes compute, while generating too little inside a new function body defeats the purpose of the tool:

| Trigger | Suggestion type | Scenario p95 objective |
| --- | --- | --- |
| Mid-expression (after ., () | Type-informed local suggestion first | 60 ms |
| End of line | Single line completion | 200 ms |
| After function signature | Multi-line body | 500 ms |
| Empty line in a function | Multi-line block | 500 ms |

A lightweight classifier on the client side can select a request class before sending remote work, allowing the server to use different models or decoding strategies per type. For example, multi-line completions might be routed to a more capable model, while exact member completion remains local.

route-completion-request.py
def route(trigger: str, after_signature: bool) -> tuple[str, int]:
    if trigger in {".", "("}:
        return "LOCAL_SEMANTIC", 60
    if after_signature:
        return "REMOTE_MULTILINE", 500
    return "REMOTE_INLINE", 200

cases = [
    route(".", False),
    route("\n", True),
    route("r", False),
]
assert cases == [
    ("LOCAL_SEMANTIC", 60),
    ("REMOTE_MULTILINE", 500),
    ("REMOTE_INLINE", 200),
]
print("routes:", cases)
Output
routes: [('LOCAL_SEMANTIC', 60), ('REMOTE_MULTILINE', 500), ('REMOTE_INLINE', 200)]


Production scaling

Serving an LLM is compute-intensive, and code completion adds high-churn requests that can become stale while a developer types. Provisioning and admission control should preserve measured latency objectives during traffic spikes.

Many completion requests are short or cancelled before display. Measuring that churn makes request suppression, prefix reuse, and abstention policies concrete cost decisions rather than assumed wins.

GPU fleet management and load balancing

Running code completion for a global developer fleet requires GPU provisioning and routing that balance latency, cache affinity, and isolation. The gateway can distribute requests using several strategies:

  • Prefix-aware routing: Route requests that share long prompt prefixes to the same GPU instance to maximize KV cache hit rates. If Developer A is editing src/auth/handler.py, subsequent requests for that file should hit the same shard.
  • Geographic routing: Prefer an approved nearby GPU region when residency and capacity permit. In this scenario, a 50 ms network round trip consumes one quarter of the 200 ms objective.
  • Model tiering: Use smaller models for single-token completions and larger models for multi-line suggestions, routing based on the expected suggestion type.
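
Prefix-aware routing from the first bullet can be approximated with a stable hash. The sketch below assumes the gateway can derive a stable prefix key (here simply the file path) and that shard names are known. `pick_shard` and the shard names are hypothetical, and a real gateway would also weigh load and health.

```python
import hashlib

def pick_shard(tenant: str, model: str, stable_prefix: str, shards: list[str]) -> str:
    """Map requests that share (tenant, model, prefix) to the same shard.

    A cryptographic hash is used only because it is stable across processes;
    Python's built-in hash() is randomized per interpreter run.
    """
    key = "\x1f".join((tenant, model, stable_prefix)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

shards = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]
first = pick_shard("shopflow", "code-fim-v3", "src/auth/handler.py", shards)
second = pick_shard("shopflow", "code-fim-v3", "src/auth/handler.py", shards)
print("same shard for repeated edits:", first == second)
```

Modulo hashing remaps most keys when the shard count changes, so a fleet that scales often would likely prefer consistent or rendezvous hashing to keep caches warm.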

Continuous batching

Waiting to fill a fixed batch can add queue delay before first-token work begins. Continuous batching allows requests to join and leave an active serving schedule. Frameworks such as vLLM[15] and TGI[16] support optimized serving features; in vLLM, PagedAttention manages KV-cache storage in blocks. Tune batching against TTFT and throughput because higher utilization can still harm inline latency when queues grow.
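
A toy scheduler shows the join-and-leave behavior described above. It is a deliberately simplified model, not how vLLM or TGI are implemented: each step decodes one token for every active request, prefill cost is ignored, and `continuous_batching` plus the request tuples are invented for this example.

```python
from collections import deque

def continuous_batching(
    requests: list[tuple[str, int, int]], max_batch: int
) -> dict[str, tuple[int, int]]:
    """requests holds (name, arrival_step, tokens_to_generate) with tokens >= 1.

    Returns {name: (first_token_step, finish_step)}.
    """
    pending = deque(sorted(requests, key=lambda r: r[1]))
    remaining: dict[str, int] = {}
    first_token: dict[str, int] = {}
    finished: dict[str, int] = {}
    step = 0
    while pending or remaining:
        # Admit arrived requests whenever a batch slot is free.
        while pending and pending[0][1] <= step and len(remaining) < max_batch:
            name, _, tokens = pending.popleft()
            remaining[name] = tokens
            first_token[name] = step
        # One decode iteration for every active request.
        for name in list(remaining):
            remaining[name] -= 1
            if remaining[name] == 0:
                finished[name] = step
                del remaining[name]  # slot is free for the next step
        step += 1
    return {name: (first_token[name], finished[name]) for name in first_token}

timeline = continuous_batching(
    [("a", 0, 6), ("b", 0, 2), ("c", 1, 2), ("d", 2, 1)], max_batch=2
)
for name, (start, end) in timeline.items():
    print(f"{name}: first token at step {start}, done at step {end}")
```

Requests c and d receive their first token at steps 2 and 4 while the long request a is still decoding. A fixed batch of a and b would have held them until a finished at step 5.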

Cost economics

At scale, code completion is expensive because the IDE emits a steady stream of short-lived requests while the user is typing. The key design question isn't just cost per token. It's cost per useful suggestion.

  • High churn: Many requests are cancelled before the user ever sees the output.
  • Prefill-heavy workload: Much of the cost sits in reading context, not in generating long responses.
  • Useful north-star metric: Track cost per accepted-and-retained suggestion, not just cost per request.

Prefix caching, request cancellation, and better abstention policies improve both cost and user experience at the same time.
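
The difference between cost per request and cost per useful suggestion is simple arithmetic. The funnel counts and GPU cost below are invented for illustration and assume cancelled requests still consume compute.

```python
def cost_per_unit(total_cost_usd: float, units: int) -> float:
    """Spread a cost over a count of requests or suggestions."""
    return total_cost_usd / units

# Hypothetical one-hour window for one model route.
window = {
    "requests": 10_000,    # everything the IDE sent, including cancelled work
    "shown": 3_000,        # suggestions that reached the editor
    "accepted": 900,       # suggestions the user inserted
    "retained": 540,       # accepted suggestions still present after the window
    "gpu_cost_usd": 4.00,
}

per_request = cost_per_unit(window["gpu_cost_usd"], window["requests"])
per_retained = cost_per_unit(window["gpu_cost_usd"], window["retained"])

print(f"cost per request:            ${per_request:.4f}")
print(f"cost per retained suggestion: ${per_retained:.4f}")
print(f"ratio: {per_retained / per_request:.1f}x")
```

With these numbers, each retained suggestion costs about 18.5 times the per-request figure. Suppressing requests that would never be shown lowers the numerator without touching the retained count.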


Security, challenges, and ethics

Code completion systems may process proprietary source code, configuration, credentials accidentally present in buffers, and repository metadata. Privacy and security architecture therefore determine what the product is allowed to see and retain.

A useful design begins with explicit contracts: which buffers may be uploaded, whether prompts can be logged or used for training, how long operational data is retained, and which controls prevent cross-tenant reuse.

Data isolation

For an enterprise configuration that forbids code reuse across organizations, isolate tenant data across request ingestion, telemetry, and caches:

  • No training on user code by default. Model weights are frozen at deployment. User code is processed transiently for inference only.
  • Prompt data retention: Define explicit, minimal retention policies. Enterprises often require no raw-code retention or short, auditable windows for operational logs.
  • Tenant isolation: In enterprise deployments, ensure that one organization's code context never leaks into another's KV cache or batch. In practice that usually means cache partitioning, tenant-aware batching, and auditable isolation boundaries.
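
One way to express the cache-partitioning idea from the last bullet is to make the tenant part of every cache key. The class below is a hypothetical in-memory stand-in for a prefix cache, not a real serving-framework API; it only shows that identical prompts from different tenants never resolve to the same entry.

```python
import hashlib
from typing import Optional

class TenantScopedPrefixCache:
    """Toy prefix cache whose keys always include the tenant and model."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str, str], str] = {}

    @staticmethod
    def _key(tenant: str, model: str, prefix: str) -> tuple[str, str, str]:
        prefix_hash = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        return tenant, model, prefix_hash

    def put(self, tenant: str, model: str, prefix: str, kv_handle: str) -> None:
        self._entries[self._key(tenant, model, prefix)] = kv_handle

    def get(self, tenant: str, model: str, prefix: str) -> Optional[str]:
        return self._entries.get(self._key(tenant, model, prefix))

cache = TenantScopedPrefixCache()
shared_prefix = "import os\nimport sys\n"
cache.put("shopflow", "code-fim-v3", shared_prefix, kv_handle="blocks-17")

print(cache.get("shopflow", "code-fim-v3", shared_prefix))
print(cache.get("other-org", "code-fim-v3", shared_prefix))
```

The trade-off is a lower hit rate for common boilerplate prefixes, which is the price of the isolation guarantee.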

The isolation contract should be executable. A logging policy can retain safe metadata for latency debugging while dropping raw source by default:

apply-prompt-retention-policy.py
def audit_record(request: dict[str, str], retain_raw_code: bool) -> dict[str, str]:
    record = {
        "tenant": request["tenant"],
        "model": request["model"],
        "latency_bucket": request["latency_bucket"],
    }
    if retain_raw_code:
        record["prompt"] = request["prompt"]
    return record

request = {
    "tenant": "shopflow",
    "model": "code-fim-v3",
    "latency_bucket": "p95_under_200ms",
    "prompt": "API_TOKEN='secret-value'",
}
record = audit_record(request, retain_raw_code=False)

assert "prompt" not in record
print("audit_fields:", sorted(record))
Output
audit_fields: ['latency_bucket', 'model', 'tenant']

PII redaction

For enterprise usage, client-side redaction can reduce the chance that recognized secrets leave the developer's machine. Pattern and entropy-based scanners can catch some API keys, tokens, and passwords, while policy-based filters can mask selected identifiers before prompt construction. Scanners are incomplete, so redaction complements upload controls, restricted logging, access control, and incident response.
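
The entropy-based part of such a scanner can be sketched with Shannon entropy over characters. The length and entropy thresholds below are illustrative assumptions that would need tuning against real code, and the sample strings are invented. Ordinary identifiers can score surprisingly high, which is one reason these scanners produce both false positives and misses.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_like_secret(token: str, min_length: int = 20, min_entropy: float = 4.0) -> bool:
    """Flag long, high-entropy tokens as candidate secrets (heuristic only)."""
    return len(token) >= min_length and shannon_entropy(token) >= min_entropy

candidates = ["delivery_zone_validator", "q7Lx2VbN9ZkR4mTcW8pJ3yHd", "aaaaaaaaaaaaaaaaaaaaaaaa"]
for token in candidates:
    print(token, "->", looks_like_secret(token))
```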

If a policy promises that matching secret patterns won't be uploaded, replacement must run before upload. The system replaces matched strings with placeholders such as <API_KEY> before constructing the remote prompt.

When the server returns a generated completion, the client should only reinsert masked values if the placeholder maps to a known local value. Otherwise it should keep the placeholder visible and require an explicit user edit. That prevents the model from inventing a secret-looking string and having the client silently treat it as real.

mask-known-secrets-before-upload.py
import re

TOKEN = re.compile(r"sk_live_[A-Za-z0-9]+")

def redact(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}
    def replace(match: re.Match[str]) -> str:
        placeholder = f"<API_KEY_{len(mapping) + 1}>"
        mapping[placeholder] = match.group(0)
        return placeholder
    return TOKEN.sub(replace, text), mapping

prompt, local_mapping = redact("client = API('sk_live_abc123')")
assert "sk_live_" not in prompt
assert local_mapping["<API_KEY_1>"] == "sk_live_abc123"
print("upload_prompt:", prompt)
Output
upload_prompt: client = API('<API_KEY_1>')
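
The reinsertion rule described earlier in this section (restore a placeholder only when it maps to a known local value) could look like the following sketch. The `restore` helper and the `<API_KEY_n>` placeholder pattern are assumptions chosen to match the redaction example; they are not part of any particular editor or SDK.

```python
import re

PLACEHOLDER = re.compile(r"<API_KEY_\d+>")

def restore(completion: str, local_mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Swap known placeholders back to local values; report unknown ones."""
    unknown: list[str] = []

    def replace(match: re.Match) -> str:
        placeholder = match.group(0)
        if placeholder in local_mapping:
            return local_mapping[placeholder]
        unknown.append(placeholder)   # leave visible so the user must edit it
        return placeholder

    return PLACEHOLDER.sub(replace, completion), unknown

completion = "a = API('<API_KEY_1>'); b = API('<API_KEY_7>')"
restored, needs_user_edit = restore(completion, {"<API_KEY_1>": "sk_live_abc123"})

print(restored)
print("needs_user_edit:", needs_user_edit)
```

Here `<API_KEY_7>` was never produced by the local redaction step, so it stays in the text and is surfaced for an explicit user decision.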

On-premises deployment

Some enterprise policies prohibit sending source code to a shared external service. A product serving those customers may need Virtual Private Cloud (VPC), self-hosted, or offline deployment options.

  • Self-hosted models: Deploy suitable code models inside customer-controlled infrastructure. This changes trust boundaries, but still requires identity, network, logging, and supply-chain controls.
  • Air-gapped environments: Support fully offline operation for classified environments. The context engine, model, and inference server all run locally.

The mastery gap and ethical considerations

A balanced look at code completion must address the potential downsides. A common concern is the "mastery gap": if developers accept suggestions without understanding them, their debugging and design instincts can weaken over time.

On the positive side, code completion cuts repetitive boilerplate, lowers the memorization burden of syntax and API details, and acts as live documentation for unfamiliar libraries. It also makes programming more approachable for newcomers and non-native English speakers.

On the negative side, the model can hallucinate libraries that don't exist or emit deprecated syntax. It may suggest code with security vulnerabilities, or reproduce copyrighted fragments from its training data, which creates legal exposure. Telemetry and shared training pipelines can also leak proprietary code if the privacy architecture is weak. Finally, over-reliance is a real risk, as the mastery gap above describes.


Mastery check

You should be able to defend these design choices clearly:

  • Foundational: Design a priority-based context builder using cursor position, open files, suffix context, imports, and recently edited files.
  • Intermediate: Explain when deterministic semantic completion should beat LLM generation, including offline and low-latency fallback paths.
  • Intermediate: Specify an inline latency budget around 200 ms with explicit client, network, context assembly, inference, and rendering stages.
  • Advanced: Explain FIM training and why cursor insertion needs suffix context.
  • Advanced: Use speculative decoding, quantization, prefix caching, or smaller model tiers to meet inference-latency targets.
  • Advanced: Define quality metrics beyond acceptance rate, including shown rate, accepted-and-retained characters, stale-response rate, and latency P99.
  • Advanced: Implement client-side debouncing, cancellation, and request-ID gating to prevent stale UI updates and server floods.

Evaluation rubric

  • Foundational: Designs a context builder that prioritizes prefix, suffix, imports, recent edits, and nearby files.
  • Intermediate: Explains why semantic completion should beat the LLM after exact trigger characters.
  • Intermediate: Breaks inline latency into client, network, context assembly, inference, and render budgets.
  • Advanced: Explains FIM training and why suffix context matters for insertion.
  • Advanced: Chooses speculative decoding, prefix caching, quantization, or smaller model tiers based on latency bottleneck.
  • Advanced: Uses accepted-and-retained work, shown rate, stale-response rate, and latency tails instead of raw acceptance alone.
  • Advanced: Designs cancellation and request-ID gating so stale results never overwrite fresher editor state.

Common pitfalls

  • Symptom: Suggestions appear after the user already typed past them. Cause: No real cancellation path, or UI trusts arrival order instead of request IDs. Fix: Abort in-flight work when possible and gate rendering on newest request ID.

  • Symptom: Completions are often exact but still feel unhelpful. Cause: System optimizes for raw acceptance with tiny safe suggestions. Fix: Track accepted-and-retained characters and shown rate, not acceptance alone.

  • Symptom: Member completion is slower and less accurate after . than IDE autocomplete used to be. Cause: Every keystroke is routed to the LLM instead of keeping a semantic lane for deterministic cases. Fix: Let the parser or language server own exact symbol completion and reserve GPU work for open-ended spans.

  • Symptom: Prefix caching hit rate stays low even though users edit the same file repeatedly. Cause: Routing breaks shard affinity, so matching prefixes miss the cached KV blocks. Fix: Add prefix-aware routing keyed by tenant, model, and stable prompt prefix.

  • Symptom: Inserted code fights the code below the cursor. Cause: System ignores suffix context or uses left-to-right continuation where infill is required. Fix: Use FIM prompt formatting and FIM-trained models for in-file edits.

  • Symptom: Model quality looks strong in offline code benchmarks but users still dislike the product. Cause: Benchmarks miss stale-response behavior, latency tails, abstention policy, and IDE interaction friction. Fix: Pair offline evals with online product metrics and real editor A/B tests.


Conclusion

Code completion changes part of the user's work from typing to reviewing suggestions. Useful systems gather bounded context, reuse permitted computation, suppress stale output, and measure whether retained edits justify their latency and data access.

Key takeaways

  • Latency is a product contract. In this scenario, inline context building, networking, inference, and rendering are budgeted against a 200 ms p95 objective.
  • Context is hierarchical. Prioritize the immediate cursor vicinity, then the file, then related imports. Dense retrieval on the keystroke path is usually too slow unless it's heavily optimized.
  • FIM (Fill-in-the-Middle) uses both sides of an edit. A FIM-capable model can use (Prefix, Suffix) context to generate a fitting middle span.[4]
  • Speculative decoding and KV cache prefix sharing are common optimizations for serving code models efficiently.[6][15]
  • Debouncing and cancellation reduce work that would arrive stale.
  • Privacy controls are explicit. Upload, retention, training-use, cache-isolation, and deployment policies must match the customer's data contract.

You have now designed a low-latency, context-aware code completion system. Hierarchical context construction, KV-cache prefix reuse, debouncing with cancellation, speculative decoding, and FIM training carry over to other high-frequency, low-latency inference surfaces.

You should be able to defend a measured inline-completion design that covers context selection, infill formatting, freshness gates, prefill reuse, and privacy contracts. The next capstone turns that single-product serving path into a shared multi-tenant GPU platform with isolation, budgets, and safe rollouts.

Next Step
Continue to Multi-Tenant LLM Platform

Code completion gave you the single-tenant, ultra-low-latency serving fundamentals (context management, KV reuse, speculative decoding). Now you will design the production platform that hosts many specialized models on shared GPUs and enforces strong isolation, per-tenant token budgets, cost attribution, and safe canary rollouts.

References

[1] GitHub. GitHub Copilot code suggestions in your IDE. 2026.
[2] GitHub. GitHub Copilot cloud agent. 2026.
[3] Microsoft. Language Server Protocol. 2026.
[4] Bavarian, M., et al. Efficient Training of Language Models to Fill in the Middle. arXiv preprint, 2022.
[5] Robertson, S., & Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009.
[6] Leviathan, Y., Kalman, M., & Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML 2023.
[7] vLLM Team. Speculative Decoding. vLLM Documentation, 2026.
[8] vLLM. Automatic Prefix Caching. 2026.
[9] Frantar, E., et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. ICLR 2023.
[10] Hugging Face. GPTQ. 2026.
[11] Visual Studio Code. Programmatic Language Features: Show Inline Completions. 2026.
[12] Ziegler, A., et al. Measuring GitHub Copilot's Impact on Productivity. 2024.
[13] Chen, M., et al. Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint, 2021.
[14] Qwen Team, Alibaba Group. Qwen2.5-Coder Technical Report. arXiv preprint, 2024.
[15] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
[16] Hugging Face. Text Generation Inference. 2026.