Code Completion System

Design a real-time code completion path with context construction, measured serving latency, privacy controls, and stale-result suppression.

The content moderation chapter designed a low-latency safety pipeline that decides whether content can enter the product. Code completion uses many of the same production muscles, but the pressure point changes: every keystroke can create a fresh inference request, and stale answers are worse than no answer.

A code completion system predicts useful edits under a latency budget tight enough to run while a developer is still typing. This design chapter covers context collection, ranking, serving, privacy, and feedback loops for developer tools.

In-editor code editing is latency-sensitive: the next conditional, test assertion, or API call has to appear before the developer moves to another line. Code completion tools can provide dimmed inline suggestions as a developer types, and can also offer next-edit suggestions [1: GitHub Copilot code suggestions in your IDE, https://docs.github.com/en/copilot/concepts/completions/code-suggestions]. Building that experience means balancing speed, context, and data handling. In the design scenario below, the inline path has a 200 ms p95 service-level objective (SLO); a real product must set and validate its own objective from user research and telemetry.

Figure: Code completion request path where an editor snapshot first tries a fast local lane for exact symbols, then falls back to a remote LLM lane for open-ended code, and only fresh results render as ghost text. Code completion is a freshness system: local semantic answers should win exact cases, while remote generation only runs when richer code is worth the latency.

Code completion is no longer only a popup list of symbols. The product may still use exact local lookups, but it can also generate a line or logic block from surrounding code.

Modern coding assistants now span inline completion (suggestions at the cursor), next edit suggestions (updating nearby code without moving the cursor), and full coding agents (systems that can plan and execute multi-file changes from a higher-level request).

By this point in the curriculum, you've already seen how Transformers predict the next token and how the KV cache avoids recomputing shared work. Now put those ideas into a product: a code completion system that may consider each edit event, but only sends qualified requests and only displays fresh results.

The evolution of code completion

A short history makes modern completion easier to reason about.

Capability layer	Technology	Best at
Lexical completion	Prefix indexes, keywords, and symbol tables	Local identifier and keyword completion
Semantic completion	Parsing, abstract syntax trees (ASTs), and type inference	Type-aware method suggestions
Inline generation	Code-trained Transformers	Lines and logic blocks under a latency budget
Next-edit suggestions	Recent edits plus surrounding context	Predicting a nearby follow-up edit
Coding agents	Retrieval, tool use, and execution loops	Multi-file changes, tests, and terminal actions

Static and rule-based

Early completion engines were mostly lexical. They used token scanners, prefix indexes, symbol tables, and simple scope rules. That was enough to complete local variables and imported symbols, but not enough to understand intent.

Semantic completion

IntelliSense popularized semantic completion in mainstream IDEs. Under the hood, modern semantic engines combine parsing, symbol tables, type inference, and compiler services. They know request.scope is a string field on a concrete type, rather than another token that happens to follow a dot.

Inline generation

LLM-based assistants added a generative lane. Instead of returning only symbols that a parser already knows, they can predict an entire line or block that fits surrounding code.

Agent workflows

Coding products can also combine generation with broader agent workflows such as repository research, planning, terminal commands, and multi-file edits [2: GitHub Copilot cloud agent, https://docs.github.com/en/copilot/concepts/agents/cloud-agent/about-cloud-agent]. That's a layer above plain autocomplete. Inline suggestions still depend on a narrower foundation: fast context gathering, good candidate generation, and tight latency control.

System requirements

An inline suggestion competes with the user's next keystroke. If it arrives too late, the user has already written past it. If it's irrelevant, it becomes a distraction rather than an aid.

Designing for this environment requires balancing competing priorities: providing the model with enough context to be accurate while executing the inference quickly enough to be useful. The key constraints include:

Latency: For this scenario, keep single-line inline completion under 200 ms at p95; permit a larger separately measured budget for multi-line suggestions.
Context: Must use open files, imports, function signatures, and project structure, not the current file alone.
Quality: Healthy acceptance rates plus strong accepted-and-retained character metrics. The exact target varies by language, editor UX, and how aggressively the client decides to show suggestions.
Scale: Large global developer fleets with strong diurnal traffic spikes, requiring efficient GPU utilization.
Privacy: Define what code may leave the client, what is retained, whether training use is disabled by default, and how tenants are isolated.

Meeting these requirements requires more than exposing a chat endpoint. The full pipeline, from the client-side edit listener to the inference engine, needs measurement for time-to-first-token (TTFT). TTFT measures the delay from request submission until the first response token arrives.

One practical way to reason about the budget is to split it across stages: editor + network overhead, context assembly, first-token inference, and UI rendering. The exact numbers vary by region and model size, but every stage is on the clock.

Stage	Typical budget	Notes
Client event handling + local parse	5-15ms	Capture the keystroke, cursor position, and lightweight syntax state.
Network round trip	20-70ms	Depends heavily on region and whether the request stays close to the user.
Context assembly	10-40ms	Build the prompt, gather nearby symbols, and fetch a few related files.
First-token inference	40-90ms	Usually the hardest budget to hit because prefill dominates.
UI render	5-15ms	Paint ghost text and avoid jank in the editor.

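For the scenario below, the chosen stage budgets fit under a 200 ms p95 objective. The ranges in the table do not guarantee that on their own: the upper bounds sum to 230 ms, so at least one stage has to land below the top of its range. Multi-line suggestions can have a different objective, but they still need cancellation and freshness checks.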

Use an executable budget check instead of treating a latency target as a promise. This small calculation fails the candidate path when any stage pushes total latency over the scenario objective:

latency-budget-check.py

STAGE_BUDGET_MS = {
    "client_parse": 12,
    "network": 55,
    "context": 28,
    "time_to_first_token": 78,
    "paint": 10,
}
INLINE_P95_OBJECTIVE_MS = 200

total_ms = sum(STAGE_BUDGET_MS.values())
headroom_ms = INLINE_P95_OBJECTIVE_MS - total_ms

assert total_ms <= INLINE_P95_OBJECTIVE_MS
print("scenario_p95_budget_ms:", total_ms)
print("headroom_ms:", headroom_ms)

Output

scenario_p95_budget_ms: 183
headroom_ms: 17
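
The stage table lists ranges rather than single numbers, so it can also help to check the best and worst cases. The sketch below copies the ranges from the table and compares their sums with the scenario objective; the dictionary and variable names are illustrative, not part of any real tool.

```python
# Stage ranges copied from the budget table above (milliseconds).
STAGE_RANGE_MS = {
    "client_parse": (5, 15),
    "network": (20, 70),
    "context": (10, 40),
    "time_to_first_token": (40, 90),
    "paint": (5, 15),
}
INLINE_P95_OBJECTIVE_MS = 200

# Sum the low ends and the high ends separately.
best_case_ms = sum(low for low, _ in STAGE_RANGE_MS.values())
worst_case_ms = sum(high for _, high in STAGE_RANGE_MS.values())
overshoot_ms = worst_case_ms - INLINE_P95_OBJECTIVE_MS

print("best_case_ms:", best_case_ms)
print("worst_case_ms:", worst_case_ms)
print("worst_case_overshoot_ms:", overshoot_ms)
```

With these ranges the worst case is 230 ms, which is 30 ms over the objective. That is why the scenario pins each stage to a specific budget instead of accepting any value inside its range.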

Architecture

The system has three main components: the IDE (Integrated Development Environment) extension (client), the API gateway (orchestration), and the inference engine (LLM).

The IDE Extension captures keystrokes and manages the local state (open tabs, cursor position). A Context Engine runs locally or on the gateway to select the most relevant code snippets to fit in the prompt window. The Language Server Protocol (LSP) is the standard interface between an editor and a language server, so the same semantic engine can power completion, go-to-definition, and diagnostics across multiple editors [3: Language Server Protocol, https://microsoft.github.io/language-server-protocol/].

In practice, deterministic semantic completion and LLM completion often run side by side. The language server produces symbol-aware candidates with exact type information, while the LLM produces longer infill candidates. The client can merge, rank, or gate these results depending on cursor position and confidence.

That hybrid design matters. For exact member completion after a dot, semantic candidates from the language server are often both faster and more reliable than free-form generation. The LLM earns its keep on longer spans, comments-to-code, and cases where the user intent isn't fully captured by the type system.
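
One way to picture the gate is as a small client-side function. This is a minimal sketch under stated assumptions: the `Candidate` type, the `"lsp"` and `"llm"` source labels, and the 0.5 confidence threshold are all hypothetical, and a real client would weigh more signals than the character before the cursor.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    source: str        # "lsp" or "llm" (labels assumed for this sketch)
    confidence: float  # 0.0 to 1.0, however the producer defines it


def choose_candidates(
    char_before_cursor: str,
    lsp: list[Candidate],
    llm: list[Candidate],
    min_llm_confidence: float = 0.5,
) -> list[Candidate]:
    # After a dot, exact member candidates from the language server win.
    if char_before_cursor == "." and lsp:
        return lsp
    # Elsewhere, show confident LLM candidates and fall back to LSP results.
    confident = [c for c in llm if c.confidence >= min_llm_confidence]
    return confident or lsp


lsp_candidates = [Candidate("scope", "lsp", 1.0)]
llm_candidates = [Candidate("scope in policy.allowed_scopes", "llm", 0.7)]

after_dot = choose_candidates(".", lsp_candidates, llm_candidates)
open_ended = choose_candidates(" ", lsp_candidates, llm_candidates)
print([c.source for c in after_dot])
print([c.source for c in open_ended])
```

The first call returns the language-server candidate and the second returns the LLM candidate, which mirrors the split described above.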

The Inference Server hosts the LLM and handles the heavy compute, using techniques like continuous batching to serve thousands of users simultaneously.

Local fallback lane

Not every keystroke should go through the full remote generation path. A production client usually keeps a deterministic lane for the cheap, high-confidence cases:

Local symbols and imported APIs: Exact member completion from the parser or language server.
Prefix indexes or tries: Fast keyword and snippet lookup with predictable latency.
Fuzzy matching: Edit-distance recovery for small typos like pritn instead of print.

That local lane does two jobs: sub-50ms suggestions for exact matches, and a graceful fallback when the network is slow, the model abstains, or the user is working in a restricted environment.
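
The prefix and fuzzy lookups in that lane can be sketched with the standard library alone. The vocabulary, function names, and the distance limit of 2 are assumptions for illustration. A sorted list with binary search stands in for a trie here, and the edit distance is plain Levenshtein, so a swapped pair of letters counts as two edits.

```python
from bisect import bisect_left


def prefix_matches(sorted_words: list[str], prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` words that start with `prefix` from a sorted list."""
    matches = []
    for word in sorted_words[bisect_left(sorted_words, prefix):]:
        if not word.startswith(prefix) or len(matches) == limit:
            break
        matches.append(word)
    return matches


def edit_distance(left: str, right: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            substitution = previous[j - 1] + (left_char != right_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def fuzzy_matches(words: list[str], typed: str, max_distance: int = 2) -> list[str]:
    """Return words within `max_distance` edits of `typed`, closest first."""
    scored = sorted((edit_distance(typed, word), word) for word in words)
    return [word for distance, word in scored if distance <= max_distance]


VOCABULARY = sorted(["print", "property", "process_scope", "range", "return"])

print(prefix_matches(VOCABULARY, "pr"))
print(fuzzy_matches(VOCABULARY, "pritn"))
```

Both lookups run without a network call, which is what keeps this lane predictable when the remote path is slow or unavailable.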

By splitting the responsibilities across these layers, the architecture minimizes the volume of data sent over the network. The client handles lightweight heuristics and syntax parsing, so only highly qualified prompts reach the backend, where expensive GPU resources are dedicated strictly to token generation. The architecture visual above traces that path from the client to the inference server and back.

Context gathering and management

The model needs enough context to make useful suggestions, but the live prompt budget is deliberately kept small because long prefills destroy latency. Even if the base model advertises a much larger context window, you can't stuff the whole repository into the keystroke path.

Context priority strategy

Layer context sources based on their immediate relevance to the cursor position. Since an entire codebase won't fit into the model's context window, rank information strictly by its likelihood of influencing the next few tokens. The context engine builds the prompt dynamically, filling the available budget (e.g., 8k tokens) from the top priority down until space is exhausted:

Priority	Source	Method	Rationale
1 (Highest)	Code before cursor	Direct prefix	Immediate grammatical context.
2	Code after cursor (suffix)	FIM Suffix	Needed to close brackets, match types.
3	Imports & Definitions	Static Analysis	Types and functions used in the file.
4	Recently Edited Files	Temporal locality	Code you just touched is likely relevant.
5	Neighboring Files	Jaccard Similarity	Files that share imports with current file.
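
The fill-from-the-top rule can be sketched as a greedy packer. The token counts below are invented for illustration, and the function name is hypothetical. This version skips a source that does not fit and keeps going, which is one possible policy; a real context engine might truncate the source instead.

```python
def pack_context(sources: list[tuple[int, str, int]], budget_tokens: int) -> list[str]:
    """Keep sources in priority order while they fit the remaining budget.

    Each source is (priority, name, token_count); priority 1 is highest.
    """
    selected = []
    remaining = budget_tokens
    for _, name, token_count in sorted(sources):
        if token_count <= remaining:
            selected.append(name)
            remaining -= token_count
    return selected


# Hypothetical token counts for the five layers in the table.
sources = [
    (1, "prefix", 3000),
    (2, "suffix", 1500),
    (3, "imports_and_definitions", 2000),
    (4, "recently_edited_files", 2500),
    (5, "neighboring_files", 1200),
]

selected = pack_context(sources, budget_tokens=8000)
print("selected_sources:", selected)
```

Here the recently edited files do not fit after the first three layers, so the packer drops them and still has room for the smaller neighboring-file snippet.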

Fill-in-the-middle (FIM)

Standard causal language models predict the next token based only on the past (left-to-right). In coding, you often insert code in the middle of a file. If the model ignores the suffix (the code after the cursor), it might generate valid code that conflicts with the closing braces or logic below.

Analogy: Standard causal modeling is typing on a typewriter: you can only add to the end of the page. FIM (Fill-in-the-Middle) is a modern word processor: you can insert text in the middle of a sentence, and the system uses the surrounding context (both left and right) so the insertion fits correctly.

FIM reorders the prompt so the model sees the suffix before generating the middle. Marker strings differ by tokenizer and model; the notation below names their roles rather than defining a universal API:

$P_{\text{FIM}} = \text{<PRE>} \cdot x_{\text{prefix}} \cdot \text{<SUF>} \cdot x_{\text{suffix}} \cdot \text{<MID>}$

Figure: Fill-in-the-middle prompt construction where code before cursor becomes prefix, code after cursor becomes suffix, and model generates middle. FIM prompt construction gives the model both sides of the edit so generated code fits the surrounding function, tests, and closing syntax.

Building the FIM prompt

Build the input in FIM order: prefix marker + code before cursor + suffix marker + code after cursor + middle marker. Then the model generates the missing middle section.

Use a concrete example: you're editing an authorization helper and your cursor sits inside an empty function body.

Prefix (code before cursor)

prefix-code-before-cursor.py

from authz.policies import PolicyGraph

def validate_token_scope(request):
    """Check whether the requested API scope is allowed."""
    is_allowed =

Suffix (code after cursor)

suffix-code-after-cursor.py

    return is_allowed

FIM prompt sent to the model

text

<PRE>from authz.policies import PolicyGraph

def validate_token_scope(request):
    """Check whether the requested API scope is allowed."""
    is_allowed = <SUF>
    return is_allowed
<MID>

The model now generates the middle section, using both the docstring above and the return is_allowed below to infer that it should write a policy-scope check, not an unrelated parser. Without the suffix, it could generate code that never produces is_allowed, leaving the following line broken.

Before wiring up a real tokenizer, test the transformation itself. A serving adapter would replace these readable markers with the exact sentinel tokens required by its selected FIM-capable model.

format-fim-request.py

def format_fim(prefix: str, suffix: str) -> str:
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

prefix = "def validate_token_scope(request):\n    is_allowed = "
suffix = "\n    return is_allowed\n"
prompt = format_fim(prefix, suffix)

assert prompt.endswith("<MID>")
assert prompt.index("<SUF>") < prompt.index("return is_allowed")
print(prompt.replace("\n", "\\n"))

Output

<PRE>def validate_token_scope(request):\n    is_allowed = <SUF>\n    return is_allowed\n<MID>

FIM gives a decoder-only model suffix information without changing the left-to-right decoding mechanism. Bavarian et al. found that transformed training data can add infilling ability while preserving ordinary generation performance in their experiments [4: Efficient Training of Language Models to Fill in the Middle, https://arxiv.org/abs/2207.14255].

Repository-level context

Large repos need more than simple file buffering. Two lightweight retrieval methods run fast enough to stay inside the keystroke budget.

Jaccard Similarity: Divide the number of unique tokens (variable names, imports) shared by the current file and another open file by the number of unique tokens across both. High overlap means high relevance. For example, if the active file scope_validator.py imports PolicyGraph, ScopeRule, and TokenClaims, and scope_rules.py uses two of those three names plus one name of its own, the intersection has 2 names and the union has 4, so its Jaccard score is 2/4 = 0.5 (high enough to pull in a few symbol definitions from it).
BM25 / Sparse Retrieval: A lightweight keyword search over the local repo index to find defining files for classes used in the current buffer [5: The Probabilistic Relevance Framework: BM25 and Beyond, https://doi.org/10.1561/1500000019].

Dense vector retrieval is often kept off the hottest keystroke path unless it's cached or precomputed. Once you include lookup, reranking, and prompt assembly, sparse search or heuristic graph traversal can be easier to keep inside a small budget. Pre-build a small symbol graph at editor startup rather than scanning a repository on every edit.

This tiny selector makes the overlap heuristic concrete. It keeps only the definition-bearing files whose symbols overlap the active file enough to clear a threshold; unrelated code is excluded from the live prompt:

select-nearby-context.py

active_symbols = {"PolicyGraph", "ScopeRule", "TokenClaims"}
candidate_files = {
    "scope_rules.py": {"ScopeRule", "TokenClaims", "AuditEvent"},
    "metrics.py": {"Counter", "Histogram"},
    "policy_graph.py": {"PolicyGraph", "Node"},
}

def jaccard(left: set[str], right: set[str]) -> float:
    return len(left & right) / len(left | right)

ranked = sorted(
    ((jaccard(active_symbols, symbols), path) for path, symbols in candidate_files.items()),
    reverse=True,
)
selected = [path for score, path in ranked if score >= 0.25]

assert selected == ["scope_rules.py", "policy_graph.py"]
print("selected_context:", selected)

Output

selected_context: ['scope_rules.py', 'policy_graph.py']
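
The sparse-retrieval option can be sketched the same way. This is a from-scratch Okapi BM25 scorer over tiny hand-tokenized files, not a real index: the token lists are invented, `k1=1.5` and `b=0.75` are common defaults rather than tuned values, and the IDF uses the non-negative variant with `+ 1` inside the logarithm.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str],
    documents: dict[str, list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> dict[str, float]:
    """Score each document against the query terms with Okapi BM25."""
    doc_count = len(documents)
    avg_len = sum(len(tokens) for tokens in documents.values()) / doc_count
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in documents.values() for term in set(tokens))

    scores = {}
    for path, tokens in documents.items():
        term_freq = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((doc_count - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[path] = score
    return scores


# Hypothetical pre-tokenized files from a local repo index.
documents = {
    "policy_graph.py": ["class", "PolicyGraph", "def", "allows", "scope"],
    "scope_rules.py": ["class", "ScopeRule", "def", "matches", "scope"],
    "metrics.py": ["class", "Counter", "def", "inc"],
}

scores = bm25_scores(["PolicyGraph", "allows"], documents)
best_match = max(scores, key=scores.get)
print("best_match:", best_match)
```

Only policy_graph.py contains the query terms, so it is the one file with a non-zero score. In a real editor the index would be built ahead of time so the keystroke path only pays for the lookup.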

Model serving for low latency

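Serving code models requires optimizing Time-To-First-Token (TTFT) and the decode steps that follow it. The techniques below target different parts of that path: speculative decoding shortens decode, prefix caching removes repeated prefill, and quantization reduces memory traffic.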

Speculative decoding

Speculative decoding reduces latency by using a smaller, faster "draft" model to predict the next $K$ tokens, which are then verified in parallel by a larger, more accurate "target" model [6: Fast Inference from Transformers via Speculative Decoding, https://arxiv.org/abs/2211.17192]. Modern serving stacks expose this as an operational feature with workload-specific caveats, so it still needs acceptance-rate measurement before rollout [7: Speculative Decoding, https://docs.vllm.ai/en/latest/features/speculative_decoding/].

Speculative decoding pairs a fast draft with a stricter verifier. The draft model proposes a few tokens, and the target model verifies them in parallel. If the target agrees, the client can stream several tokens after one target pass. If the draft misses, the target supplies the correction.

Figure: Speculative decoding path with draft tokens, verification, acceptance rate, and serving decision. Speculative decoding is an acceptance-rate bet. Good drafts buy speed; low-acceptance drafts only add work.

The process works in three stages:

Draft model generation

The smaller draft model cheaply predicts candidate tokens. For example, it might generate def parse_config_value(source): based on the surrounding class and import context.

Target model verification

The larger target model runs one forward pass on the sequence and verifies the draft tokens.

Acceptance or correction

If the target model agrees with a prefix of the draft, the server emits that verified prefix without separate target decode steps for each accepted token. If there's a mismatch, generation proceeds from the target model's corrected token.

Code can contain repetitive spans such as standard imports and boilerplate loops, which may make useful draft acceptance easier to achieve. Measure that hypothesis by language and suggestion type.
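
The three stages can be sketched with a toy verifier. This is a simplified greedy variant that accepts a draft token only on an exact match, not the probabilistic acceptance rule from the speculative decoding paper. `toy_target_next` is a hypothetical stand-in for the target model, and the loop only simulates what a real server computes in one batched forward pass.

```python
from typing import Callable

# Hypothetical "ground truth" that the toy target model always produces.
TARGET_SEQUENCE = ["def", "parse_config_value", "(", "source", ")", ":"]


def toy_target_next(tokens: list[str]) -> str:
    """Stand-in for the target model's next-token choice."""
    return TARGET_SEQUENCE[len(tokens)]


def speculative_step(
    prefix: list[str],
    draft_tokens: list[str],
    target_next: Callable[[list[str]], str],
) -> list[str]:
    """Return the tokens emitted by one draft-then-verify step."""
    emitted: list[str] = []
    for token in draft_tokens:
        expected = target_next(prefix + emitted)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest.
            emitted.append(expected)
            return emitted
        emitted.append(token)
    # Every draft token matched, so the pass also yields one target token.
    emitted.append(target_next(prefix + emitted))
    return emitted


good_draft = speculative_step([], ["def", "parse_config_value", "("], toy_target_next)
bad_draft = speculative_step([], ["def", "parse_config_value", "["], toy_target_next)
print("good_draft_emits:", good_draft)
print("bad_draft_emits:", bad_draft)
```

The good draft emits four tokens from one verification step. The bad draft still emits three, because the target's correction replaces the mismatched token.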

A concrete latency example

Suppose the draft model is 5x faster than the target model because it has fewer layers and smaller weights. The draft model guesses 5 tokens ahead. The target model verifies all 5 in a single parallel pass:

If 4 out of 5 tokens match: You produced 4 tokens from one target-model verification pass plus the cheap draft work (5 draft steps cost about one target step at 5x). That's roughly 4 useful tokens for the price of about 2 target steps, near a 2x speedup, and the exact gain depends on draft cost and acceptance length.
If only 1 out of 5 matches: You accept that single token and discard the rest. Net speedup is small or even slightly negative once draft overhead is counted, but you still make progress without a full target step per token.
Worst case: The draft model misses the very first token. The target model outputs its own token, so that step yields one token, as ordinary decoding would, but you have also paid for the discarded draft work.
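
The arithmetic in these cases can be written down as a rough cost model. It assumes one verification pass costs one target step and each draft step costs one fifth of that, as in the example. The function name is hypothetical. The prose counts only accepted draft tokens; the `count_target_token` flag shows the alternative accounting in which the target's own token from the verification pass also counts as output.

```python
def tokens_per_target_step(
    accepted: int,
    draft_tokens: int = 5,
    draft_speedup: float = 5.0,
    count_target_token: bool = False,
) -> float:
    """Tokens produced per target-step-equivalent of compute.

    Ordinary decoding produces 1.0 on this scale.
    """
    # One verification pass plus the draft work, in target-step units.
    cost = 1.0 + draft_tokens / draft_speedup
    produced = accepted + (1 if count_target_token else 0)
    return produced / cost


for accepted in (4, 1, 0):
    accepted_only = tokens_per_target_step(accepted)
    with_target = tokens_per_target_step(accepted, count_target_token=True)
    print(f"accepted={accepted} accepted_only={accepted_only} with_target_token={with_target}")
```

With four accepted tokens the model gives 2.0 under the prose's accounting and 2.5 when the target's token is counted. With one accepted token the second accounting breaks even at 1.0, and with none it drops to 0.5, which is the overhead a rollout gate needs to catch.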

Repetitive code such as import blocks or standard exception handling may give a draft model useful acceptance rates. Treat that as a rollout hypothesis, not a property of all code workloads: measure accepted draft length and end-to-end latency by language and suggestion type before enabling it broadly [6: Fast Inference from Transformers via Speculative Decoding, https://arxiv.org/abs/2211.17192].

The rollout gate can be expressed as a small policy. Here, Python boilerplate improves latency enough to enable speculation, while a low-acceptance configuration remains on ordinary decoding:

gate-speculative-decoding.py

measurements = {
    "python_imports": {"mean_accepted_tokens": 3.8, "baseline_p95_ms": 178, "spec_p95_ms": 136},
    "sql_queries": {"mean_accepted_tokens": 1.1, "baseline_p95_ms": 169, "spec_p95_ms": 176},
}

def enable_speculation(sample: dict[str, float]) -> bool:
    latency_gain_ms = sample["baseline_p95_ms"] - sample["spec_p95_ms"]
    return sample["mean_accepted_tokens"] >= 2.0 and latency_gain_ms >= 10

enabled = [name for name, sample in measurements.items() if enable_speculation(sample)]
assert enabled == ["python_imports"]
print("speculative_decode_enabled_for:", enabled)

Output

speculative_decode_enabled_for: ['python_imports']

KV cache reuse (prefix caching)

Developers often type, pause, and type again in the same file. The file's header (imports, class definitions, and previously written functions) remains constant across these rapid sequential interactions.

Instead of prefilling the same stable file header on every request, the server can cache Key-Value (KV) blocks for a shared prefix. When the next prompt begins with the same cacheable blocks under the same tenant and model policy, it reuses those blocks and prefills only the uncached delta. Automatic prefix caching does not make generation of new output tokens cheaper; it removes duplicate prefill work for reused input context [8: Automatic Prefix Caching, https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/].

Why this matters in practice

Next request	Reusable prefix work	Remaining work
Same stable header, a few characters appended	Skip prefill for matching cached blocks	Prefill new input delta, then decode suggestion tokens
Edit near the top of the file	Only blocks before the changed point can match	Prefill from first changed block onward, then decode
Different tenant, model, tokenizer, or cache policy	No permitted reuse	Full prompt prefill, then decode

The savings therefore depend on stable prompt prefixes, block matching, routing affinity, and isolation rules. A long stable header with a small cursor-adjacent delta is a good candidate. A request that changed early context or crosses a tenant boundary is a miss by design.
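
Block matching can be sketched with chained hashes. This is an illustration, not any engine's actual scheme: the block size of 4, the scope string, and the token IDs are assumptions. Each block's hash covers the scope and every earlier block, so a change near the top invalidates everything after it, and a different tenant never matches.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; real engines choose their own size


def block_hashes(scope: str, token_ids: list[int]) -> list[str]:
    """Hash each full block together with the scope and all earlier blocks."""
    hashes = []
    running = hashlib.sha256(scope.encode())
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update(",".join(map(str, block)).encode() + b"|")
        hashes.append(running.hexdigest())
    return hashes


def reusable_blocks(cached: list[str], new: list[str]) -> int:
    """Count leading blocks whose chained hashes match."""
    count = 0
    for cached_hash, new_hash in zip(cached, new):
        if cached_hash != new_hash:
            break
        count += 1
    return count


scope = "acme-devtools/code-fim-v3/tok-v3"  # hypothetical tenant/model/tokenizer
cached = block_hashes(scope, list(range(12)))

appended = block_hashes(scope, list(range(12)) + [99, 100])
early_edit = block_hashes(scope, [0, 1, 2, 3, 40, 5, 6, 7, 8, 9, 10, 11])
other_tenant = block_hashes("contoso-tools/code-fim-v3/tok-v3", list(range(12)))

print("appended_reuse:", reusable_blocks(cached, appended))
print("early_edit_reuse:", reusable_blocks(cached, early_edit))
print("other_tenant_reuse:", reusable_blocks(cached, other_tenant))
```

The three results line up with the table rows: appending keeps all three cached blocks, an edit in the second block keeps only the first, and a different tenant reuses nothing.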

Figure: Prefix cache reuse for code completion where one stable file header is prefilled once, then a nearby edit either lands on the same shard and reuses that prefix or lands elsewhere and pays full prefill again. Prefix caches help only when routing preserves locality. Nearby edits still miss if shard affinity or cache keys don't line up.

Use a cache key that enforces isolation as well as affinity. The example reuses a stable header for the same tenant and model, but never treats another tenant's identical text as a hit:

prefix-cache-contract.py

def cache_key(tenant: str, model: str, tokenizer: str, prefix: str) -> tuple[str, str, str, str]:
    return tenant, model, tokenizer, prefix

stable_prefix = "from authz.policies import PolicyGraph\n"
cached = {
    cache_key("acme-devtools", "code-fim-v3", "tok-v3", stable_prefix): "kv-block-91",
}

same_scope = cache_key("acme-devtools", "code-fim-v3", "tok-v3", stable_prefix)
other_tenant = cache_key("contoso-tools", "code-fim-v3", "tok-v3", stable_prefix)

assert cached.get(same_scope) == "kv-block-91"
assert cached.get(other_tenant) is None
print("same_scope_hit:", same_scope in cached)
print("cross_tenant_hit:", other_tenant in cached)

Output

same_scope_hit: True
cross_tenant_hit: False

This can eliminate repeated prefill work on sequential edits when the prefix stays stable and the reuse boundary is valid.

Quantization

Code models may tolerate post-training quantization, but you still have to measure acceptance and retained-edit metrics after compression. Relative to FP16 weights, INT8 and INT4 weight storage can reduce model-memory traffic by about 2x and 4x before format overhead. That can materially help completion workloads [9: GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers, https://arxiv.org/abs/2210.17323] [10: GPTQ, https://huggingface.co/docs/transformers/main/en/quantization/gptq].

This reduction matters for deployment economics. Token-by-token decode often puts pressure on memory bandwidth, so weight quantization can improve throughput and sometimes latency. TTFT can still be dominated by queueing, context length, and prefill work. Profile prefill and decode separately before claiming a win. In practice, weight compression can make the difference between serving a mid-sized code model on one 24 GB to 48 GB GPU versus needing multi-GPU model parallelism, depending on the KV-cache budget and context length.
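
A quick calculation shows the scale of the weight savings. The 7-billion-parameter count is a hypothetical stand-in for a mid-sized code model, and the result covers weights only: it leaves out quantization scales and zero points, the KV cache, and activations, all of which add to real memory use.

```python
HYPOTHETICAL_PARAMS = 7_000_000_000  # assumed model size for illustration
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_gb(params: int, bits: int) -> float:
    """Weight storage in decimal gigabytes, ignoring format overhead."""
    return params * bits / 8 / 1e9


weights_gb = {name: weight_gb(HYPOTHETICAL_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}
reduction = {name: weights_gb["fp16"] / size for name, size in weights_gb.items()}

for name in BITS_PER_WEIGHT:
    print(f"{name}: {weights_gb[name]} GB weights, {reduction[name]}x smaller than fp16")
```

Under these assumptions the weights shrink from 14 GB to 7 GB and 3.5 GB. Whether the model then fits on one GPU still depends on how much memory the KV cache needs at the chosen context length and batch size.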

Techniques like GPTQ, a post-training quantization method for generative pre-trained transformers, allow this compression to happen after training. This means you can take off-the-shelf open weights and compress them for your specific hardware limits without needing to run an expensive fine-tuning pass, although GPTQ does use a small calibration dataset.

Request lifecycle and user experience

The client (IDE) needs request control. Sending remote work for every edit event increases cancelled work and can exhaust serving capacity during bursts, which pushes latency past the objective. Instead, the client decides when to request, wait, or cancel based on the current editor state.

Debouncing and cancellation

We implement a dynamic debounce strategy:

0ms delay on "trigger characters" (e.g., ., (, \n).
150ms delay on standard typing.

The client-side RequestManager is the gatekeeper, deciding when to hit the server and when to wait. It takes individual keystrokes as input and either triggers an immediate API request or schedules a delayed one based on the character type. The output is a controlled stream of API requests to the server, which limits wasted remote work. on_type runs on every keystroke and first cancels any pending request. Trigger characters (., (, \n) then fire immediately, while ordinary characters schedule a new request with a 150ms delay. Each request also gets a monotonically increasing ID so late responses can be ignored safely:

debouncing-and-cancellation.py

import asyncio

class RequestManager:
    """Manages debounce, cancellation, and stale-response suppression."""

    def __init__(self):
        self.pending_task: asyncio.Task[None] | None = None
        self.latest_request_id = 0
        self.trigger_chars = {'.', '(', '\n'}
        self.sent_requests: list[int] = []
        self.shown_suggestions: list[str] = []

    async def on_type(self, char: str) -> None:
        if self.pending_task and not self.pending_task.done():
            self.pending_task.cancel()

        delay_sec = 0.0 if char in self.trigger_chars else 0.15
        self.latest_request_id += 1
        request_id = self.latest_request_id
        self.pending_task = asyncio.create_task(
            self.debounce_fetch(delay_sec, request_id)
        )

    async def debounce_fetch(self, delay_sec: float, request_id: int) -> None:
        try:
            await asyncio.sleep(delay_sec)
            suggestion = await self.fetch_completion(request_id)
            self.handle_response(request_id, suggestion)
        except asyncio.CancelledError:
            pass

    async def fetch_completion(self, request_id: int) -> str:
        # Real clients also attach request_id to an abortable HTTP request.
        self.sent_requests.append(request_id)
        return f"completion-{request_id}"

    def handle_response(self, request_id: int, suggestion: str) -> None:
        if request_id == self.latest_request_id:
            self.show_ghost_text(suggestion)

    def show_ghost_text(self, suggestion: str) -> None:
        self.shown_suggestions.append(suggestion)

async def demo() -> None:
    manager = RequestManager()
    await manager.on_type('r')
    await asyncio.sleep(0.05)
    await manager.on_type('o')
    await asyncio.sleep(0.05)
    await manager.on_type('.')

    if manager.pending_task:
        await manager.pending_task

    manager.handle_response(2, "late-completion-2")
    print("sent_requests:", manager.sent_requests)
    print("shown_suggestions:", manager.shown_suggestions)

asyncio.run(demo())

Output

sent_requests: [3]
shown_suggestions: ['completion-3']

In a production editor, you usually combine both layers: abort the HTTP request when possible, and still guard UI updates with request IDs in case the server races or ignores the cancellation.

Key features of modern systems

Modern coding systems expose features that go far beyond predicting the next word:

Ghost text: Inline suggestions that appear in gray directly in the editor as you type. If the user types a character matching the suggestion prefix, the client can keep the remaining text visible. An accept action such as pressing Tab inserts the remaining suggestion [11: Programmatic Language Features: Show Inline Completions, https://code.visualstudio.com/api/language-extensions/programmatic-language-features].
Natural language to code: Writing a comment like // validate token scopes against the policy graph and having the code appear automatically.
Repo-wide reasoning: The ability to ask "Where is the authentication logic handled?" and get a code suggestion that fits that specific architecture.
Unit test generation: Automatically writing tests for the code it just helped you create.
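
The ghost-text behavior in the first item can be sketched as a prefix check. The function name is hypothetical and this is not an editor API; it only shows the rule of keeping the tail while the typed text still matches and dismissing the suggestion otherwise.

```python
from typing import Optional


def remaining_ghost_text(suggestion: str, typed_since_shown: str) -> Optional[str]:
    """Return the still-valid tail of a suggestion, or None to dismiss it."""
    if suggestion.startswith(typed_since_shown):
        return suggestion[len(typed_since_shown):]
    return None


suggestion = "graph.allows(request.scope)"
print(remaining_ghost_text(suggestion, "gra"))
print(remaining_ghost_text(suggestion, "grx"))
```

The first call keeps `ph.allows(request.scope)` on screen. The second returns `None`, which tells the client to clear the suggestion and wait for a fresh one.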

Evaluation metrics

How do we know if the system is good? Unlike typical conversational LLMs where quality is subjective and hard to measure automatically, code completion provides immediate, objective feedback: the user either accepts the suggestion or they don't. However, relying solely on simple acceptance isn't enough to capture the true value provided by the system.

A useful evaluation framework measures both the frequency of helpful suggestions and the work they retain, while checking latency objectives. Tracking these metrics across languages, suggestion classes, and traffic periods can reveal model regressions or infrastructure bottlenecks hidden by aggregates.

Figure: Completion quality funnel where shown suggestions narrow into accepted suggestions, then into retained work, while latency and freshness gates decide whether those wins count. Raw acceptance is easy to game. Good code completion still has to be shown, kept, and delivered before it goes stale.

Metric	Definition	Why it matters
Acceptance Rate	% of shown suggestions inserted by the user.	GitHub's published study reported a 27% acceptance rate in its sample and found acceptance rate best predicted perceived productivity among its usage measurements. It's still gameable with tiny safe suggestions, so pair it with value metrics [12: Measuring GitHub Copilot's Impact on Productivity, https://cacm.acm.org/research/measuring-github-copilots-impact-on-productivity/].
Accepted-and-retained characters	Characters from accepted suggestions that still remain after a chosen observation window.	Captures value that survives editing, not the click alone. The GitHub study measured unchanged and mostly unchanged completion persistence at several time windows, reinforcing why retention complements acceptance [12: Measuring GitHub Copilot's Impact on Productivity, https://cacm.acm.org/research/measuring-github-copilots-impact-on-productivity/].
Completion Shown Rate	% of eligible requests that surface a suggestion.	A model that abstains too often won't feel helpful even if the few suggestions it shows are accurate.
Latency P99	99th percentile (P99) response time.	Slow suggestions break flow. The tail matters as much as the median, so track P99 alongside the scenario's 200 ms p95 objective.
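
Tail latency is worth slicing before it is trusted. The sketch below uses invented samples and a nearest-rank percentile to show how an aggregate p95 can pass the scenario's 200 ms objective while one language slice fails it. The function and variable names are illustrative.

```python
import math


def nearest_rank_percentile(samples_ms: list[int], percentile: int) -> int:
    """Nearest-rank percentile: the smallest sample covering `percentile` percent."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


OBJECTIVE_P95_MS = 200

# Hypothetical end-to-end latencies per language slice.
latency_ms_by_language = {
    "python": [90, 110, 120, 130, 140, 150, 160, 170, 180, 190],
    "sql": [95, 105, 115, 125, 135, 145, 155, 165, 175, 320],
}

all_samples = [ms for samples in latency_ms_by_language.values() for ms in samples]
aggregate_p95 = nearest_rank_percentile(all_samples, 95)
slice_p95 = {
    language: nearest_rank_percentile(samples, 95)
    for language, samples in latency_ms_by_language.items()
}
over_objective = sorted(lang for lang, p95 in slice_p95.items() if p95 > OBJECTIVE_P95_MS)

print("aggregate_p95_ms:", aggregate_p95)
print("slice_p95_ms:", slice_p95)
print("slices_over_objective:", over_objective)
```

The aggregate p95 is 190 ms, but the sql slice sits at 320 ms. With only ten samples per slice these numbers are illustrative; production percentiles need far more data per slice.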

Offline evaluation on benchmarks like HumanEval [13: Evaluating Large Language Models Trained on Code (HumanEval), https://arxiv.org/abs/2107.03374] is useful for catching model regressions, but it doesn't measure editor timing, shown-rate policy, or accepted edits that users later undo. An online experiment is needed to determine whether a model or context heuristic improves the product experience.

Optimizing solely for acceptance rate can lead to a model that only suggests short, obvious tokens like closing parens because they are safe. You need value metrics such as retained characters, not click-through alone.

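The metric computation should retain that distinction. In this example, three of four shown suggestions are accepted, but one is deleted entirely and another is only a four-character fragment, so a 75% acceptance rate hides that roughly 61% of accepted characters survive: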

measure-retained-completions.py

suggestions = [
    {"shown": True, "accepted_chars": 18, "retained_chars": 0},
    {"shown": True, "accepted_chars": 4, "retained_chars": 4},
    {"shown": True, "accepted_chars": 42, "retained_chars": 35},
    {"shown": True, "accepted_chars": 0, "retained_chars": 0},
]

shown = len(suggestions)
accepted = sum(item["accepted_chars"] > 0 for item in suggestions)
accepted_chars = sum(item["accepted_chars"] for item in suggestions)
retained_chars = sum(item["retained_chars"] for item in suggestions)

acceptance_rate = accepted / shown
retained_char_rate = retained_chars / accepted_chars
assert acceptance_rate == 0.75
assert round(retained_char_rate, 3) == 0.609
print("acceptance_rate:", f"{acceptance_rate:.0%}")
print("retained_char_rate:", f"{retained_char_rate:.1%}")

Output

acceptance_rate: 75%
retained_char_rate: 60.9%
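
The same funnel can be split by segment to show what an aggregate hides. The sketch below is illustrative only: the event tuples, the language names, and the nearest-rank percentile helper are assumptions, not part of any real telemetry schema. It reports p95 to match the scenario objective; the same helper works for p99 once enough samples exist.

```python
import math
from collections import defaultdict

# Hypothetical telemetry rows: (language, shown, accepted, latency_ms).
EVENTS = [
    ("python", True, True, 120),
    ("python", True, True, 140),
    ("python", True, False, 150),
    ("python", False, False, 90),
    ("rust", True, False, 180),
    ("rust", True, False, 420),
    ("rust", True, True, 210),
    ("rust", False, False, 160),
]

def nearest_rank(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(rows: list[tuple[str, bool, bool, int]]) -> dict[str, float]:
    """Shown rate, acceptance rate among shown suggestions, and p95 latency."""
    shown = [row for row in rows if row[1]]
    return {
        "shown_rate": len(shown) / len(rows),
        "acceptance_rate": sum(row[2] for row in shown) / len(shown) if shown else 0.0,
        "p95_ms": nearest_rank([row[3] for row in rows], 95),
    }

# Group events by language so each segment gets its own funnel.
by_language: dict[str, list] = defaultdict(list)
for row in EVENTS:
    by_language[row[0]].append(row)

overall = summarize(EVENTS)
segments = {lang: summarize(rows) for lang, rows in by_language.items()}

print("overall:", overall)
for lang, stats in segments.items():
    print(lang, stats)
```

With this toy data the overall acceptance rate is 50%, but it splits into about 67% for one language and 33% for the other, and the slow tail sits entirely in the weaker segment.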

Model architecture choices

Choosing the right model for code completion involves a trade-off between quality and latency. A more capable model may produce stronger multi-line candidates but arrive after the user has moved on. A fast model that emits incorrect syntax or APIs also fails the product objective.

The usual production pattern is to keep the keystroke path on a smaller, specialized model and reserve slower, more capable models for bigger edits or chat-style tasks.

Small vs. large models

In practice, those roles often split into inline-completion models tuned for FIM and low time to first token (TTFT), and more capable general code models for harder multi-file work. Evaluate the split rather than assuming model size alone predicts usefulness. Useful techniques include:

Code-specific pre-training and fine-tuning: Models such as Qwen2.5-Coder^{[14]} show how code-focused pre-training plus completion-specific tuning can make smaller models punch above their size on code-generation benchmarks.
Speculative decoding: Pair a small draft model with a larger target model so the target verifies several drafted tokens in one forward pass.
Distillation: Train small models on outputs from larger models, transferring capability at lower inference cost.

Fill-in-the-middle training details

FIM-capable models require a specific training objective. During pre-training, a subset of training examples undergoes a transformation:

Choose an infill span by splitting the sequence at two boundaries into (prefix, middle, suffix).
Reorder to (prefix, suffix, middle) with special sentinel tokens between segments.
Train the model to predict the middle segment given (prefix, suffix).

The two dominant FIM formats are:

PSM (Prefix-Suffix-Middle): <PRE>prefix<SUF>suffix<MID>middle. This is the most common format.
SPM (Suffix-Prefix-Middle): <SUF>suffix<PRE>prefix<MID>middle. Here the prefix and the generated middle form one contiguous span, which makes continuation slightly more natural and tends to score marginally higher on infilling benchmarks.

Bavarian et al. found that jointly training both formats transfers positively, and that a high FIM rate (they tested up to 90%) costs little or nothing on ordinary left-to-right generation, the "FIM-for-free" result.^{[4]} In practice teams still tune the mix empirically, because too much infilling-specific data can shift behavior on plain continuation. Bavarian et al. also recommend applying the FIM split at the character level so the model stays stable when the cursor lands in the middle of a token.
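
The transformation steps and the two formats above can be sketched as a small data-preparation function. This is a minimal illustration, not a training pipeline: the sentinel strings reuse the `<PRE>`, `<SUF>`, `<MID>` notation from this section (real tokenizers reserve their own special tokens), and the `fim_rate` and `spm_share` defaults are placeholder assumptions rather than recommended values.

```python
import random

# Sentinel strings follow the notation used in this section; actual special
# token names differ between models.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def split_at_characters(document: str, rng: random.Random) -> tuple[str, str, str]:
    """Pick two character-level boundaries and return (prefix, middle, suffix)."""
    lo, hi = sorted(rng.randint(0, len(document)) for _ in range(2))
    return document[:lo], document[lo:hi], document[hi:]

def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Prefix-Suffix-Middle ordering."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def to_spm(prefix: str, middle: str, suffix: str) -> str:
    """Suffix-Prefix-Middle ordering."""
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

def fim_transform(
    document: str,
    rng: random.Random,
    fim_rate: float = 0.5,   # assumed share of examples that get transformed
    spm_share: float = 0.5,  # assumed share of transformed examples using SPM
) -> str:
    """Return either the untouched document or a FIM-formatted training example."""
    if rng.random() >= fim_rate:
        return document  # stays an ordinary left-to-right example
    prefix, middle, suffix = split_at_characters(document, rng)
    if rng.random() < spm_share:
        return to_spm(prefix, middle, suffix)
    return to_psm(prefix, middle, suffix)

rng = random.Random(0)
source = "def add(a, b):\n    return a + b\n"
print(fim_transform(source, rng, fim_rate=1.0))
```

Because the boundaries are chosen per character rather than per token, some examples split inside an identifier, which is the situation a cursor in the middle of a word creates at inference time.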

Multi-line vs. single-line suggestions

The system should dynamically decide how much code to suggest based on the current context and the likelihood of generating a complete logical block. Generating too much code when the user only wants a single word is distracting and wastes compute, while generating too little inside a new function body defeats the purpose of the tool:

Trigger	Suggestion Type	Scenario p95 objective
Mid-expression (after `.`, `(`)	Type-informed local suggestion first	60 ms
End of line	Single line completion	200 ms
After function signature	Multi-line body	500 ms
Empty line in a function	Multi-line block	500 ms

A lightweight classifier on the client side can select a request class before sending remote work, allowing the server to use different models or decoding strategies per type. For example, multi-line completions might be routed to a more capable model, while exact member completion remains local.

route-completion-request.py

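def route(trigger: str, after_signature: bool) -> tuple[str, int]:
    """Return (request class, scenario p95 objective in ms) for one trigger."""
    # Exact member or argument completion stays on the local semantic lane.
    if trigger in {".", "("}:
        return "LOCAL_SEMANTIC", 60
    # A fresh function signature justifies the slower multi-line route.
    if after_signature:
        return "REMOTE_MULTILINE", 500
    # Everything else takes the single-line inline route. A fuller router
    # would also send empty lines inside a function to the multi-line class.
    return "REMOTE_INLINE", 200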

cases = [
    route(".", False),
    route("\n", True),
    route("r", False),
]
assert cases == [
    ("LOCAL_SEMANTIC", 60),
    ("REMOTE_MULTILINE", 500),
    ("REMOTE_INLINE", 200),
]
print("routes:", cases)

Output

routes: [('LOCAL_SEMANTIC', 60), ('REMOTE_MULTILINE', 500), ('REMOTE_INLINE', 200)]

Production scaling

Serving an LLM is compute-intensive, and code completion adds high-churn requests that can become stale while a developer types. Provisioning and admission control should preserve measured latency objectives during traffic spikes.

Many completion requests are short or cancelled before display. Measuring that churn makes request suppression, prefix reuse, and abstention policies concrete cost decisions rather than assumed wins.

GPU fleet management and load balancing

Running code completion for a global developer fleet requires GPU provisioning and routing that balance latency, cache affinity, and isolation. The gateway can distribute requests using several strategies:

Prefix-aware routing: Route requests that share long prompt prefixes to the same GPU instance to maximize KV cache hit rates. If Developer A is editing src/auth/handler.py, subsequent requests for that file should hit the same shard.
Geographic routing: Prefer an approved nearby GPU region when residency and capacity permit. In this scenario, a 50 ms network round trip consumes one quarter of the 200 ms objective.
Model tiering: Use smaller models for single-token completions and larger models for multi-line suggestions, routing based on the expected suggestion type.
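
Prefix-aware routing from the list above can be sketched with rendezvous (highest-random-weight) hashing, which keeps a key on the same shard as long as that shard stays in the pool. The shard names, the `routing_key` helper, and the choice to key on tenant, model, and file path are assumptions for illustration; a production gateway would also weigh load, health, and residency.

```python
import hashlib

def routing_key(tenant: str, model: str, file_path: str) -> str:
    """Build a stable key so repeated edits to one file map to one shard."""
    return "\x1f".join((tenant, model, file_path))

def pick_shard(key: str, shards: list[str]) -> str:
    """Rendezvous hashing: score every shard for this key and take the highest."""
    def score(shard: str) -> int:
        digest = hashlib.sha256(f"{shard}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(shards, key=score)

# Hypothetical shard pool.
shards = ["gpu-a", "gpu-b", "gpu-c"]
key = routing_key("acme-devtools", "code-fim-v3", "src/auth/handler.py")
chosen = pick_shard(key, shards)
print("shard:", chosen)

# Removing a different shard does not move this key, so its cached prefix
# blocks stay useful during scale-down or failure of an unrelated instance.
others = [s for s in shards if s != chosen]
assert pick_shard(key, [chosen, others[0]]) == chosen
```

Including the tenant in the key keeps affinity scoped to one organization, which matches the isolation requirements discussed later in this chapter.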

Continuous batching

Waiting to fill a fixed batch can add queue delay before first-token work begins. Continuous batching lets requests join and leave an active serving schedule between decoding iterations. vLLM^{[15]} implements continuous batching and uses PagedAttention to manage KV-cache storage in blocks. TGI also documents continuous batching, although Hugging Face now describes TGI as being in maintenance mode and recommends newer engines such as vLLM or SGLang for future work.^{[16]} Tune batching against time to first token (TTFT) and throughput, because higher utilization can still harm inline latency when queues grow.
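
The queueing difference can be seen in a toy step-based simulation. This is a deliberately simplified sketch, not how vLLM or any real scheduler is implemented: it assumes two slots, one unit of work per request per step, and hypothetical `(arrival_step, decode_steps)` tuples.

```python
def static_batching(requests: list[tuple[int, int]], slots: int) -> list[int]:
    """Wait for a full batch (or the final partial one), then run it to completion."""
    pending = sorted(requests)
    delays: list[int] = []
    step = 0
    while pending:
        batch = pending[:slots]
        del pending[:slots]
        # The batch starts once the previous batch is done and all members arrived.
        start = max(step, max(arrival for arrival, _ in batch))
        delays.extend(start - arrival for arrival, _ in batch)
        # The whole batch holds its slots until the longest member finishes.
        step = start + max(length for _, length in batch)
    return delays

def continuous_batching(requests: list[tuple[int, int]], slots: int) -> list[int]:
    """Admit waiting requests at any step where a slot is free."""
    pending = sorted(requests)
    running: list[int] = []  # remaining decode steps per active request
    delays: list[int] = []
    step = 0
    while pending or running:
        while pending and pending[0][0] <= step and len(running) < slots:
            arrival, length = pending.pop(0)
            delays.append(step - arrival)
            running.append(length)
        # Run one step; requests on their final step leave the schedule.
        running = [left - 1 for left in running if left > 1]
        step += 1
    return delays

# One long request followed by three short ones: (arrival_step, decode_steps).
workload = [(0, 8), (1, 2), (2, 2), (3, 2)]
static_delays = static_batching(workload, slots=2)
continuous_delays = continuous_batching(workload, slots=2)
print("static queue delays:", static_delays)
print("continuous queue delays:", continuous_delays)
```

In this workload the short requests that arrive behind the long one wait six or seven steps under static batching and one or two steps under continuous batching, because a finished short request frees its slot immediately.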

Cost economics

At scale, code completion is expensive because the IDE emits a steady stream of short-lived requests while the user is typing. The key design question isn't just cost per token. It's cost per useful suggestion.

High churn: Many requests are cancelled before the user ever sees the output.
Prefill-heavy workload: Much of the cost sits in reading context, not in generating long responses.
Useful north-star metric: Track cost per accepted-and-retained suggestion, not cost per request alone.

Prefix caching, request cancellation, and better abstention policies improve both cost and user experience at the same time.
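
The arithmetic behind the north-star metric is small enough to sketch directly. Every number below is hypothetical, including the per-request cost and the funnel counts; costs are kept in integer micro-dollars to avoid floating-point noise.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Funnel:
    gpu_requests: int                 # requests that actually consumed GPU work
    retained_suggestions: int         # accepted suggestions still present after the window
    cost_per_request_micro_usd: int   # assumed flat cost per GPU request

    def cost_per_retained(self) -> float:
        """Total serving cost divided by suggestions that survived editing."""
        total = self.gpu_requests * self.cost_per_request_micro_usd
        return total / self.retained_suggestions

# Baseline: every keystroke request reaches the GPU.
baseline = Funnel(gpu_requests=1000, retained_suggestions=32, cost_per_request_micro_usd=400)

# Same retained value, but debouncing and abstention suppress 300 requests
# before they consume GPU time.
suppressed = Funnel(gpu_requests=700, retained_suggestions=32, cost_per_request_micro_usd=400)

print("baseline micro-USD per retained suggestion:", baseline.cost_per_retained())
print("with suppression:", suppressed.cost_per_retained())
```

Here the cost per request is unchanged at 400 micro-dollars, yet the cost per retained suggestion falls from 12,500 to 8,750 micro-dollars, which is the difference a per-request metric would not show.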

Security, challenges, and ethics

Code completion systems may process proprietary source code, configuration, credentials accidentally present in buffers, and repository metadata. Privacy and security architecture therefore determine what the product is allowed to see and retain.

A useful design begins with explicit contracts: which buffers may be uploaded, whether prompts can be logged or used for training, how long operational data is retained, and which controls prevent cross-tenant reuse.

Data isolation

For an enterprise configuration that forbids code reuse across organizations, enforce tenant isolation across request ingestion, telemetry, and caches:

Training use: No training on user code by default. Model weights are frozen at deployment. User code is processed transiently for inference only.
Prompt data retention: Define explicit, minimal retention policies. Enterprises often require no raw-code retention or short, auditable windows for operational logs.
Tenant isolation: Prevent one organization's code context from leaking through another organization's cache, logs, or outputs. Shared GPU batches can be acceptable only when the runtime preserves per-request state boundaries and observability keeps tenant scopes intact. Partition batches where policy or implementation can't prove that boundary.

The isolation contract should be executable. A logging policy can retain safe metadata for latency debugging while dropping raw source by default:

apply-prompt-retention-policy.py

def audit_record(request: dict[str, str], retain_raw_code: bool) -> dict[str, str]:
    record = {
        "tenant": request["tenant"],
        "model": request["model"],
        "latency_bucket": request["latency_bucket"],
    }
    if retain_raw_code:
        record["prompt"] = request["prompt"]
    return record

request = {
    "tenant": "acme-devtools",
    "model": "code-fim-v3",
    "latency_bucket": "p95_under_200ms",
    "prompt": "API_TOKEN='secret-value'",
}
record = audit_record(request, retain_raw_code=False)

assert "prompt" not in record
print("audit_fields:", sorted(record))

Output

audit_fields: ['latency_bucket', 'model', 'tenant']
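
Cache isolation can be made similarly explicit. The sketch below assumes an application-level cache whose keys the service controls; the `cache_key` helper and the dictionary cache are illustrative, and serving engines that manage their own KV cache need their own isolation mechanism.

```python
import hashlib

def cache_key(tenant: str, model: str, prefix: str) -> str:
    """Derive a cache key that is scoped to one tenant and one model."""
    hasher = hashlib.sha256()
    for part in (tenant, model, prefix):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()

shared_prefix = "import requests\n\ndef fetch(url):\n"
key_acme = cache_key("acme-devtools", "code-fim-v3", shared_prefix)
key_other = cache_key("other-org", "code-fim-v3", shared_prefix)

cache = {key_acme: "cached-state-for-acme"}

# An identical prefix from another tenant misses instead of reusing the entry.
assert key_other not in cache
assert cache_key("acme-devtools", "code-fim-v3", shared_prefix) in cache
print("cross_tenant_hit:", key_other in cache)
```

The cost of this choice is a lower hit rate for boilerplate that many tenants share, which is the trade an enterprise configuration that forbids cross-organization reuse has already made.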

PII redaction

For enterprise usage, client-side redaction can reduce the chance that recognized secrets leave the developer's machine. Regular-expression and entropy-based scanners can catch some API keys, tokens, and passwords, while policy-based filters can mask selected identifiers before prompt construction. Scanners are incomplete, so redaction complements upload controls, restricted logging, access control, and incident response.

If a policy promises that matching secret patterns won't be uploaded, replacement must run before upload. The system replaces matched strings with placeholders such as <API_KEY> before constructing the remote prompt.

When the server returns a generated completion, the client should only reinsert masked values if the placeholder maps to a known local value. Otherwise it should keep the placeholder visible and require an explicit user edit. That prevents the model from inventing a secret-looking string and having the client silently treat it as real.

mask-known-secrets-before-upload.py

import re

TOKEN = re.compile(r"demo_api_token_[A-Za-z0-9]+")

def redact(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}
    def replace(match: re.Match[str]) -> str:
        placeholder = f"<API_KEY_{len(mapping) + 1}>"
        mapping[placeholder] = match.group(0)
        return placeholder
    return TOKEN.sub(replace, text), mapping

prompt, local_mapping = redact("client = API('demo_api_token_abc123')")
assert "demo_api_token_" not in prompt
assert local_mapping["<API_KEY_1>"] == "demo_api_token_abc123"
print("upload_prompt:", prompt)

Output

upload_prompt: client = API('<API_KEY_1>')
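
The reinsertion rule described before the masking example can be sketched the same way. This is a minimal illustration that assumes the `<API_KEY_n>` placeholder shape used above; `restore` is a hypothetical helper name.

```python
import re

PLACEHOLDER = re.compile(r"<API_KEY_\d+>")

def restore(completion: str, local_mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Reinsert only placeholders the client created; report the rest."""
    unresolved: list[str] = []

    def replace(match: re.Match[str]) -> str:
        token = match.group(0)
        if token in local_mapping:
            return local_mapping[token]
        # Unknown placeholder: leave it visible so the user must edit it.
        unresolved.append(token)
        return token

    return PLACEHOLDER.sub(replace, completion), unresolved

local_mapping = {"<API_KEY_1>": "demo_api_token_abc123"}
completion = "client = API('<API_KEY_1>', backup='<API_KEY_7>')"
restored, unresolved = restore(completion, local_mapping)

print("restored:", restored)
print("needs_user_edit:", unresolved)
```

Only `<API_KEY_1>` is replaced, because it is the only placeholder the local mapping knows. `<API_KEY_7>` came from the model, so it stays visible and is flagged for an explicit edit.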

On-premises deployment

Some enterprise policies prohibit sending source code to a shared external service. A product serving those customers may need Virtual Private Cloud (VPC), self-hosted, or offline deployment options.

Self-hosted models: Deploy suitable code models inside customer-controlled infrastructure. This changes trust boundaries, but still requires identity, network, logging, and supply-chain controls.
Air-gapped environments: Support fully offline operation for classified environments. The context engine, model, and inference server all run locally.

The mastery gap and ethical considerations

A balanced look at code completion must address the potential downsides. A common concern is the "mastery gap": if developers accept suggestions without understanding them, their debugging and design instincts can weaken over time.

On the positive side, code completion cuts repetitive boilerplate, lowers the memorization burden of syntax and API details, and can act as live documentation for unfamiliar libraries. It can also make programming more approachable for newcomers and non-native English speakers.

On the negative side, the model can hallucinate libraries that don't exist or emit deprecated syntax. It can suggest code with security vulnerabilities, or reproduce copyrighted fragments from its training data, creating legal exposure. Telemetry and shared training pipelines can also leak proprietary code if the privacy architecture is weak. Finally, over-reliance feeds the mastery gap: the risk is highest for junior developers who accept suggestions before they have learned how the code works.

Mastery check

What strong answers show

Foundational: Designs a context builder that prioritizes prefix, suffix, imports, recent edits, and nearby files.
Intermediate: Explains why semantic completion should beat the LLM after exact trigger characters.
Intermediate: Breaks inline latency into client, network, context assembly, inference, and render budgets.
Advanced: Explains FIM training and why suffix context matters for insertion.
Advanced: Chooses speculative decoding, prefix caching, quantization, or smaller model tiers based on latency bottleneck.
Advanced: Uses accepted-and-retained work, shown rate, stale-response rate, and latency tails instead of raw acceptance alone.
Advanced: Designs cancellation and request-ID gating so stale results never overwrite fresher editor state.

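Follow-up questions

You see high acceptance but low retained characters. What failure pattern does that suggest?
Your completion system has great median latency but bad p99. Why can users still hate it?
The language server knows exact members after `request.`. Why still keep the LLM lane alive?
Why is prefix-aware routing required before prefix caching can save real money?
Your draft model acceptance drops from 4/5 to 1/5 in one language. What should you do?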

Common pitfalls

Symptom: Suggestions appear after the user already typed past them. Cause: No real cancellation path, or UI trusts arrival order instead of request IDs. Fix: Abort in-flight work when possible and gate rendering on newest request ID.
Symptom: Completions are often exact but still feel unhelpful. Cause: System optimizes for raw acceptance with tiny safe suggestions. Fix: Track accepted-and-retained characters and shown rate, not acceptance alone.
Symptom: Member completion after `.` is slower and less accurate than IDE autocomplete used to be. Cause: Every keystroke is routed to the LLM instead of keeping a semantic lane for deterministic cases. Fix: Let the parser or language server own exact symbol completion and reserve GPU work for open-ended spans.
Symptom: Prefix caching hit rate stays low even though users edit the same file repeatedly. Cause: Routing breaks shard affinity, so matching prefixes miss the cached KV blocks. Fix: Add prefix-aware routing keyed by tenant, model, and stable prompt prefix.
Symptom: Inserted code fights the code below the cursor. Cause: System ignores suffix context or uses left-to-right continuation where infill is required. Fix: Use FIM prompt formatting and FIM-trained models for in-file edits.
Symptom: Model quality looks strong in offline code benchmarks but users still dislike the product. Cause: Benchmarks miss stale-response behavior, latency tails, abstention policy, and IDE interaction friction. Fix: Pair offline evals with online product metrics and real editor A/B tests.
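
The fix for the first pitfall, gating rendering on the newest request ID, can be sketched as a small piece of client state. The `EditorState` class and its method names are hypothetical and do not correspond to any editor API; the point is only that arrival order never decides what is shown.

```python
from dataclasses import dataclass

@dataclass
class EditorState:
    latest_request_id: int = 0

    def on_keystroke(self) -> int:
        """Each edit issues a new request ID, which makes older ones stale."""
        self.latest_request_id += 1
        return self.latest_request_id

    def should_render(self, request_id: int) -> bool:
        """Render only the response that belongs to the newest request."""
        return request_id == self.latest_request_id

editor = EditorState()
first = editor.on_keystroke()
second = editor.on_keystroke()

# Responses can arrive out of order; the slow response to the first request
# is dropped even if it arrives after the second one was rendered.
arrivals = [second, first]
rendered = [rid for rid in arrivals if editor.should_render(rid)]
print("rendered_request_ids:", rendered)
```

Aborting the in-flight request saves server work, while this gate protects correctness when the abort arrives too late to stop the response.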

Operating model

Code completion changes part of the user's work from typing to reviewing suggestions. Useful systems gather bounded context, reuse permitted computation, suppress stale output, and measure whether retained edits justify their latency and data access.

Inline-completion checks

In this scenario, inline context building, networking, inference, and rendering are budgeted against a 200 ms p95 objective.
Context is hierarchical. Prioritize the immediate cursor vicinity, then the file, then related imports. Dense retrieval on the keystroke path is usually too slow unless it's heavily optimized.
FIM (Fill-in-the-Middle) uses both sides of an edit. A FIM-capable model can use (Prefix, Suffix) context to generate a fitting middle span.^{[4]}
Speculative decoding and KV cache prefix sharing are common optimizations for serving code models efficiently.^{[6]}^{[15]}
Debouncing and cancellation reduce work that would arrive stale.
Upload, retention, training-use, cache-isolation, and deployment policies must match the customer's data contract.
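
The budgeting check in the first item can be sketched as a comparison of measured stage latencies against planned allocations. The stage split and every number here are assumptions chosen so the allocations sum to the scenario's 200 ms objective. Per-stage p95 values do not add up to the end-to-end p95, so this is a planning aid and the end-to-end objective still has to be measured directly.

```python
SLO_MS = 200

# Hypothetical planning allocation per stage, in milliseconds.
BUDGET_MS = {
    "client": 10,
    "network": 50,
    "context": 20,
    "inference": 100,
    "render": 20,
}

def over_budget(measured_p95_ms: dict[str, int]) -> dict[str, int]:
    """Return each stage that exceeds its allocation and by how many ms."""
    return {
        stage: measured_p95_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if measured_p95_ms[stage] > limit
    }

# The allocation itself must fit inside the objective.
assert sum(BUDGET_MS.values()) == SLO_MS

# Hypothetical measurements from one region.
measured = {"client": 8, "network": 62, "context": 18, "inference": 95, "render": 12}
violations = over_budget(measured)
print("over_budget_ms:", violations)
```

Only the network stage exceeds its allocation in this example, which points at routing or region placement rather than the model.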

You have now designed a low-latency, context-aware code completion system. Hierarchical context construction, KV-cache prefix reuse, debouncing and cancellation, speculative decoding, and FIM training form a reusable foundation for other high-frequency, low-latency inference surfaces.

The inline-completion design is now measurable: context selection, infill formatting, freshness gates, prefill reuse, and privacy contracts. The next capstone turns that single-product serving path into a shared multi-tenant GPU platform with isolation, budgets, and safe rollouts.


References

[1] GitHub. GitHub Copilot code suggestions in your IDE. 2026. https://docs.github.com/en/copilot/concepts/completions/code-suggestions
[2] GitHub. GitHub Copilot cloud agent. 2026.
[3] Microsoft. Language Server Protocol. 2026.
[4] Bavarian, M., et al. Efficient Training of Language Models to Fill in the Middle. arXiv preprint, 2022. https://arxiv.org/abs/2207.14255
[5] Robertson, S., & Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009.
[6] Leviathan, Y., Kalman, M., & Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML 2023. https://arxiv.org/abs/2211.17192
[7] vLLM Team. Speculative Decoding. vLLM Documentation, 2026.
[8] vLLM. Automatic Prefix Caching. 2026.
[9] Frantar, E., et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. ICLR 2023.
[10] Hugging Face. GPTQ. 2026.
[11] Visual Studio Code. Programmatic Language Features: Show Inline Completions. 2026.
[12] Ziegler, A., et al. Measuring GitHub Copilot's Impact on Productivity. 2024. https://cacm.acm.org/research/measuring-github-copilots-impact-on-productivity/
[13] Chen, M., et al. Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint, 2021. https://arxiv.org/abs/2107.03374
[14] Qwen Team, Alibaba Group. Qwen2.5-Coder Technical Report. arXiv preprint, 2024. https://arxiv.org/abs/2409.12186
[15] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. https://arxiv.org/abs/2309.06180
[16] Hugging Face. Text Generation Inference. 2026. https://huggingface.co/docs/text-generation-inference/index
