Design a real-time code completion path with context construction, measured serving latency, privacy controls, and stale-result suppression.
Content moderation designed a low-latency safety pipeline that decides whether content can enter the product. Code completion uses many of the same production muscles, but the pressure point changes: every keystroke can create a fresh inference request, and stale answers are worse than no answer.
A code completion system predicts useful edits under a latency budget tight enough to run while a developer is still typing. This design chapter covers context collection, ranking, serving, privacy, and feedback loops for developer tools.
Imagine editing warehouse-routing code where the next conditional, test assertion, or API call appears before you move to another line. Code completion tools can provide dimmed inline suggestions as a developer types, and can also offer next-edit suggestions.[1] Building that experience means balancing speed, context, and data handling. In this chapter's design scenario, the inline path has a 200 ms service objective; a real product must set and validate its own objective from user research and telemetry.
Code completion isn't just a popup list anymore. It has evolved into an intelligent developer assistant that uses context to predict and generate entire logic blocks, reducing cognitive load and automating repetitive tasks.
Modern coding assistants now span inline completion (suggestions at the cursor), next edit suggestions (updating nearby code without moving the cursor), and full coding agents (systems that can plan and execute multi-file changes from a higher-level request).
By this point in the curriculum, you've already seen how Transformers predict the next token and how the KV cache avoids recomputing shared work. This article puts those ideas into a product: a code completion system that may consider each edit event, but only sends qualified requests and only displays fresh results.
To understand how modern completion works, it helps to see how we got here.
| Era | Technology | Key Feature |
|---|---|---|
| 1970s-90s | Lexical completion & symbol tables | Local identifier and keyword completion |
| 1996+ (IntelliSense, Microsoft's contextual code-completion service) | Static analysis & ASTs | Type-aware method suggestions |
| 2010s | N-gram models & basic ML | Statistical patterns from GitHub |
| 2021+ (Copilot) | LLMs & transformers | Entire-function generation |
| 2025+ (Agentic) | Retrieval + tool use | Multi-file editing, tests, terminal actions |
Early completion engines were mostly lexical. They used token scanners, prefix indexes, symbol tables, and simple scope rules. That was enough to complete local variables and imported symbols, but not enough to understand intent.
IntelliSense popularized semantic completion in mainstream IDEs. Under the hood, modern semantic engines combine parsing, symbol tables, type inference, and compiler services. They know order.status is a string field on a concrete type, not just another token that happens to follow a dot.
Starting in 2021, tools like GitHub Copilot shifted the approach. The system stopped looking at code as plain text and started seeing it as a language pattern, using Large Language Models to predict entire lines or blocks.
By 2026, many coding products have started to blend inline completion with broader agent workflows such as retrieval, terminal commands, and repository-wide edits.[2] That's a layer above plain autocomplete, but it still depends on the same foundations: fast context gathering, good candidate generation, and tight latency control.
An inline suggestion competes with the user's next keystroke. If it arrives too late, the user has already written past it. If it is irrelevant, it becomes a distraction rather than an aid.
Designing for this environment requires balancing competing priorities: providing the model with enough context to be accurate while executing the inference quickly enough to be useful. The key constraints include:
Meeting these requirements requires more than exposing a chat endpoint. The full pipeline, from the client-side edit listener to the inference engine, needs measurement for time-to-first-token (TTFT). TTFT measures the delay from request submission until the first response token arrives.
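As a rough sketch of what that measurement could look like, the snippet below times a streamed response and records when the first token arrives. `fake_stream` is a hypothetical stand-in for a real streaming client, and the delays are invented for illustration.

```python
import time
from typing import Iterable, Iterator


def fake_stream(tokens: list[str], first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming completion response."""
    for index, token in enumerate(tokens):
        # The first token waits for "prefill"; later tokens arrive faster.
        time.sleep(first_delay_s if index == 0 else gap_s)
        yield token


def measure_ttft(stream: Iterable[str]) -> tuple[float, float, str]:
    """Return (ttft_ms, total_ms, text) for one streamed response."""
    start = time.perf_counter()
    ttft_ms = None
    parts = []
    for token in stream:
        if ttft_ms is None:
            # Delay from request submission to the first response token.
            ttft_ms = (time.perf_counter() - start) * 1000
        parts.append(token)
    total_ms = (time.perf_counter() - start) * 1000
    if ttft_ms is None:
        raise ValueError("stream produced no tokens")
    return ttft_ms, total_ms, "".join(parts)


ttft_ms, total_ms, text = measure_ttft(fake_stream(["return ", "is_valid"], 0.02, 0.005))
print(f"ttft_ms={ttft_ms:.1f} total_ms={total_ms:.1f} text={text!r}")
```

In a real client the start timestamp would be taken where the edit event is accepted for sending, so that debounce and network time are counted rather than hidden.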
One practical way to reason about the budget is to split it across stages: editor + network overhead, context assembly, first-token inference, and UI rendering. The exact numbers vary by region and model size, but the important point is that every stage is on the clock.
| Stage | Typical budget | Notes |
|---|---|---|
| Client event handling + local parse | 5-15ms | Capture the keystroke, cursor position, and lightweight syntax state. |
| Network round trip | 20-70ms | Depends heavily on region and whether the request stays close to the user. |
| Context assembly | 10-40ms | Build the prompt, gather nearby symbols, and fetch a few related files. |
| First-token inference | 40-90ms | Usually the hardest budget to hit because prefill dominates. |
| UI render | 5-15ms | Paint ghost text and avoid jank in the editor. |
For the scenario below, those stage budgets fit under a 200 ms p95 objective. Multi-line suggestions can have a different objective, but they still need cancellation and freshness checks.
Use an executable budget check instead of treating a latency target as a promise. This small calculation fails the candidate path when any stage pushes total latency over the scenario objective:
```python
STAGE_BUDGET_MS = {
    "client_parse": 12,
    "network": 55,
    "context": 28,
    "time_to_first_token": 78,
    "paint": 10,
}
INLINE_P95_OBJECTIVE_MS = 200

total_ms = sum(STAGE_BUDGET_MS.values())
headroom_ms = INLINE_P95_OBJECTIVE_MS - total_ms

assert total_ms <= INLINE_P95_OBJECTIVE_MS
print("scenario_p95_budget_ms:", total_ms)
print("headroom_ms:", headroom_ms)
```

```text
scenario_p95_budget_ms: 183
headroom_ms: 17
```

The system consists of three main components: the IDE (Integrated Development Environment) extension (client), the API gateway (orchestration), and the inference engine (LLM).
The IDE Extension captures keystrokes and manages the local state (open tabs, cursor position). The Context Engine runs locally or on the gateway to select the most relevant code snippets to fit in the prompt window. The Language Server Protocol (LSP) is the standard interface between an editor and a language server, so the same semantic engine can power completion, go-to-definition, and diagnostics across multiple editors.[3]
In practice, deterministic semantic completion and LLM completion often run side by side. The language server produces symbol-aware candidates with exact type information, while the LLM produces longer infill candidates. The client can merge, rank, or gate these results depending on cursor position and confidence.
That hybrid design matters. For exact member completion after a dot, semantic candidates from the language server are often both faster and more reliable than free-form generation. The LLM earns its keep on longer spans, comments-to-code, and cases where the user intent isn't fully captured by the type system.
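One way to picture the merge step is the sketch below. It is an illustrative policy, not a description of any particular product: the `Candidate` type, the normalized `confidence` field, and the `min_llm_confidence` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    source: str        # "lsp" or "llm"
    confidence: float  # assumed to be normalized to [0, 1] by each producer


def merge_candidates(after_member_access: bool, lsp: list[Candidate],
                     llm: list[Candidate], min_llm_confidence: float = 0.5) -> list[Candidate]:
    """Gate and rank candidates from both lanes (hypothetical policy)."""
    trusted_llm = [c for c in llm if c.confidence >= min_llm_confidence]
    if after_member_access and lsp:
        # After a dot, exact type-aware members go first; LLM spans follow.
        lsp_texts = {c.text for c in lsp}
        extras = [c for c in trusted_llm if c.text not in lsp_texts]
        return sorted(lsp, key=lambda c: -c.confidence) + extras
    # Elsewhere, rank everything that passed the gate by confidence.
    return sorted(lsp + trusted_llm, key=lambda c: -c.confidence)


lsp = [Candidate("status", "lsp", 0.9), Candidate("shipping_address", "lsp", 0.7)]
llm = [Candidate("status == 'shipped'", "llm", 0.8), Candidate("stats", "llm", 0.2)]
merged = merge_candidates(True, lsp, llm)
print([c.text for c in merged])
```

The low-confidence `stats` candidate is dropped by the gate, and the semantic members stay ahead of the free-form span after a dot.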
The Inference Server hosts the LLM and handles the heavy compute, using techniques like continuous batching to serve thousands of users simultaneously.
Not every keystroke should go through the full remote generation path. A production client usually keeps a deterministic lane for the cheap, high-confidence cases:
- Obvious typo fixes, such as `pritn` instead of `print`.

That local lane does two jobs. It gives you sub-50ms suggestions for exact matches, and it provides a graceful fallback when the network is slow, the model abstains, or the user is working in a restricted environment.
By splitting the responsibilities across these layers, the architecture minimizes the volume of data sent over the network. The client handles lightweight heuristics and syntax parsing, ensuring that only highly qualified prompts reach the backend, where expensive GPU resources are dedicated strictly to token generation. The architecture visual above shows this flow from the client to the inference server and back.
The model needs enough context to make useful suggestions, but the live prompt budget is deliberately kept small because long prefills destroy latency. Even if the base model advertises a much larger context window, you can't simply stuff the whole repository into the keystroke path.
We layer context sources based on their immediate relevance to the cursor position. Since we can't fit an entire codebase into the model's context window, we must rank information strictly by its likelihood of influencing the next few tokens. The context engine builds the prompt dynamically, filling the available budget (e.g., 8k tokens) from the top priority down until space is exhausted:
| Priority | Source | Method | Rationale |
|---|---|---|---|
| 1 (Highest) | Code before cursor | Direct prefix | Immediate grammatical context. |
| 2 | Code after cursor (suffix) | FIM Suffix | Needed to close brackets, match types. |
| 3 | Imports & Definitions | Static Analysis | Types and functions used in the file. |
| 4 | Recently Edited Files | Temporal locality | Code you just touched is likely relevant. |
| 5 | Neighboring Files | Jaccard Similarity | Files that share imports with current file. |
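The priority-ordered fill can be sketched as a small greedy packer. This is a simplified illustration: `count_tokens` is a crude whitespace estimate standing in for the model tokenizer, the source names and snippets are made up, and the packer skips any source that no longer fits instead of truncating it.

```python
def count_tokens(text: str) -> int:
    """Crude whitespace estimate; a real system would use the model tokenizer."""
    return len(text.split())


def pack_context(sources: list[tuple[int, str, str]], budget_tokens: int) -> list[str]:
    """Select source names in priority order until the budget is spent."""
    selected = []
    remaining = budget_tokens
    for _priority, name, text in sorted(sources):  # lowest number = highest priority
        cost = count_tokens(text)
        if cost <= remaining:
            selected.append(name)
            remaining -= cost
    return selected


# (priority, name, text) tuples; contents are hypothetical.
sources = [
    (3, "imports", "from warehouse.routing import RoutePlanner"),
    (1, "prefix", "def validate_delivery_address(order):"),
    (5, "neighbor_file", "class Truck: capacity = 10 " * 20),
    (2, "suffix", "return is_valid"),
    (4, "recent_file", "def load_zones(): return []"),
]
selected = pack_context(sources, budget_tokens=16)
print(selected)
```

With a 16-token budget the large neighboring file is left out while the four higher-priority sources fit.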
Standard causal language models predict the next token based only on the past (left-to-right). In coding, you often insert code in the middle of a file. If the model ignores the suffix (the code after the cursor), it might generate valid code that conflicts with the closing braces or logic below.
Analogy: Think of standard causal modeling like typing on a typewriter: you can only add to the end of the page. FIM (Fill-in-the-Middle) is like a modern word processor: you can insert text in the middle of a sentence, and the system uses the surrounding context (both left and right) to ensure the insertion fits correctly.
FIM reorders the prompt so the model sees the suffix before generating the middle. Marker strings differ by tokenizer and model; the notation below names their roles rather than defining a universal API:
Build the input in FIM order: prefix marker + code before cursor + suffix marker + code after cursor + middle marker. Then the model generates the missing middle section.
Here is a concrete example. Imagine you're editing a warehouse routing module and your cursor sits inside an empty function body:
```python
from warehouse.routing import RoutePlanner

def validate_delivery_address(order):
    """Check if the shipping address is inside our delivery zone."""
```

The code after the cursor (the suffix):

```python
    return is_valid
```

The FIM-formatted prompt:

```text
<PRE>from warehouse.routing import RoutePlanner

def validate_delivery_address(order):
    """Check if the shipping address is inside our delivery zone."""
<SUF>    return is_valid
<MID>
```

The model now generates the middle section, using both the docstring above and the `return is_valid` below to infer that it should write a delivery-zone check, not a payment validator. Without the suffix, it could generate code that never produces `is_valid`, leaving the following line broken.
Before wiring up a real tokenizer, test the transformation itself. A serving adapter would replace these readable markers with the exact sentinel tokens required by its selected FIM-capable model.
```python
def format_fim(prefix: str, suffix: str) -> str:
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

prefix = "def delivery_zone(order):\n    is_valid = "
suffix = "\n    return is_valid\n"
prompt = format_fim(prefix, suffix)

assert prompt.endswith("<MID>")
assert prompt.index("<SUF>") < prompt.index("return is_valid")
print(prompt.replace("\n", "\\n"))
```

```text
<PRE>def delivery_zone(order):\n    is_valid = <SUF>\n    return is_valid\n<MID>
```

FIM gives a decoder-only model suffix information without changing the left-to-right decoding mechanism. Bavarian et al. found that transformed training data can add infilling ability while preserving ordinary generation performance in their experiments.[4]
To handle large repos, we can't rely on simple file buffering. We use two lightweight retrieval methods that run fast enough to stay inside the keystroke budget.
If `route_planner.py` imports `RoutePlanner`, `AddressValidator`, and `DeliveryWindow`, and `address_utils.py` shares two of those three names while defining one more of its own, its Jaccard score is 2/4 = 0.5 (high enough to pull in a few symbol definitions from it).

Dense vector retrieval is often kept off the hottest keystroke path unless it is cached or precomputed. Once you include lookup, reranking, and prompt assembly, sparse search or heuristic graph traversal can be easier to keep inside a small budget. The key is to pre-build a small symbol graph at editor startup rather than scanning a repository on every edit.

This tiny selector makes the overlap heuristic concrete. It keeps only files whose symbols overlap enough with the active symbols; unrelated code is excluded from the live prompt:
```python
active_symbols = {"RoutePlanner", "AddressValidator", "DeliveryWindow"}
candidate_files = {
    "address_utils.py": {"AddressValidator", "DeliveryWindow", "PostalCode"},
    "billing.py": {"Invoice", "Payment"},
    "planner.py": {"RoutePlanner", "Truck"},
}

def jaccard(left: set[str], right: set[str]) -> float:
    return len(left & right) / len(left | right)

ranked = sorted(
    ((jaccard(active_symbols, symbols), path) for path, symbols in candidate_files.items()),
    reverse=True,
)
selected = [path for score, path in ranked if score >= 0.25]

assert selected == ["address_utils.py", "planner.py"]
print("selected_context:", selected)
```

```text
selected_context: ['address_utils.py', 'planner.py']
```

Serving code models requires optimizing for Time-To-First-Token (TTFT).
Speculative decoding reduces latency by using a smaller, faster "draft" model to predict the next tokens, which are then verified in parallel by a larger, more accurate "target" model.[6] Modern serving stacks expose this as an operational feature with workload-specific caveats, so it still needs acceptance-rate measurement before rollout.[7]
Think of speculative decoding as a senior engineer supervising a fast draft. The draft model proposes a few tokens, and the target model verifies them in parallel. If the target agrees, the client can stream several tokens after one target pass. If the draft misses, the target supplies the correction.
The process works in three stages:
1. The smaller draft model cheaply predicts candidate tokens. For example, it might generate `def calculate_delivery_eta(order):` based on the surrounding class and import context.
2. The larger target model runs one forward pass on the sequence and verifies the draft tokens.
3. If the target model agrees with a prefix of the draft, the server emits that verified prefix without separate target decode steps for each accepted token. If there's a mismatch, generation proceeds from the target model's corrected token.
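The accept-or-correct step can be sketched for the greedy-decoding case. This is a toy illustration with hand-written token lists rather than real models; sampling-based speculative decoding uses a probabilistic acceptance rule instead of the exact-match check shown here.

```python
def verify_draft(draft: list[str], target_choices: list[str]) -> list[str]:
    """Greedy verification: keep the agreed prefix, then one target token.

    target_choices[i] is the token the target model would pick at position i
    given the context plus draft[:i]. It has len(draft) + 1 entries because one
    verification pass also scores the position after the last draft token.
    """
    emitted = []
    for drafted, chosen in zip(draft, target_choices):
        if drafted != chosen:
            emitted.append(chosen)  # first mismatch: take the target's token and stop
            return emitted
        emitted.append(drafted)
    emitted.append(target_choices[len(draft)])  # all accepted: one extra target token
    return emitted


# Hypothetical token sequences for illustration.
draft = ["def", " calculate", "_delivery", "_eta", "("]
target_mismatch = ["def", " calculate", "_delivery", "_window", "(", "order"]
target_agree = draft + ["order"]

print(verify_draft(draft, target_mismatch))
print(verify_draft(draft, target_agree))
```

After a mismatch, the target's later choices were conditioned on a draft token that was rejected, so they are discarded and drafting restarts from the corrected token.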
It's particularly effective for code because code structure is highly repetitive (e.g., standard imports, boilerplate loops), making it easy for small models to guess correctly.
Suppose the draft model is 5x faster than the target model because it has fewer layers and smaller weights. The draft model guesses 5 tokens ahead, and the target model verifies all 5 in a single parallel pass.
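A back-of-the-envelope cost model makes this scenario concrete. It is deliberately simplified: cost is counted in units of one target forward pass, the parallel verification is assumed to cost one such pass, and scheduling and memory overheads are ignored.

```python
def speculative_speedup(accepted: int, draft_len: int = 5,
                        draft_speed_ratio: float = 5.0) -> float:
    """Tokens per target-pass-equivalent, relative to plain decoding (1.0)."""
    # Drafting draft_len tokens costs draft_len / ratio target passes,
    # plus one target pass to verify them all in parallel (assumption).
    cost = draft_len / draft_speed_ratio + 1.0
    # Accepted draft tokens plus the one token the target supplies itself.
    tokens = accepted + 1
    return tokens / cost


for accepted in (5, 3, 0):
    print(f"accepted={accepted} speedup={speculative_speedup(accepted):.1f}x")
```

Under these assumptions a fully accepted draft gives about 3x, three accepted tokens give 2x, and a fully rejected draft runs at half the speed of ordinary decoding, which is why acceptance rate has to be measured rather than assumed.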
Repetitive code such as import blocks or standard exception handling may give a draft model useful acceptance rates. Treat that as a rollout hypothesis, not a property of all code workloads: measure accepted draft length and end-to-end latency by language and suggestion type before enabling it broadly.[6]
The rollout gate can be expressed as a small policy. Here, Python boilerplate improves latency enough to enable speculation, while a low-acceptance configuration remains on ordinary decoding:
```python
measurements = {
    "python_imports": {"mean_accepted_tokens": 3.8, "baseline_p95_ms": 178, "spec_p95_ms": 136},
    "sql_queries": {"mean_accepted_tokens": 1.1, "baseline_p95_ms": 169, "spec_p95_ms": 176},
}

def enable_speculation(sample: dict[str, float]) -> bool:
    latency_gain_ms = sample["baseline_p95_ms"] - sample["spec_p95_ms"]
    return sample["mean_accepted_tokens"] >= 2.0 and latency_gain_ms >= 10

enabled = [name for name, sample in measurements.items() if enable_speculation(sample)]
assert enabled == ["python_imports"]
print("speculative_decode_enabled_for:", enabled)
```

```text
speculative_decode_enabled_for: ['python_imports']
```

Developers often type, pause, and type again in the same file. The file's header (imports, class definitions, and previously written functions) remains constant across these rapid sequential interactions.
Instead of prefilling the same stable file header on every request, the server can cache Key-Value (KV) blocks for a shared prefix. When the next prompt begins with the same cacheable blocks under the same tenant and model policy, it reuses those blocks and prefills only the uncached delta. Automatic prefix caching does not make generation of new output tokens cheaper; it removes duplicate prefill work for reused input context.[8]
| Next request | Reusable prefix work | Remaining work |
|---|---|---|
| Same stable header, a few characters appended | Skip prefill for matching cached blocks | Prefill new input delta, then decode suggestion tokens |
| Edit near the top of the file | Only blocks before the changed point can match | Prefill from first changed block onward, then decode |
| Different tenant, model, tokenizer, or cache policy | No permitted reuse | Full prompt prefill, then decode |
The savings therefore depend on stable prompt prefixes, block matching, routing affinity, and isolation rules. A long stable header with a small cursor-adjacent delta is a good candidate. A request that changed early context or crosses a tenant boundary is a miss by design.
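The block-matching behavior in the table can be illustrated with a toy function. It compares token IDs directly and uses a block size of 4 for readability; both are assumptions for the example, and production engines typically match hashed blocks of a larger size.

```python
def reusable_blocks(cached_tokens: list[int], new_tokens: list[int],
                    block_size: int = 4) -> int:
    """Count whole leading blocks that are identical in both token sequences."""
    blocks = 0
    limit = min(len(cached_tokens), len(new_tokens))
    for start in range(0, limit - block_size + 1, block_size):
        end = start + block_size
        if cached_tokens[start:end] != new_tokens[start:end]:
            break  # everything after the first changed block must be prefilled again
        blocks += 1
    return blocks


cached = list(range(12))                          # hypothetical cached prompt tokens
appended = cached + [99, 100]                     # a few tokens typed at the end
early_edit = [0, 1, 7, 3] + list(range(4, 12))    # edit near the top of the file

for name, tokens in (("appended", appended), ("early_edit", early_edit)):
    hit = reusable_blocks(cached, tokens)
    print(name, "reused_blocks:", hit, "tokens_to_prefill:", len(tokens) - hit * 4)
```

Appending at the end reuses all three cached blocks and prefills only the new tokens, while the early edit invalidates every block from the changed point onward.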
Use a cache key that enforces isolation as well as affinity. The example reuses a stable header for the same tenant and model, but never treats another tenant's identical text as a hit:
```python
def cache_key(tenant: str, model: str, tokenizer: str, prefix: str) -> tuple[str, str, str, str]:
    return tenant, model, tokenizer, prefix

stable_prefix = "from warehouse.routing import RoutePlanner\n"
cached = {
    cache_key("shopflow", "code-fim-v3", "tok-v3", stable_prefix): "kv-block-91",
}

same_scope = cache_key("shopflow", "code-fim-v3", "tok-v3", stable_prefix)
other_tenant = cache_key("marketplace-b", "code-fim-v3", "tok-v3", stable_prefix)

assert cached.get(same_scope) == "kv-block-91"
assert cached.get(other_tenant) is None
print("same_scope_hit:", same_scope in cached)
print("cross_tenant_hit:", other_tenant in cached)
```

```text
same_scope_hit: True
cross_tenant_hit: False
```

This can eliminate repeated prefill work on sequential edits when the prefix stays stable and the reuse boundary is valid.
Code models often tolerate post-training quantization, but you still have to measure acceptance and retained-edit metrics after compression. Running model weights in INT8 or even INT4 precision can reduce model-memory bandwidth demands by roughly 2-4x relative to 16-bit weights, which materially helps completion workloads.[9][10]
This reduction is critical for deployment economics. Completion traffic is usually memory-bandwidth-bound rather than pure-FLOP-bound, so quantization can improve both throughput and TTFT. In practice, it can make the difference between serving a mid-sized code model on one 24 GB to 48 GB GPU versus needing multi-GPU model parallelism, depending on the KV-cache budget and context length.
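The weight-memory arithmetic behind that claim is straightforward. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; KV cache, activations, and quantization metadata such as scales are left out.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # hypothetical mid-sized code model

for name, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights shrinks this example from about 14 GB to about 3.5 GB, which is the headroom that can then be spent on KV cache and longer context on a single GPU.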
Techniques like GPTQ, a post-training quantization method for generative pre-trained transformers, allow this compression to happen after training. This means you can take off-the-shelf open weights and compress them for your specific hardware limits without needing to run an expensive fine-tuning pass.
The client (IDE) needs request control. Sending remote work for every edit event increases cancelled work and can exhaust the service's latency capacity during bursts. Instead, the client manages requesting, waiting, and cancelling based on the current editor state.
We implement a dynamic debounce strategy:
- Trigger characters (`.`, `(`, `\n`): send a request immediately.
- Ordinary characters: wait 150 ms after the last keystroke before sending.

The client-side `RequestManager` serves as the gatekeeper, deciding when to hit the server and when to wait. It takes individual keystrokes as input and either triggers an immediate API request or schedules a delayed one based on the character type. The output is a controlled stream of API requests to the server, preventing network congestion. The implementation below demonstrates this debouncing logic. The `on_type` method is called on every keystroke and first cancels any request that is still pending. If the character is a trigger character (`.`, `(`, `\n`), it schedules the new request with no delay; for ordinary characters, it schedules it with a 150 ms delay. It also tags each request with a monotonically increasing ID so late responses can be ignored safely:
```python
import asyncio

class RequestManager:
    """Manages debounce, cancellation, and stale-response suppression."""

    def __init__(self):
        self.pending_task: asyncio.Task[None] | None = None
        self.latest_request_id = 0
        self.trigger_chars = {'.', '(', '\n'}
        self.sent_requests: list[int] = []
        self.shown_suggestions: list[str] = []

    async def on_type(self, char: str) -> None:
        if self.pending_task and not self.pending_task.done():
            self.pending_task.cancel()

        delay_sec = 0.0 if char in self.trigger_chars else 0.15
        self.latest_request_id += 1
        request_id = self.latest_request_id
        self.pending_task = asyncio.create_task(
            self.debounce_fetch(delay_sec, request_id)
        )

    async def debounce_fetch(self, delay_sec: float, request_id: int) -> None:
        try:
            await asyncio.sleep(delay_sec)
            suggestion = await self.fetch_completion(request_id)
            self.handle_response(request_id, suggestion)
        except asyncio.CancelledError:
            pass

    async def fetch_completion(self, request_id: int) -> str:
        # Real clients also attach request_id to an abortable HTTP request.
        self.sent_requests.append(request_id)
        return f"completion-{request_id}"

    def handle_response(self, request_id: int, suggestion: str) -> None:
        if request_id == self.latest_request_id:
            self.show_ghost_text(suggestion)

    def show_ghost_text(self, suggestion: str) -> None:
        self.shown_suggestions.append(suggestion)

async def demo() -> None:
    manager = RequestManager()
    await manager.on_type('r')
    await asyncio.sleep(0.05)
    await manager.on_type('o')
    await asyncio.sleep(0.05)
    await manager.on_type('.')

    if manager.pending_task:
        await manager.pending_task

    manager.handle_response(2, "late-completion-2")
    print("sent_requests:", manager.sent_requests)
    print("shown_suggestions:", manager.shown_suggestions)

asyncio.run(demo())
```

```text
sent_requests: [3]
shown_suggestions: ['completion-3']
```

In a production editor, you usually combine both layers: abort the HTTP request when possible, and still guard UI updates with request IDs in case the server races or ignores the cancellation.
Modern coding systems expose features that go far beyond predicting the next word:
- Comment-to-code generation: writing a comment such as `// validate address against delivery zones` and having the code appear automatically.

How do we know if the system is good? Unlike typical conversational LLMs where quality is subjective and hard to measure automatically, code completion provides immediate, objective feedback: the user either accepts the suggestion or they don't. However, relying solely on simple acceptance isn't enough to capture the true value provided by the system.
A useful evaluation framework measures both the frequency of helpful suggestions and the work they retain, while checking latency objectives. Tracking these metrics across languages, suggestion classes, and traffic periods can reveal model regressions or infrastructure bottlenecks hidden by aggregates.
| Metric | Definition | Why it matters |
|---|---|---|
| Acceptance Rate | % of shown suggestions inserted by the user. | GitHub's published study reported a 27% acceptance rate in its sample and found acceptance rate best predicted perceived productivity among its usage measurements. It is still gameable with tiny safe suggestions, so pair it with value metrics.[12] |
| Accepted-and-retained characters | Characters from accepted suggestions that still remain after a chosen observation window. | Captures value that survives editing, not just the click. The GitHub study measured unchanged and mostly unchanged completion persistence at several time windows, reinforcing why retention complements acceptance.[12] |
| Completion Shown Rate | % of eligible requests that actually surface a suggestion. | A model that abstains too often won't feel helpful even if the few suggestions it shows are accurate. |
| Latency P99 | 99th percentile (P99) response time. | Slow suggestions break flow. The tail matters as much as the median. |
Offline evaluation on benchmarks like HumanEval[13] is useful for catching model regressions, but it doesn't measure editor timing, shown-rate policy, or accepted edits that users later undo. An online experiment is needed to determine whether a model or context heuristic improves the product experience.
Optimizing solely for acceptance rate can lead to a model that only suggests short, obvious tokens like closing parens because they are safe. You need value metrics such as retained characters, not just click-through.
The metric computation should retain that distinction. In this example, three accepted suggestions become only one meaningfully retained edit:
```python
suggestions = [
    {"shown": True, "accepted_chars": 18, "retained_chars": 0},
    {"shown": True, "accepted_chars": 4, "retained_chars": 4},
    {"shown": True, "accepted_chars": 42, "retained_chars": 35},
    {"shown": True, "accepted_chars": 0, "retained_chars": 0},
]

shown = len(suggestions)
accepted = sum(item["accepted_chars"] > 0 for item in suggestions)
accepted_chars = sum(item["accepted_chars"] for item in suggestions)
retained_chars = sum(item["retained_chars"] for item in suggestions)

acceptance_rate = accepted / shown
retained_char_rate = retained_chars / accepted_chars
assert acceptance_rate == 0.75
assert round(retained_char_rate, 3) == 0.609
print("acceptance_rate:", f"{acceptance_rate:.0%}")
print("retained_char_rate:", f"{retained_char_rate:.1%}")
```

```text
acceptance_rate: 75%
retained_char_rate: 60.9%
```

Choosing the right model for code completion involves a trade-off between quality and latency. A more capable model may produce stronger multi-line candidates but arrive after the user has moved on. A fast model that emits incorrect syntax or APIs also fails the product objective.
The usual production pattern is to keep the keystroke path on a smaller, specialized model and reserve slower, more capable models for bigger edits or chat-style tasks.
A practical design tests smaller, completion-focused models on the keystroke path and reserves higher-latency model routes for broader edits. In practice, those roles often separate: inline-completion candidates tuned for FIM and low TTFT, and more capable general code models for harder multi-file work. Evaluate the split rather than assuming model size alone predicts usefulness. Useful techniques include:
FIM-capable models require a specific training objective. During pre-training, a subset of training examples undergoes a transformation:
1. Split the document into three segments: `(prefix, middle, suffix)`.
2. Reorder the segments to `(prefix, suffix, middle)` with special sentinel tokens between segments.
3. Train the model to predict the `middle` segment given `(prefix, suffix)`.

The two dominant FIM formats are:

- PSM (prefix-suffix-middle): `<PRE>prefix<SUF>suffix<MID>middle`. This is the most common format.
- SPM (suffix-prefix-middle): `<SUF>suffix<PRE>prefix<MID>middle`. Here the prefix and the generated middle form one contiguous span, which makes continuation slightly more natural and tends to score marginally higher on infilling benchmarks.

Bavarian et al. found that jointly training both formats transfers positively, and that a high FIM rate (they tested up to 90%) costs little or nothing on ordinary left-to-right generation, the "FIM-for-free" result.[4] In practice teams still tune the mix empirically, because too much infilling-specific data can shift behavior on plain continuation. They also recommend applying the FIM split at the character level so the model stays robust when the cursor lands in the middle of a token.
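The training-time transformation can be sketched as follows. This is a minimal illustration, not the exact procedure from the paper: the readable `<PRE>`, `<SUF>`, and `<MID>` markers stand in for real sentinel tokens, and the default `fim_rate` and `spm_rate` values are arbitrary assumptions.

```python
import random


def fim_transform(document: str, rng: random.Random, fim_rate: float = 0.5,
                  spm_rate: float = 0.5) -> str:
    """Turn a document into a FIM training example with probability fim_rate."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # keep as an ordinary left-to-right example
    # Character-level cut points, so splits may land in the middle of a token.
    lo, hi = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    if rng.random() < spm_rate:
        # Layout following the suffix-first format described above.
        return f"<SUF>{suffix}<PRE>{prefix}<MID>{middle}"
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"


doc = "def eta(order):\n    return order.distance / order.speed\n"
example = fim_transform(doc, random.Random(7), fim_rate=1.0)
print(example.replace("\n", "\\n"))
```

Whichever layout is chosen, concatenating prefix, middle, and suffix recovers the original document, which is a cheap invariant to assert in a data pipeline.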
The system should dynamically decide how much code to suggest based on the current context and the likelihood of generating a complete logical block. Generating too much code when the user only wants a single word is distracting and wastes compute, while generating too little inside a new function body defeats the purpose of the tool:
| Trigger | Suggestion Type | Scenario p95 objective |
|---|---|---|
| Mid-expression (after `.`, `(`) | Type-informed local suggestion first | 60 ms |
| End of line | Single line completion | 200 ms |
| After function signature | Multi-line body | 500 ms |
| Empty line in a function | Multi-line block | 500 ms |
A lightweight classifier on the client side can select a request class before sending remote work, allowing the server to use different models or decoding strategies per type. For example, multi-line completions might be routed to a more capable model, while exact member completion remains local.
```python
def route(trigger: str, after_signature: bool) -> tuple[str, int]:
    if trigger in {".", "("}:
        return "LOCAL_SEMANTIC", 60
    if after_signature:
        return "REMOTE_MULTILINE", 500
    return "REMOTE_INLINE", 200

cases = [
    route(".", False),
    route("\n", True),
    route("r", False),
]
assert cases == [
    ("LOCAL_SEMANTIC", 60),
    ("REMOTE_MULTILINE", 500),
    ("REMOTE_INLINE", 200),
]
print("routes:", cases)
```

```text
routes: [('LOCAL_SEMANTIC', 60), ('REMOTE_MULTILINE', 500), ('REMOTE_INLINE', 200)]
```

Serving an LLM is compute-intensive, and code completion adds high-churn requests that can become stale while a developer types. Provisioning and admission control should preserve measured latency objectives during traffic spikes.
Many completion requests are short or cancelled before display. Measuring that churn makes request suppression, prefix reuse, and abstention policies concrete cost decisions rather than assumed wins.
Running code completion for a global developer fleet requires GPU provisioning and routing that balance latency, cache affinity, and isolation. The gateway can distribute requests using several strategies:
- Prefix-cache affinity: when a session is working in `src/auth/handler.py`, subsequent requests for that file should hit the same shard.

Waiting to fill a fixed batch can add queue delay before first-token work begins. Continuous batching allows requests to join and leave an active serving schedule. Frameworks such as vLLM[15] and TGI[16] support optimized serving features; in vLLM, PagedAttention manages KV-cache storage in blocks. Tune batching against TTFT and throughput because higher utilization can still harm inline latency when queues grow.
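The file-affinity routing idea mentioned above can be sketched with a stable hash. `pick_shard` and its key layout are hypothetical, and plain modulo hashing is used only for brevity: it remaps most keys when the shard count changes, so a real gateway would more likely use consistent or rendezvous hashing.

```python
import hashlib


def pick_shard(tenant: str, file_path: str, num_shards: int) -> int:
    """Map (tenant, file) to a shard so repeated requests can reuse its prefix cache."""
    # Including the tenant keeps affinity scoped to one organization.
    key = f"{tenant}\x00{file_path}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


first = pick_shard("shopflow", "src/auth/handler.py", num_shards=8)
second = pick_shard("shopflow", "src/auth/handler.py", num_shards=8)
print("stable_routing:", first == second)
```

A cryptographic hash is used here instead of Python's built-in `hash` because the built-in is randomized per process for strings, which would break affinity across gateway instances.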
At scale, code completion is expensive because the IDE emits a steady stream of short-lived requests while the user is typing. The key design question isn't just cost per token. It's cost per useful suggestion.
Prefix caching, request cancellation, and better abstention policies improve both cost and user experience at the same time.
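A small funnel calculation shows how cost per useful suggestion differs from cost per request. All counts and the GPU cost below are invented for illustration, and "useful" is approximated here as "accepted", which a real dashboard might replace with a retention-based measure.

```python
def completion_funnel(gpu_cost_usd: float, sent: int, completed: int,
                      shown: int, accepted: int) -> dict[str, float]:
    """Summarize where requests drop out and what each useful suggestion costs."""
    return {
        "cancelled_share": (sent - completed) / sent,
        "shown_rate": shown / completed,
        "cost_per_request_usd": gpu_cost_usd / sent,
        "cost_per_useful_usd": gpu_cost_usd / accepted,
    }


# Hypothetical one-day totals.
report = completion_funnel(gpu_cost_usd=120.0, sent=1_000_000,
                           completed=600_000, shown=300_000, accepted=80_000)
for name, value in report.items():
    print(f"{name}: {value:.5f}")
```

In this made-up example each request looks cheap, but each accepted suggestion costs more than ten times as much, so suppressing requests that would be cancelled anyway moves the number that matters.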
Code completion systems may process proprietary source code, configuration, credentials accidentally present in buffers, and repository metadata. Privacy and security architecture therefore determine what the product is allowed to see and retain.
A useful design begins with explicit contracts: which buffers may be uploaded, whether prompts can be logged or used for training, how long operational data is retained, and which controls prevent cross-tenant reuse.
For an enterprise configuration that forbids code reuse across organizations, isolate tenant data across request ingestion, telemetry, and caches.
The isolation contract should be executable. A logging policy can retain safe metadata for latency debugging while dropping raw source by default:
```python
def audit_record(request: dict[str, str], retain_raw_code: bool) -> dict[str, str]:
    # Safe metadata is always kept for latency debugging.
    record = {
        "tenant": request["tenant"],
        "model": request["model"],
        "latency_bucket": request["latency_bucket"],
    }
    # Raw source is retained only when the policy explicitly allows it.
    if retain_raw_code:
        record["prompt"] = request["prompt"]
    return record


request = {
    "tenant": "shopflow",
    "model": "code-fim-v3",
    "latency_bucket": "p95_under_200ms",
    "prompt": "API_TOKEN='secret-value'",
}
record = audit_record(request, retain_raw_code=False)

assert "prompt" not in record
print("audit_fields:", sorted(record))
```

```
audit_fields: ['latency_bucket', 'model', 'tenant']
```

For enterprise usage, client-side redaction can reduce the chance that recognized secrets leave the developer's machine. Pattern-based and entropy-based scanners can catch some API keys, tokens, and passwords, while policy-based filters can mask selected identifiers before prompt construction. Scanners are incomplete, so redaction complements upload controls, restricted logging, access control, and incident response.
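An entropy-based check can be sketched with Shannon entropy over candidate tokens. The candidate pattern, the 20-character minimum, and the 4.0-bit threshold are assumptions chosen for this example; real scanners tune these values and still miss some secrets or flag harmless strings.

```python
import math
import re
from collections import Counter

# Assumed candidate shape: long runs of characters common in keys and tokens.
CANDIDATE = re.compile(r"[A-Za-z0-9_\-+/=]{20,}")


def shannon_entropy(text: str) -> float:
    """Bits per character, based on the character frequencies in the string."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_entropy_tokens(source: str, threshold: float = 4.0) -> list[str]:
    """Return long tokens whose character distribution looks random."""
    return [
        match.group(0)
        for match in CANDIDATE.finditer(source)
        if shannon_entropy(match.group(0)) >= threshold
    ]


source = (
    'token = "q7Zp2Lx9Vb4Tn8Rk1Wc6Ys3D"\n'
    "name = get_user_profile_by_identifier(user)\n"
)
flagged = high_entropy_tokens(source)

assert flagged == ["q7Zp2Lx9Vb4Tn8Rk1Wc6Ys3D"]  # the long identifier is not flagged
print("flagged:", flagged)
```

A long identifier with repeated letters scores below the threshold here, while the random-looking string scores above it. Short or low-entropy secrets would slip through, which is why this check only complements the other controls.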
If a policy promises that matching secret patterns won't be uploaded, replacement must run before upload. The system replaces matched strings with placeholders such as <API_KEY> before constructing the remote prompt.
When the server returns a generated completion, the client should only reinsert masked values if the placeholder maps to a known local value. Otherwise it should keep the placeholder visible and require an explicit user edit. That prevents the model from inventing a secret-looking string and having the client silently treat it as real.
```python
import re

# Example pattern only; real scanners cover many more token formats.
TOKEN = re.compile(r"sk_live_[A-Za-z0-9]+")


def redact(text: str) -> tuple[str, dict[str, str]]:
    # The mapping stays on the client and is never uploaded.
    mapping: dict[str, str] = {}

    def replace(match: re.Match[str]) -> str:
        placeholder = f"<API_KEY_{len(mapping) + 1}>"
        mapping[placeholder] = match.group(0)
        return placeholder

    return TOKEN.sub(replace, text), mapping


prompt, local_mapping = redact("client = API('sk_live_abc123')")
assert "sk_live_" not in prompt
assert local_mapping["<API_KEY_1>"] == "sk_live_abc123"
print("upload_prompt:", prompt)
```

```
upload_prompt: client = API('<API_KEY_1>')
```

Some enterprise policies prohibit sending source code to a shared external service. A product serving those customers may need Virtual Private Cloud (VPC), self-hosted, or offline deployment options.
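Returning to the placeholder contract described earlier, the client-side reinsertion rule can be sketched as the counterpart of redaction. The `restore` function and the returned list of unresolved placeholders are hypothetical names for this example; the placeholder shape `<API_KEY_n>` follows the redaction code above.

```python
import re

PLACEHOLDER = re.compile(r"<API_KEY_\d+>")


def restore(completion: str, local_mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Reinsert only placeholders with a known local value; report the rest."""
    unresolved: list[str] = []

    def replace(match: re.Match[str]) -> str:
        placeholder = match.group(0)
        if placeholder in local_mapping:
            return local_mapping[placeholder]
        # Unknown placeholder: keep it visible so the user must edit it explicitly.
        unresolved.append(placeholder)
        return placeholder

    return PLACEHOLDER.sub(replace, completion), unresolved


local_mapping = {"<API_KEY_1>": "sk_live_abc123"}
completion = "a = API('<API_KEY_1>')\nb = API('<API_KEY_7>')"
restored, unresolved = restore(completion, local_mapping)

assert restored == "a = API('sk_live_abc123')\nb = API('<API_KEY_7>')"
assert unresolved == ["<API_KEY_7>"]
print("needs user edit:", unresolved)
```

The client never looks up anything except exact placeholders, so a secret-looking string invented by the model passes through as ordinary text and is not treated as a stored secret.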
A balanced look at code completion must address the potential downsides. A common concern is the "mastery gap": if developers accept suggestions without understanding them, their debugging and design instincts can weaken over time.
On the positive side, code completion cuts repetitive boilerplate, lowers the memorization burden of syntax and API details, and acts as live documentation for unfamiliar libraries. It also makes programming more approachable for newcomers and non-native English speakers.
On the negative side, the model can hallucinate libraries that don't exist or emit deprecated syntax. It might suggest code with security vulnerabilities, or reproduce copyrighted fragments from its training data, creating legal exposure. Telemetry and shared training pipelines can also leak proprietary code if the privacy architecture is weak. Finally, over-reliance feeds the mastery gap: junior developers who accept suggestions without learning how the code works miss the practice that builds debugging and design instincts.
You should be able to defend these design choices clearly:
Symptom: Suggestions appear after the user already typed past them. Cause: No real cancellation path, or UI trusts arrival order instead of request IDs. Fix: Abort in-flight work when possible and gate rendering on newest request ID.
Symptom: Completions are often exact but still feel unhelpful. Cause: System optimizes for raw acceptance with tiny safe suggestions. Fix: Track accepted-and-retained characters and shown rate, not acceptance alone.
Symptom: Member completion after a "." is slower and less accurate than IDE autocomplete used to be. Cause: Every keystroke is routed to the LLM instead of keeping a semantic lane for deterministic cases. Fix: Let the parser or language server own exact symbol completion and reserve GPU work for open-ended spans.
Symptom: Prefix caching hit rate stays low even though users edit the same file repeatedly. Cause: Routing breaks shard affinity, so matching prefixes miss the cached KV blocks. Fix: Add prefix-aware routing keyed by tenant, model, and stable prompt prefix.
Symptom: Inserted code fights the code below the cursor. Cause: System ignores suffix context or uses left-to-right continuation where infill is required. Fix: Use FIM prompt formatting and FIM-trained models for in-file edits.
Symptom: Model quality looks strong in offline code benchmarks but users still dislike the product. Cause: Benchmarks miss stale-response behavior, latency tails, abstention policy, and IDE interaction friction. Fix: Pair offline evals with online product metrics and real editor A/B tests.
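The stale-suggestion fix from the first item in this list, gating rendering on the newest request ID, can be sketched in a few lines. `FreshnessGate` and `Completion` are hypothetical names for this example; a real client would also try to abort the in-flight request and might compare document versions or cursor positions as well.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    request_id: int
    text: str


class FreshnessGate:
    """Tracks the newest request ID and rejects responses to older requests."""

    def __init__(self) -> None:
        self._latest_id = 0

    def issue(self) -> int:
        # Called whenever a new completion request is sent.
        self._latest_id += 1
        return self._latest_id

    def should_render(self, completion: Completion) -> bool:
        # Arrival order is ignored; only the newest request may render.
        return completion.request_id == self._latest_id


gate = FreshnessGate()
first_id = gate.issue()   # user types "retur"
second_id = gate.issue()  # user keeps typing before the first response arrives

# Responses arrive out of order: the newer one first, the older one last.
fresh = Completion(second_id, "return total")
late = Completion(first_id, "return None")

assert gate.should_render(fresh)
assert not gate.should_render(late)  # the late arrival is dropped, not displayed
print("rendered:", fresh.text)
```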
Code completion changes part of the user's work from typing to reviewing suggestions. Useful systems gather bounded context, reuse permitted computation, suppress stale output, and measure whether retained edits justify their latency and data access.
FIM models use (Prefix, Suffix) context to generate a fitting middle span.[4]
You have now designed a low-latency, context-aware code completion system. The principles of hierarchical context construction, KV-cache prefix reuse, debouncing with cancellation, speculative decoding, and FIM training carry over to other high-frequency, low-latency inference surfaces you will build.
You can now defend a measured inline-completion design: context selection, infill formatting, freshness gates, prefill reuse, and privacy contracts. The next capstone turns that single-product serving path into a shared multi-tenant GPU platform with isolation, budgets, and safe rollouts.
[1] GitHub Copilot code suggestions in your IDE. GitHub · 2026
[2] GitHub Copilot cloud agent. GitHub · 2026
[3] Language Server Protocol. Microsoft · 2026
[4] Efficient Training of Language Models to Fill in the Middle. Bavarian, M., et al. · 2022 · arXiv preprint
[5] The Probabilistic Relevance Framework: BM25 and Beyond. Robertson, S., & Zaragoza, H. · 2009 · Foundations and Trends in Information Retrieval
[6] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
[7] Speculative Decoding. vLLM Team · 2026 · vLLM Documentation
[8] Automatic Prefix Caching. vLLM · 2026
[9] GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. Frantar, E., et al. · 2023 · ICLR 2023
[10] GPTQ. Hugging Face · 2026
[11] Programmatic Language Features: Show Inline Completions. Visual Studio Code · 2026
[12] Measuring GitHub Copilot's Impact on Productivity. Ziegler, A., et al. · 2024
[13] Evaluating Large Language Models Trained on Code (HumanEval). Chen, M., et al. · 2021 · arXiv preprint
[14] Qwen2.5-Coder Technical Report. Qwen Team, Alibaba Group · 2024 · arXiv preprint
[15] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
[16] Text Generation Inference. Hugging Face · 2026