Master the architecture of an end-to-end AI search engine, covering freshness routing, hybrid retrieval, evidence packing, citation verification, and streaming synthesis.
Multi-tenant serving made tenant identity, cost, and visibility part of every request. Large language model (LLM)-powered search applies that discipline to evidence: retrieval, ranking, caches, citations, and generated claims must all respect freshness and authorization before the model writes an answer.
You are the Staff AI Engineer at LogiFlow. Customers type questions such as "warm waterproof jacket with side pockets under $120, same-day to 94107" and "what happens if my cross-border order arrives damaged?" A generative search experience must retrieve allowed policies, product pages, and merchant terms, synthesize a cited answer, and keep private promotions and inventory outside unauthorized contexts.
An LLM-powered search engine combines retrieval, ranking, synthesis, and citation into a user-facing answer system. The core engineering challenge is balancing retrieval and generation latency with factual accuracy, freshness, and merchant isolation.
Building a search engine that generates answers rather than just lists of links requires orchestrating retrieval, reranking, and LLM synthesis inside an interactive latency budget. In this LogiFlow case study, we will design that system around evidence and measurable service objectives.
If embeddings or chunking feel rusty, here's a quick reminder: embeddings turn sentences into dense vectors so similar meanings sit close together in vector space, and chunking splits long documents into short, self-contained passages so retrieval can grab the right snippet. Both are prerequisites for everything that follows.
Traditional search points users to candidate documents. LLM-powered search retrieves policy pages, product docs, and event records, then returns a cited answer with a support state. This relies on Retrieval-Augmented Generation (RAG)[1], which conditions generation on retrieved evidence rather than treating model memory as the source of truth.
Traditional search leaves synthesis to the user. Generative search performs that synthesis, which means it also owns citation binding, abstention, and authorization failures.
This capstone uses a practical answer-engine architecture: retrieve with keyword and dense-embedding signals, rerank candidates, apply visibility and freshness policy, then synthesize a short cited answer whose claims can be checked.
| Feature | Traditional Search | Generative Search |
|---|---|---|
| Output | Ranked links or snippets | Answer plus cited evidence |
| Interaction | Query and result navigation | Answer and optional follow-up |
| User validation | User opens sources directly | User must be able to inspect cited sources |
| Source tracking | Source page is the result | Claim-to-source binding is an application duty |
Before selecting components, define performance and functional constraints. In an interview, be explicit that these are example SLOs for a consumer-facing product, not universal constants.
| Dimension | Constraint | Example target |
|---|---|---|
| Latency | Time to first token (TTFT) | ~0.8-1.5 seconds |
| Throughput | Peak queries per second (QPS) | 1K-10K QPS per region |
| Quality | Answer accuracy | High citation precision, low unsupported claims |
| Cost | Average query cost | Low single-digit cents at scale |
| Freshness | Information latency | Minutes for news, hours for general |
The architecture has four query-time layers: plan the query, retrieve candidate evidence, pack the best evidence into context, and generate an answer while verifying citations.
Separate from this online path, a background ingestion system crawls or receives source updates, cleans them, chunks them, and keeps sparse and dense indexes fresh. Without that ingest path, an "AI search engine" quickly becomes a prompt wrapper around stale data.
To make the pipeline concrete, let's follow the same question through every stage: "Which return policy applies to a damaged cross-border order?"
| Stage | What the system does |
|---|---|
| Query plan | The router classifies the question as a comparison needing both policy docs and carrier exceptions. It decomposes the query into sub-queries for "damaged item return policy" and "cross-border shipping exceptions." |
| Retrieval | BM25 (Best Match 25, a keyword-scoring algorithm) pulls the exact "Return Policy" and "Cross-border Shipping" pages from the doc store. Dense retrieval finds a related FAQ about damaged-item workflows. |
| Reranking | The cross-encoder scores the candidate passages and keeps the top 3 that mention both damage and international shipping. |
| Synthesis | The LLM receives the packed passages with source IDs and writes a two-sentence answer with inline citations like [1] and [2]. |
| Verification | A verifier returns support labels for answer claims against cited text. Product policy decides whether pending or unsupported claims are hidden, marked, or regenerated. |
This trace will reappear as we look at each component in detail.
The first step is classifying intent and deciding whether retrieval needs rewriting. A large synthesis model can perform this task, but using it on every query adds measured latency and cost that many simple routes do not need.
A common approach uses specialized, smaller models or encoder-class classifiers for routing, rewriting, and decomposition. Benchmark the chosen planner on your hardware and traffic shape; its job is to distinguish a simple fact lookup from an exploratory or freshness-sensitive request and attach retrieval hints such as "prefer exact matches" or "fresh data required."
One useful trick is HyDE (Hypothetical Document Embeddings): generate a short hypothetical answer, embed that answer, and use it as an alternate dense-search query[2]. This often improves recall for abstract or underspecified questions where the raw query is a poor embedding target.
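The HyDE flow can be sketched in a few lines. This is a minimal illustration under loud assumptions: `draft_hypothetical_answer` is a hypothetical stub standing in for a small LLM call, and `embed` is a toy bag-of-words vector rather than a real dense encoder. Only the shape of the flow (draft, embed the draft, search with it) reflects the technique.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter[str]:
    """Toy bag-of-words vector; a real system would call a dense encoder here."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a: Counter[str], b: Counter[str]) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def draft_hypothetical_answer(query: str) -> str:
    """Hypothetical stub: in practice a small model drafts this text from the query."""
    return (
        "Damaged cross-border orders can be returned within a fixed window "
        "after carrier inspection."
    )

# Hypothetical two-document corpus for illustration only.
corpus = {
    "returns": "Damaged cross-border orders must be initiated within 14 days.",
    "exchanges": "Domestic exchanges follow a separate workflow.",
}

def hyde_search(query: str) -> str:
    hypothetical = draft_hypothetical_answer(query)
    query_vector = embed(hypothetical)  # embed the draft answer, not the raw query
    return max(corpus, key=lambda doc_id: cosine(query_vector, embed(corpus[doc_id])))

best = hyde_search("what happens if my international package shows up broken?")
print("hyde_top_doc:", best)
```

The raw query shares almost no vocabulary with the policy text, while the drafted answer does. The draft is used only as a search key; it must never be shown to the user or treated as evidence.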
Not every query should hit the same retriever. Exact, structured questions often route to APIs, SQL, or an entity store first, then optionally use the LLM only for verbalization. Treating every request like unstructured RAG adds latency and often makes numeric answers less reliable.
| Query pattern | Best first backend | Why |
|---|---|---|
| "Where is order A102 right now?" | Live API | Freshness and exact values matter more than semantic similarity |
| "Find the damaged-item return policy" | BM25 + doc store | Exact title and keyword match dominate |
| "Compare standard vs expedited carrier SLA exceptions" | Hybrid sparse + dense | Needs multi-document synthesis and terminology overlap |
| "What changed in the latest merchant policy update?" | Fresh web / hot index | Recency matters, then reranking decides quality |
| Component | Purpose | Example |
|---|---|---|
| Intent Classifier | Route to appropriate pipeline | "order A102 status" → direct API call |
| Query Decomposer | Break complex queries into sub-queries | "Compare carrier SLA exceptions for fragile items" → 3 sub-queries |
| Safety Filter | Block harmful queries early | Content policy enforcement |
| Freshness Detector | Determine if real-time data is needed | "latest news on X" → force fresh crawl |
The following Python code demonstrates how to structure a low-latency query planner. It takes the raw user query as input and returns a structured execution plan, including classified intent and decomposed sub-queries. In production, the decision can come from a small classifier, rules plus embeddings, or a compact model rather than a frontier synthesis model.
```python
import json
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class QueryPlan:
    intent: str
    route: str
    sub_queries: list[str] = field(default_factory=list)
    filters: dict[str, str | bool] = field(default_factory=dict)
    needs_fresh_data: bool = False
    complexity: str = "simple"

def understand_query(user_query: str) -> QueryPlan:
    """Plan with fast routing logic; swap rules for a small classifier in production."""
    text = user_query.lower()

    if any(term in text for term in ("where is order", "inventory", "price")):
        return QueryPlan(
            intent="factual",
            route="structured_api",
            filters={"requires_live_data": True},
            needs_fresh_data=True,
        )

    if "cross-border" in text and "damaged" in text:
        return QueryPlan(
            intent="comparison",
            route="hybrid_search",
            sub_queries=[
                "damaged item return policy",
                "cross-border shipping exceptions",
            ],
            filters={"doc_type": "policy", "tenant_scoped": True},
            complexity="moderate",
        )

    return QueryPlan(
        intent="exploratory",
        route="hybrid_search",
        sub_queries=[user_query],
        complexity="moderate",
    )

plan = understand_query("Which return policy applies to a damaged cross-border order?")
print(json.dumps(asdict(plan), indent=2))
```

```
{
  "intent": "comparison",
  "route": "hybrid_search",
  "sub_queries": [
    "damaged item return policy",
    "cross-border shipping exceptions"
  ],
  "filters": {
    "doc_type": "policy",
    "tenant_scoped": true
  },
  "needs_fresh_data": false,
  "complexity": "moderate"
}
```

We use a hybrid retrieval strategy to maximize recall before refining for precision. A single retrieval method is rarely sufficient for complex queries. Relying only on keyword search might miss documents that use synonyms, while relying entirely on dense vector embeddings (encoding data as points in a high-dimensional space so related concepts are near each other) can overlook exact matches for specific names, product IDs, or acronyms.
By designing a multi-stage pipeline, we capture the best of both approaches. The first stage runs sparse and dense retrieval in parallel, then fuses the ranked lists with Reciprocal Rank Fusion (RRF)[3]. The second stage applies a more expensive reranker (often a cross-encoder or late-interaction model[4]) to a much smaller candidate set, passing only the most relevant chunks or parent passages to the LLM.
The figure below outlines the flow of this multi-stage retrieval process.
BM25 (Best Match 25, a reliable keyword search algorithm) excels at exact keyword matching; dense retrieval captures semantic similarity. Dense retrieval typically uses bi-encoders (which encode queries and documents separately for fast lookup), though more advanced systems may use architectures like ColBERT (Contextualized Late Interaction over BERT) for late interaction between query and document tokens[4]. Together, they achieve higher recall than either alone.
In practice, hybrid search is not just "run two retrievers." You also need rank fusion, chunking, and evidence assembly. A common pattern is to embed short child chunks for recall, then retrieve the larger parent passage for synthesis so the model sees enough surrounding context to answer cleanly.
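The child-to-parent pattern can be sketched as a lookup step after ranking. The document IDs, the `child_to_parent` map, and the passage texts below are hypothetical toy data; a real system would store the mapping alongside the index.

```python
# Hypothetical toy data: child chunks are indexed for recall,
# parent passages are what the synthesis model actually reads.
parents = {
    "returns-policy": "Full returns policy text, including damaged-item and cross-border sections.",
    "carrier-terms": "Full carrier terms, including inspection and handoff clauses.",
}
child_to_parent = {
    "returns-policy#3": "returns-policy",
    "returns-policy#7": "returns-policy",
    "carrier-terms#2": "carrier-terms",
}

def expand_to_parents(ranked_child_ids: list[str], limit: int) -> list[str]:
    """Map ranked child hits to unique parent IDs, preserving rank order."""
    seen: set[str] = set()
    ordered: list[str] = []
    for child_id in ranked_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # two children of one parent yield one passage
            seen.add(parent_id)
            ordered.append(parent_id)
    return ordered[:limit]

ranked_children = ["returns-policy#7", "carrier-terms#2", "returns-policy#3"]
parent_ids = expand_to_parents(ranked_children, limit=2)
context = [parents[parent_id] for parent_id in parent_ids]
print("parent_ids:", parent_ids)
```

Deduplicating at the parent level keeps two strong child hits from the same page from consuming the context budget twice.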
Why fuse on rank instead of raw score? BM25 and dense-similarity scores have different scales, so adding them without calibration is brittle. Reciprocal Rank Fusion scores documents from rank positions: score(d) = sum over lists of 1 / (k + rank(d)). Cormack et al. evaluated RRF with k = 60; treat that value as a starting point to validate, not a universal optimum.[3] Apply authorization filtering before fusion so a high-ranked forbidden document never becomes generation context.
```python
from collections import defaultdict

tenant = "merchant-a"
acl = {"returns": {"merchant-a"}, "carrier": {"merchant-a"}, "secret-rate": {"merchant-b"}}
sparse_ranked = ["returns", "secret-rate", "carrier"]
dense_ranked = ["secret-rate", "carrier", "returns"]

def allowed_ranked(document_ids: list[str]) -> list[str]:
    return [document_id for document_id in document_ids if tenant in acl[document_id]]

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, document_id in enumerate(ranking, start=1):
            scores[document_id] += 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([allowed_ranked(sparse_ranked), allowed_ranked(dense_ranked)])
assert "secret-rate" not in fused
print("authorized_fused_docs:", fused)
```

```
authorized_fused_docs: ['returns', 'carrier']
```

Real search products rarely use one monolithic index. A useful pattern is a hot-warm split:
| Tier | Typical content | Update pattern | Query behavior |
|---|---|---|---|
| Hot | News, changelogs, live API snapshots | Seconds to minutes | Queried when freshness detector fires |
| Warm | Product docs, internal wikis, help center content | Minutes to hours | Default hybrid search tier |
| Cold | Stable archive and long-tail corpus | Batch compaction | Backfill recall when warm tier misses |
Freshness should influence ranking, not replace relevance. A common pattern is to combine retrieval score with time decay, then let the reranker decide whether the newer document is actually better. If you rank purely by recency, a low-quality post from five minutes ago can outrank the canonical source.
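One way to combine relevance with time decay is a weighted blend with an exponential half-life. The function name `blended_score`, the 24-hour half-life, and the 0.3 recency weight are illustrative assumptions to tune against your own relevance judgments, not recommended values.

```python
from math import exp, log

def blended_score(
    relevance: float,
    age_hours: float,
    half_life_hours: float = 24.0,  # assumption: recency value halves every day
    recency_weight: float = 0.3,    # assumption: relevance keeps most of the weight
) -> float:
    """Blend a retrieval score in [0, 1] with an exponential recency bonus."""
    recency = exp(-log(2) * age_hours / half_life_hours)  # 1.0 when new, 0.5 at one half-life
    return (1 - recency_weight) * relevance + recency_weight * recency

# A three-day-old canonical page versus a weak post from six minutes ago.
canonical = blended_score(relevance=0.90, age_hours=72.0)
fresh_but_weak = blended_score(relevance=0.40, age_hours=0.1)

# Ranking purely by recency flips the order, which is the failure mode to avoid.
recency_only_canonical = blended_score(0.90, 72.0, recency_weight=1.0)
recency_only_fresh = blended_score(0.40, 0.1, recency_weight=1.0)

print("blended prefers canonical:", canonical > fresh_but_weak)
print("recency-only prefers fresh post:", recency_only_fresh > recency_only_canonical)
```

The blended score only shapes the candidate order; the reranker still makes the final call on whether the newer document is actually better.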
TTFT is a critical constraint of this system. The values below are an illustrative budget; validate them against product objectives and measured stage timings.
| Stage | Example latency | Method | Parallelism |
|---|---|---|---|
| Query planning | 10-50ms | Small router / rewriter | Can overlap with initial retrieval |
| Retrieval and fetch | 150-500ms | Sparse, dense, API, or web calls | Fan-out within allowed routes |
| Reranking | 50-150ms | Cross-encoder GPU batch | Batch allowed candidates together |
| Evidence packing | 10-30ms | Dedupe, trim, retain source IDs | Sequential after ranking |
| LLM prefill to first token | 300-800ms | Streaming generation | Sequential after context exists |
| Total to first token | ~0.8-1.5 seconds | Example product objective | Depends on overlap and tail latency |
Note: If you support speculative retrieval, cheap lexical or dense search can begin before rewrite finishes. If you don't, planning and retrieval stack sequentially. Make that decision explicit when you budget TTFT.
```python
planner_ms = 40
initial_retrieval_ms = 260
rerank_ms = 90
pack_ms = 20
prefill_to_first_token_ms = 430

sequential_ttft_ms = planner_ms + initial_retrieval_ms + rerank_ms + pack_ms + prefill_to_first_token_ms
overlapped_ttft_ms = max(planner_ms, initial_retrieval_ms) + rerank_ms + pack_ms + prefill_to_first_token_ms

assert sequential_ttft_ms == 840
assert overlapped_ttft_ms == 800
print("sequential_ttft_ms:", sequential_ttft_ms)
print("overlapped_ttft_ms:", overlapped_ttft_ms)
```

```
sequential_ttft_ms: 840
overlapped_ttft_ms: 800
```

The synthesis prompt can require citations and abstention, but prompting cannot prove grounding. The product must keep source IDs attached to evidence, check generated claims, and decide what may be shown while support is pending or missing.
Grounding is not just a prompt problem. Evidence packing matters too. Put the strongest chunks first, leave room for the model to answer, and avoid stuffing dozens of mediocre passages into the middle of a huge prompt. Long-context models still show "lost in the middle" behavior, where evidence in the center of the context window gets used less reliably[5]. That is one more reason reranking and tight top-K beat dumping everything into a large context window.
The citations a user sees can be a subset of retrieved evidence, but each visible citation must bind to a specific claim and its source span. Do not treat a source list as proof that every generated sentence is supported.
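A first structural check on claim-to-source binding can be sketched as follows. The function name `binding_problems` and the `[n]` marker convention are assumptions for illustration. The check only confirms that each sentence carries a marker pointing at a packed source; it says nothing about whether the source supports the claim.

```python
import re

def binding_problems(answer: str, packed_source_ids: set[int]) -> dict[str, list[str]]:
    """Flag sentences with no citation or with a citation to a source that was never packed."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    problems: dict[str, list[str]] = {"uncited": [], "unknown_source": []}
    for sentence in sentences:
        cited = [int(number) for number in re.findall(r"\[(\d+)\]", sentence)]
        if not cited:
            problems["uncited"].append(sentence)
        elif any(source_id not in packed_source_ids for source_id in cited):
            problems["unknown_source"].append(sentence)
    return problems

draft = (
    "Initiate the return within 14 days [1]. "
    "Carrier inspection may apply [4]. "
    "Refunds arrive instantly."
)
report = binding_problems(draft, packed_source_ids={1, 3})
print(report)
```

Sentences that pass this check still need span or entailment verification; sentences that fail it can be blocked cheaply before any heavier verifier runs.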
Before generation, pack allowed passages into a bounded context while preserving source IDs:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    source_id: int
    tenant: str
    text: str
    tokens: int
    rerank_score: float

def pack_allowed(passages: list[Passage], tenant: str, budget: int) -> list[Passage]:
    selected: list[Passage] = []
    used = 0
    for passage in sorted(passages, key=lambda item: item.rerank_score, reverse=True):
        if passage.tenant != tenant or used + passage.tokens > budget:
            continue
        selected.append(passage)
        used += passage.tokens
    return selected

candidates = [
    Passage(1, "merchant-a", "Damaged returns start within 14 days.", 7, 0.95),
    Passage(2, "merchant-b", "Private discount schedule.", 4, 0.99),
    Passage(3, "merchant-a", "Carrier inspection may apply.", 5, 0.82),
]
packed = pack_allowed(candidates, tenant="merchant-a", budget=12)
assert [passage.source_id for passage in packed] == [1, 3]
print("packed_source_ids:", [passage.source_id for passage in packed])
```

```
packed_source_ids: [1, 3]
```

Keeping temperature low reduces answer variance, but it does not create factuality. The next deterministic example represents a support gate around generated claims: a real synthesizer can draft text, while publication depends on evidence checks.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    source_id: int
    title: str
    content: str

def contains_all(source: Source, required_terms: tuple[str, ...]) -> bool:
    text = source.content.lower()
    return all(term in text for term in required_terms)

def grounded_answer(sources: list[Source]) -> str:
    by_id = {source.source_id: source for source in sources}
    if not contains_all(by_id[1], ("damaged cross-border", "within 14 days")):
        return "Insufficient cited evidence to answer."
    if not contains_all(by_id[2], ("carrier handoff", "inspection")):
        return "Insufficient cited evidence to answer."
    return (
        "Initiate the damaged cross-border return within 14 days [1]. "
        "Damage after carrier handoff may require carrier inspection [2]."
    )

sources = [
    Source(
        source_id=1,
        title="International Returns Portal",
        content="Damaged cross-border orders must be initiated within 14 days.",
    ),
    Source(
        source_id=2,
        title="Carrier Exception Clauses",
        content="Damage after carrier handoff can require carrier inspection.",
    ),
]

answer = grounded_answer(sources)
weak_sources = sources[:1] + [Source(2, "Carrier Exception Clauses", "No inspection rule.")]
assert grounded_answer(weak_sources) == "Insufficient cited evidence to answer."
print(answer)
```

```
Initiate the damaged cross-border return within 14 days [1]. Damage after carrier handoff may require carrier inspection [2].
```

For one-way progressive delivery, Server-Sent Events (SSE) work through the browser EventSource API over HTTP and are straightforward to render[6]. Streaming creates a product-policy choice: either hold factual claims until verification completes, or label streamed text as provisional and update or retract unsupported claims.
Here is an event contract for a low-risk answer path that permits provisional rendering. It emits the plan and sources, declares pending support, streams text, then attaches the result of claim checks. A higher-risk workflow can omit text frames until support becomes supported.
```python
import json

def sse(event_type: str, data: object) -> str:
    payload = {"type": event_type, "data": data}
    return f"data: {json.dumps(payload, separators=(',', ':'))}\n\n"

def event_stream() -> list[str]:
    return [
        sse("plan", {"intent": "comparison", "route": "hybrid_search"}),
        sse("sources", [{"title": "Return Policy", "source_id": 1}]),
        sse("support_state", {"state": "pending", "display": "provisional"}),
        sse("text", "For a damaged cross-border order"),
        sse("text", ", initiate the return within 14 days [1]."),
        sse(
            "support_state",
            {"state": "supported", "claim_ids": ["return-window"]},
        ),
        "data: [DONE]\n\n",
    ]

for frame in event_stream():
    print(frame, end="")
```

```
data: {"type":"plan","data":{"intent":"comparison","route":"hybrid_search"}}

data: {"type":"sources","data":[{"title":"Return Policy","source_id":1}]}

data: {"type":"support_state","data":{"state":"pending","display":"provisional"}}

data: {"type":"text","data":"For a damaged cross-border order"}

data: {"type":"text","data":", initiate the return within 14 days [1]."}

data: {"type":"support_state","data":{"state":"supported","claim_ids":["return-window"]}}

data: [DONE]
```

In production, add periodic heartbeat events and configure upstream proxies not to buffer the stream. Also log state transitions so provisional output that becomes blocked can be audited.
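Heartbeats can be sketched with SSE comment frames: in the event-stream format, a line that starts with a colon is a comment that EventSource clients ignore, so it keeps the connection active without producing a message. The `with_heartbeats` helper, the emit timestamps, and the 15-second interval are hypothetical; a real server would drive this from a timer rather than a precomputed list.

```python
def heartbeat() -> str:
    """SSE comment frame: lines starting with ':' are ignored by EventSource clients."""
    return ": keep-alive\n\n"

def with_heartbeats(frames: list[tuple[float, str]], interval_seconds: float) -> list[str]:
    """Insert a heartbeat whenever the silence before the next frame exceeds the interval.

    Each item pairs a hypothetical emit time (seconds since stream start) with a frame.
    """
    out: list[str] = []
    last_sent = 0.0
    for emit_at, frame in frames:
        while emit_at - last_sent > interval_seconds:
            last_sent += interval_seconds
            out.append(heartbeat())
        out.append(frame)
        last_sent = emit_at
    return out

# A slow verifier leaves a 34.5-second gap between the text and the support state.
timeline = [
    (0.5, 'data: {"type":"text","data":"..."}\n\n'),
    (35.0, 'data: {"type":"support_state","data":{"state":"supported"}}\n\n'),
]
stream = with_heartbeats(timeline, interval_seconds=15.0)
print("frames_sent:", len(stream))
```

Choose the interval to sit below the shortest idle timeout on the path between server and browser, and measure it rather than assuming a default.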
One of the most critical challenges in search is hallucination, when the LLM generates plausible but incorrect facts. Even with RAG, models can misinterpret complex details. Strong systems usually stack multiple checks rather than betting on one verifier:
| Method | What it checks | Relative cost | Main weakness |
|---|---|---|---|
| Quote / span matching | Does the cited span actually appear in the source? | Very low | Misses paraphrases and logical contradictions |
| Natural Language Inference (NLI) | Does the cited passage entail the claim? | Medium | Can struggle on ambiguous or very long passages |
| LLM-as-judge / fallback pass | Should unsupported claims be flagged or regenerated? | High | Requires measured latency, cost, and calibration |
NLI-based verification estimates whether a cited passage entails a claim rather than merely sharing words. TRUE[7] found that large-scale NLI methods and question-generation-plus-answering methods were strong automatic factual-consistency checks on its benchmarks, so NLI is a useful measured layer, not proof of correctness.
The function below is deliberately only a lexical precheck. It can cheaply reject a missing phrase; it cannot label entailment. Claims that pass it still need a calibrated verifier or a policy that keeps them provisional.
```python
import json

def lexical_precheck(claim: str, source_text: str) -> dict[str, str | bool]:
    # Derive terms from the claim itself instead of a hardcoded set,
    # so the check applies to whatever claim is passed in.
    claim_terms = set(claim.lower().replace(".", "").split())
    source_terms = set(source_text.lower().replace(".", "").split())
    phrase_match = claim_terms.issubset(source_terms)

    return {
        "claim": claim,
        "phrase_match": phrase_match,
        "next_step": "entailment_check" if phrase_match else "block_claim",
    }

result = lexical_precheck(
    claim="Returns must be initiated within 14 days",
    source_text="Damaged cross-border returns must be initiated within 14 days.",
)

print(json.dumps(result, indent=2))
```

```
{
  "claim": "Returns must be initiated within 14 days",
  "phrase_match": true,
  "next_step": "entailment_check"
}
```

By the end of this capstone, you should be able to defend these design choices:
| Skill | What good looks like |
|---|---|
| Offline vs online split | You separate ingestion, cleanup, chunking, and index refresh from the latency-sensitive query path. |
| Multi-stage query path | You can explain planning, hybrid retrieval, reranking, evidence packing, synthesis, and verification as distinct stages. |
| Recall improvements | You know when to use query rewriting, decomposition, routing, and HyDE-style hypothetical documents. |
| Citation verification | You check claims against cited sources with span checks, NLI, and selective fallback regeneration. |
| Streaming UX | You design progressive rendering with SSE events for plan, sources, text, and verification state. |
| Latency budget | You allocate TTFT across planning, retrieval, reranking, prefill, and generation instead of treating "LLM latency" as one number. |
| Serving levers | You know where PagedAttention and speculative decoding help, and why they do not replace retrieval discipline. |
| Evaluation loop | You combine RAGAS-style reference-free metrics, labeled retrieval metrics, citation metrics, and online feedback. |
How do you handle queries that need real-time data? Real-time data needs its own freshness pipeline. Use a hot index for newly crawled content plus direct routes to live APIs or structured stores for exact numeric fields. When the planner marks a query as freshness-sensitive, search the hot tier first, apply recency-aware scoring, then merge with the warmer corpus so the LLM sees both new evidence and canonical background context.
How would you detect and handle hallucinated citations? Decompose the generated answer into atomic claims, attach each claim to its cited source, and verify entailment with NLI or a calibrated lighter check. If support is weak or contradictory, flag the statement, remove it, or regenerate with stricter grounding and a smaller evidence set.
What's your latency budget for each pipeline stage? For a consumer-facing experience, a reasonable example SLO is roughly 0.8-1.5 seconds to first token. Query planning often fits in 10-50ms, retrieval fan-out in 150-500ms, reranking in 50-150ms, evidence packing in 10-30ms, and synthesis in roughly 300-800ms to first token. Split synthesis into prefill and decode because prompt length drives prefill cost while answer length drives decode cost.
How do you evaluate answer quality at scale? Start with reference-free metrics such as faithfulness, answer relevance, and context relevance. Add labeled retrieval metrics such as Recall@k or nDCG on benchmark sets, plus citation precision and unsupported-claim rate. In production, track citation click-through, reformulation rate, dwell time, abandonment, and explicit user feedback.
| Symptom | Cause | Fix |
|---|---|---|
| TTFT is consistently >2 seconds | Running a frontier LLM on every query, even simple lookups | Add a lightweight router that sends factual queries to a small model or cache |
| Answers cite the wrong source | Evidence packing stuffs too many mediocre passages into the prompt | Reduce top-K after reranking and place the strongest chunks at the start and end of context |
| Hallucinations persist despite RAG | Missing claim-level verification; relying only on prompt instructions | Stack span matching, NLI, and selective LLM-as-judge checks |
| Cost per query spikes unpredictably | No freshness tiering; every query hits real-time web retrieval | Route only freshness-marked queries to the hot tier and cache stable answers |
| Retrieval misses exact SKUs or order IDs | Using only dense semantic search without keyword fallback | Hybrid sparse + dense retrieval with BM25 for exact matches |
Building a search engine at scale requires careful cost management. Exact dollar figures drift with vendor pricing, model choice, and answer length, so reason from units of work rather than one fixed price table.
| Cost center | Billable unit | What makes it expensive |
|---|---|---|
| Web retrieval | API calls and fetched documents | More sub-queries, more URLs fetched |
| Reranking | Query-document pairs | Larger candidate set or heavier reranker |
| Synthesis | Prompt + completion tokens | Long context windows and long answers |
| Verification | Claims checked against passages | Dense citation requirements, many claims |
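Before writing the formula, fix scenario inputs rather than presenting provider pricing as fixed. This scenario assumes $0.002 for retrieval, $0.001 for reranking, 1,500 prompt tokens plus 400 output tokens at $0.003 per thousand tokens, and $0.001 for verification.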
For these inputs, total cost is roughly $0.01 per query. At 1,000 QPS, that scenario approaches $600 per minute, which is why routing and caching need their own budgets.
```python
retrieval_usd = 0.002
rerank_usd = 0.001
prompt_tokens = 1500
output_tokens = 400
token_usd_per_thousand = 0.003
verification_usd = 0.001
queries_per_second = 1000

query_usd = (
    retrieval_usd
    + rerank_usd
    + (prompt_tokens + output_tokens) / 1000 * token_usd_per_thousand
    + verification_usd
)
per_minute_usd = query_usd * queries_per_second * 60

assert round(query_usd, 4) == 0.0097
print("scenario_query_usd:", round(query_usd, 4))
print("scenario_per_minute_usd:", round(per_minute_usd, 2))
```

```
scenario_query_usd: 0.0097
scenario_per_minute_usd: 582.0
```

Which stage dominates variable cost depends on provider rates, fan-out, prompt length, and verification policy. Measure the breakdown, then route cacheable or structured queries away from unnecessary synthesis and trim top-K where retrieval evaluation allows it.
TTFT can be dominated by retrieval, prompt assembly, or prefill; long answers can shift pressure toward decode. Profile the critical path before choosing an optimization.
PagedAttention improves KV-cache memory management and can increase concurrency when KV allocation is the binding constraint.[8] If decode is the bottleneck, speculative decoding can reduce latency when a draft model is accepted often enough; benchmark it for the answer shapes and target model in this service.[9]
These serving optimizations matter only after retrieval depth and prompt length are under control. If you keep sending 40 mediocre passages to the model, no amount of serving cleverness will rescue TTFT.
For a scenario with 1K QPS sustained and 1.5 seconds average latency per request, Little's Law (average items in system = arrival rate × average time in system) gives the concurrency directly: at 1,000 requests/second with a 1.5-second average time in system, around 1,500 requests are in flight simultaneously.
```python
from math import ceil

queries_per_second = 1000
average_latency_seconds = 1.5
average_output_tokens = 120
effective_decode_tokens_per_second_per_gpu = 6000

in_flight_requests = queries_per_second * average_latency_seconds
decode_gpus = ceil(
    queries_per_second
    * average_output_tokens
    / effective_decode_tokens_per_second_per_gpu
)

assert in_flight_requests == 1500
print("in_flight_requests:", int(in_flight_requests))
print("decode_gpus_before_headroom:", decode_gpus)
```

```
in_flight_requests: 1500
decode_gpus_before_headroom: 20
```

The next step is to translate concurrency into throughput and memory requirements for your actual serving stack. Don't assume a universal "requests per GPU" number. Measure your own model, context length, batching strategy, and quantization settings. PagedAttention improves packing efficiency and reduces KV-cache fragmentation under concurrent load[8], but capacity still depends on empirical tokens-per-second benchmarks.
For long prompts, also budget prefill separately:
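A prefill estimate can follow the same shape as the decode estimate. This sketch reuses the 1,500-token prompt from the cost scenario; the per-GPU prefill throughput of 50,000 tokens per second is a hypothetical placeholder, not a benchmark, and should be replaced with a measured number for your model and hardware.

```python
from math import ceil

queries_per_second = 1000
average_prompt_tokens = 1500  # same prompt size as the cost scenario
# Hypothetical throughput; measure this on your own serving stack.
effective_prefill_tokens_per_second_per_gpu = 50_000

prefill_tokens_per_second = queries_per_second * average_prompt_tokens
prefill_gpus = ceil(prefill_tokens_per_second / effective_prefill_tokens_per_second_per_gpu)

print("prefill_tokens_per_second:", prefill_tokens_per_second)
print("prefill_gpus_before_headroom:", prefill_gpus)
```

Because prefill demand scales with prompt tokens, trimming packed evidence reduces this pool directly, while decode demand only moves with answer length.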
In practice, size the system against whichever pool saturates first: retriever GPUs, reranker GPUs, prefill workers, or decode workers. The answer model may be an expensive tier, but measured retrieval or reranking can still be the first bottleneck.
RAGAS[10] introduced reference-free evaluation around faithfulness, answer relevance, and context relevance. It is one offline layer when gold answers are scarce, not a replacement for labeled retrieval metrics, citation checks, or online product metrics: a response can be consistent with mediocre evidence and still fail the user.
| Layer | Metric | What it catches | Typical requirement |
|---|---|---|---|
| Reference-free offline | Faithfulness | Unsupported claims | LLM/NLI judge over answer vs context |
| Reference-free offline | Answer relevance | Missed user intent | Generated-question similarity or judge |
| Reference-free offline | Context relevance | Noisy, redundant retrieved context | Judge over question vs retrieved passages |
| Labeled offline | Recall@k / nDCG / MRR | Retriever missed or buried the right source | Gold passages or reliable click labels |
| Citation quality | Citation precision / unsupported-claim rate | Wrong citations or uncited claims | Claim extraction + source verification |
| Online | Reformulation rate / citation CTR / abandonment | Real user dissatisfaction | Production logging and experiments |
The code below wires a scorecard together with simple token-overlap proxies so it stays runnable. Those proxies are not RAGAS or calibrated faithfulness judges; replace them in an evaluation service while keeping labeled retrieval metrics separate.
```python
import json
from collections.abc import Mapping
from dataclasses import dataclass
from math import log2


@dataclass(frozen=True)
class Passage:
    id: str
    text: str


def tokens(text: str) -> set[str]:
    # Crude tokenizer: lowercase, drop commas and periods, split on whitespace.
    return set(text.lower().replace(",", "").replace(".", "").split())


def overlap(a: str, b: str) -> float:
    # Fraction of the tokens in `a` that also appear in `b` (asymmetric).
    left = tokens(a)
    right = tokens(b)
    return len(left & right) / max(1, len(left))


def citation_precision(answer: str, cited_passages: list[Passage]) -> float:
    # Proxy only: share of "[" markers in the answer that match a cited passage ID.
    # It does not check whether the cited text supports the claim.
    cited_ids = [passage.id for passage in cited_passages if f"[{passage.id}]" in answer]
    return len(cited_ids) / max(1, answer.count("["))


def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    return len(set(retrieved_ids[:k]) & gold_ids) / max(1, len(gold_ids))


def ndcg_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    # Binary-gain nDCG: each gold passage counts 1.0, discounted by log2(rank + 2).
    gains = [1.0 if item in gold_ids else 0.0 for item in retrieved_ids[:k]]
    dcg = sum(gain / log2(rank + 2) for rank, gain in enumerate(gains))
    ideal = sum(1.0 / log2(rank + 2) for rank in range(min(len(gold_ids), k)))
    return dcg / ideal if ideal else 0.0


def score_search_response(
    query: str,
    answer: str,
    retrieved_passages: list[Passage],
    cited_passages: list[Passage],
    gold_passage_ids: set[str] | None = None,
) -> Mapping[str, float]:
    context = " ".join(passage.text for passage in retrieved_passages)
    scorecard = {
        "answer_context_token_overlap": overlap(answer, context),
        "query_answer_token_overlap": overlap(query, answer),
        "query_context_token_overlap": overlap(query, context),
        "citation_precision": citation_precision(answer, cited_passages),
    }

    # Labeled retrieval metrics are reported only when gold passage IDs exist.
    if gold_passage_ids is not None:
        retrieved_ids = [p.id for p in retrieved_passages]
        scorecard["recall_at_20"] = recall_at_k(retrieved_ids, gold_passage_ids, k=20)
        scorecard["ndcg_at_10"] = ndcg_at_k(retrieved_ids, gold_passage_ids, k=10)

    return {metric: round(value, 2) for metric, value in scorecard.items()}


retrieved = [
    Passage("1", "Damaged cross-border returns must be initiated within 14 days."),
    Passage("2", "Carrier exception clauses may require carrier inspection."),
    Passage("3", "Domestic exchanges follow a separate workflow."),
]
answer = "Damaged cross-border returns must be initiated within 14 days [1]."

print(json.dumps(
    score_search_response(
        query="Which return policy applies to a damaged cross-border order?",
        answer=answer,
        retrieved_passages=retrieved,
        cited_passages=retrieved[:1],
        gold_passage_ids={"1", "2"},
    ),
    indent=2,
))
```

```json
{
  "answer_context_token_overlap": 0.9,
  "query_answer_token_overlap": 0.22,
  "query_context_token_overlap": 0.33,
  "citation_precision": 1.0,
  "recall_at_20": 1.0,
  "ndcg_at_10": 1.0
}
```

As traffic grows, evaluate each latency or cost lever against quality and authorization tests:
| Strategy | Implementation Details | Primary Benefit |
|---|---|---|
| Semantic response cache | Key answers by authorization scope and evidence version; reuse only after similarity and freshness thresholds pass. | Can avoid synthesis work for eligible repeat queries. |
| Model routing | Route exact live-fact lookups to structured backends, and route queries that measurement shows to be simple to smaller models. | Can lower cost without weakening answer routes that require synthesis. |
| Speculative retrieval | Start an allowed cheap retriever while rewrite or decomposition runs, then merge or cancel. | Can shorten the measured critical path when early retrieval is useful. |
| Asynchronous verification | Mark streamed claims pending, or hold high-risk facts until checks return. | Trades display latency against unsupported-claim exposure explicitly. |
| Geographic distribution | Place permitted retrieval and caches near users when data residency policy allows it. | Can reduce network latency for eligible data paths. |
| Feedback integration | Log interactions with privacy controls and use labeled review before training rerankers. | Supplies evidence for retriever or reranker changes. |
Symptom: Answers sound polished but use stale, missing, or unauthorized evidence. Cause: Retrieval, routing, and verification were collapsed into one generation step. Fix: Split the system into planner, retriever, reranker, evidence packer, synthesizer, and verifier. Keep trust boundaries outside the model.
Symptom: Search misses exact SKUs, order IDs, or policy titles even though semantically related passages appear. Cause: Dense retrieval is good at paraphrases, not exact lexical anchors. Fix: Use hybrid sparse plus dense retrieval, then fuse ranks before reranking.
Symptom: The right document was retrieved, but the final answer still ignores or misstates it. Cause: Dumping many mediocre passages into context creates prompt bloat and lost-in-the-middle effects. Fix: Rerank aggressively, keep source IDs attached, and pack only the strongest evidence the answer model needs.
Symptom: Answers show footnotes but still contain unsupported or misleading claims. Cause: The system attached sources without checking whether each claim is actually entailed by the cited text. Fix: Add claim extraction plus span checks, NLI, and selective fallback regeneration for weak support.
Symptom: Query cost spikes even after switching to a cheaper model. Cause: Retrieval fan-out, rerank pairs, long prompts, and verification work were left uncontrolled. Fix: Route simple queries earlier, trim top-K harder, cache safely, and reserve frontier synthesis for genuinely multi-hop requests.
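To make the last pitfall concrete, here is a rough per-query cost breakdown. It is a sketch under loud assumptions: every unit price below is a hypothetical placeholder, and `QueryPlan` and `cost_breakdown` are illustrative names. The point is only that cheaper model tokens leave the retrieval, rerank, and verification terms unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPlan:
    retriever_calls: int   # fan-out across indexes or sub-queries
    rerank_pairs: int      # (query, passage) pairs scored by the reranker
    prompt_tokens: int
    output_tokens: int
    verified_claims: int


# Hypothetical unit costs in USD; substitute measured values.
FRONTIER_COSTS = {
    "retriever_call": 0.00002,
    "rerank_pair": 0.000005,
    "prompt_token": 0.000003,
    "output_token": 0.000015,
    "verified_claim": 0.0002,
}
# Same pipeline with a model assumed to be 10x cheaper per token.
CHEAP_MODEL_COSTS = {**FRONTIER_COSTS, "prompt_token": 0.0000003, "output_token": 0.0000015}


def cost_breakdown(plan: QueryPlan, unit_costs: dict[str, float]) -> dict[str, float]:
    """Split the estimated cost of one query into its pipeline components."""
    return {
        "retrieval": plan.retriever_calls * unit_costs["retriever_call"],
        "rerank": plan.rerank_pairs * unit_costs["rerank_pair"],
        "prompt": plan.prompt_tokens * unit_costs["prompt_token"],
        "output": plan.output_tokens * unit_costs["output_token"],
        "verification": plan.verified_claims * unit_costs["verified_claim"],
    }


plan = QueryPlan(retriever_calls=8, rerank_pairs=400, prompt_tokens=6000,
                 output_tokens=300, verified_claims=6)

for label, costs in [("frontier", FRONTIER_COSTS), ("cheap model", CHEAP_MODEL_COSTS)]:
    breakdown = cost_breakdown(plan, costs)
    dominant = max(breakdown, key=breakdown.get)
    print(label, round(sum(breakdown.values()), 5), "dominant:", dominant)
```

Under these placeholder prices the prompt dominates on the frontier model, and reranking becomes the largest term once tokens get cheaper, so trimming top-K and rerank pairs matters more than another model swap.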
Now that you can design a multi-stage search pipeline, the next step is to connect language with vision. In Vision-Language Models & CLIP, you'll learn how contrastive pre-training aligns text and image embeddings, how visual token budgets work, and how modern VLMs like LLaVA and Qwen-VL extend the same retrieval-and-generation ideas to multimodal search.
[1] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Lewis, P., et al. · 2020 · NeurIPS 2020
[2] Precise Zero-Shot Dense Retrieval without Relevance Labels.
Gao, L., Ma, X., Lin, J., & Callan, J. · 2022 · arXiv preprint
[3] Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09
[4] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.
Khattab, O., & Zaharia, M. · 2020 · SIGIR 2020
[5] Lost in the Middle: How Language Models Use Long Contexts.
Liu, N.F., et al. · 2023 · TACL 2023
[6] Using server-sent events.
MDN Web Docs · 2026
[7] TRUE: Re-evaluating Factual Consistency Evaluation.
Honovich, O., et al. · 2022 · NAACL 2022
[8] Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
[9] Fast Inference from Transformers via Speculative Decoding.
Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
[10] RAGAS: Automated Evaluation of Retrieval Augmented Generation.
Es, S., et al. · 2023 · arXiv preprint