LLM-Powered Search Engine

Master the architecture of an end-to-end AI search engine, covering freshness routing, hybrid retrieval, evidence packing, citation verification, and streaming synthesis.

Multi-tenant serving made tenant identity, cost, and visibility part of every request. Large language model (LLM)-powered search applies that discipline to evidence: retrieval, ranking, caches, citations, and generated claims must all respect freshness and authorization before the model writes an answer.

You're the Staff AI Engineer at CodeAtlas. Developers type questions such as "can service accounts rotate OAuth tokens every 90 days?" and "what happens if parser-v2 deprecation affects our SDK migration?" A generative search experience must retrieve allowed API docs, release notes, and workspace policies, synthesize a cited answer, and keep private roadmaps and incident notes outside unauthorized contexts.

An LLM-powered search engine combines retrieval, ranking, synthesis, and citation into a user-facing answer system. The core engineering challenge is balancing retrieval and generation latency with factual accuracy, freshness, and workspace isolation.

Building a search engine that generates answers rather than just lists of links requires orchestrating retrieval, reranking, and LLM synthesis inside an interactive latency budget. This CodeAtlas case study designs that system around evidence and measurable service objectives.

If the ideas of embeddings or chunking feel rusty, here's a quick reminder: embeddings turn text into dense vectors so that passages with similar meanings sit close together in vector space, and chunking splits long documents into short, self-contained passages so retrieval can grab the right snippet. Both are prerequisites for everything that follows.

From keywords to reasoning

Traditional search points users to candidate documents. LLM-powered search retrieves API docs, policy pages, release notes, and incident records, then returns a cited answer with a support state. This relies on Retrieval-Augmented Generation (RAG) [1] (Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://arxiv.org/abs/2005.11401), which conditions generation on retrieved evidence rather than treating model memory as the source of truth.

Traditional search leaves synthesis to the user. Generative search performs that synthesis, which means it also owns citation binding, abstention, and authorization failures.

This capstone uses a practical answer-engine architecture: retrieve with keyword and embedding signals, rerank candidates, apply visibility and freshness policy, then synthesize a short cited answer whose claims can be checked.

Traditional vs LLM search

Feature	Traditional Search	Generative Search
Output	Ranked links or snippets	Answer plus cited evidence
Interaction	Query and result navigation	Answer and optional follow-up
User validation	User opens sources directly	User must be able to inspect cited sources
Source tracking	Source page is the result	Claim-to-source binding is an application duty

What the system must deliver

Before selecting components, define performance and functional constraints. These are example SLOs for this interactive developer-product case study, not universal constants.

Dimension	Constraint	Example target
Latency	Time to first token (TTFT)	~0.8-1.5 seconds
Throughput	Peak queries per second (QPS)	1K-10K QPS per region
Quality	Answer accuracy	High citation precision, low unsupported claims
Cost	Average query cost	Low single-digit cents at scale
Freshness	Information latency	Seconds to minutes for hot updates, hours for stable docs

How the pipeline works end to end

The architecture has four query-time layers: plan the query, retrieve candidate evidence, pack the best evidence into context, and generate an answer while verifying citations.

Separate from this online path, a background ingestion system crawls or receives source updates, cleans them, chunks them, and keeps sparse and dense indexes fresh. Without that ingest path, an "AI search engine" quickly becomes a prompt wrapper around stale data.

Figure: LLM search architecture with an offline build lane for source normalization and hybrid indexing, then an online query lane that retrieves through access control, reranks evidence, and returns a cited answer. Caption: Build indexes offline. At query time, retrieve through access control, rerank evidence, and answer with citations.

Tracing a concrete query

Follow the same question through every stage of the pipeline: "Which deprecation policy applies to parser-v2 SDK migration for service accounts?"

Stage	What the system does
Query plan	The router classifies the question as a comparison needing both API deprecation policy and service-account migration rules. It decomposes the query into sub-queries for "parser-v2 deprecation policy" and "service-account SDK migration exceptions."
Retrieval	BM25 (Best Match 25, a keyword-scoring algorithm) pulls the exact "Parser v2 Deprecation" and "Service Account Migration" pages from the doc store. Dense retrieval finds a related FAQ about legacy SDK workflows.
Reranking	The cross-encoder scores the candidate passages and keeps the top 3 that mention both parser-v2 and service accounts.
Synthesis	The LLM receives the packed passages with source IDs and writes a two-sentence answer with inline citations like [1] and [2].
Verification	A verifier returns support labels for answer claims against cited text. Product policy decides whether pending or unsupported claims are hidden, marked, or regenerated.

This trace will reappear as we look at each component in detail.

Inside the pipeline

1. Query understanding

The first step is classifying intent and deciding whether retrieval needs rewriting. A large synthesis model can perform this task, but using it on every query adds latency and cost that many simple routes don't need.

A common approach uses specialized, smaller models or encoder-class classifiers for routing, rewriting, and decomposition. Benchmark the chosen planner on your hardware and traffic shape; its job is to distinguish a simple fact lookup from an exploratory or freshness-sensitive request and attach retrieval hints such as "prefer exact matches" or "fresh data required."

One useful trick is HyDE (Hypothetical Document Embeddings): generate a hypothetical relevant document, embed that document, and use its vector as an alternate dense-search query [2] (Precise Zero-Shot Dense Retrieval without Relevance Labels, https://arxiv.org/abs/2212.10496). The generated text may contain false details, so HyDE is a retrieval technique, not answer evidence. Validate whether it improves recall for your query mix.
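The HyDE flow can be sketched in a few lines. This is a minimal illustration, not a production recipe: `embed` is a toy bag-of-words stand-in for a real embedding model, `hypothetical_document` returns a hard-coded string where a real system would call a small generator, and the two-document `corpus` is invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; assumption standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hypothetical_document(query: str) -> str:
    """Hypothetical stand-in for an LLM call. The text may be wrong; it is never shown or cited."""
    return (
        "Service accounts must rotate OAuth tokens on a fixed schedule "
        "unless a security exception is approved."
    )

# Invented corpus for illustration only.
corpus = {
    "rotation-policy": "OAuth token rotation schedule for service accounts and approved security exceptions.",
    "parser-guide": "Parser-v2 deprecation timeline and SDK migration steps.",
}

def hyde_search(query: str) -> list[str]:
    # Embed the hypothetical answer instead of the raw query, then rank real documents by similarity.
    query_vector = embed(hypothetical_document(query))
    return sorted(corpus, key=lambda doc_id: cosine(query_vector, embed(corpus[doc_id])), reverse=True)

ranked = hyde_search("can service accounts rotate OAuth tokens every 90 days?")
print("hyde_ranked_docs:", ranked)
```

Only the ranking leaves this function. The hypothetical text is discarded, which keeps any fabricated detail out of the evidence that synthesis later cites.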

Routing targets

Not every query should hit the same retriever. Exact, structured questions often route to APIs, SQL, or an entity store first, then optionally use the LLM only for verbalization. Treating every request like unstructured RAG adds latency and often makes numeric answers less reliable.

Query pattern	Best first backend	Why
"What is deploy job build-1042 doing right now?"	Live API	Freshness and exact values matter more than semantic similarity
"Find the parser-v2 deprecation policy"	BM25 + doc store	Exact title and keyword match dominate
"Compare OAuth rotation exceptions for service accounts"	Hybrid sparse + dense	Needs multi-document synthesis and terminology overlap
"What changed in the latest SDK migration guide?"	Fresh web / hot index	Recency matters, then reranking decides quality

Component	Purpose	Example
Intent Classifier	Route to appropriate pipeline	"build-1042 status" -> direct API call
Query Decomposer	Break complex queries into sub-queries	"Compare service-account rotation exceptions" -> 3 sub-queries
Safety Filter	Block harmful queries early	Content policy enforcement
Freshness Detector	Determine if real-time data is needed	"latest SDK migration guide" -> force fresh crawl

This Python code structures a low-latency query planner. It takes the raw user query as input and returns a structured execution plan, including classified intent and decomposed sub-queries. In production, the decision can come from a small classifier, rules plus embeddings, or a compact model rather than a frontier synthesis model.

routing-targets.py

import json
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class QueryPlan:
    intent: str
    route: str
    sub_queries: list[str] = field(default_factory=list)
    filters: dict[str, str | bool] = field(default_factory=dict)
    needs_fresh_data: bool = False
    complexity: str = "simple"

def understand_query(user_query: str) -> QueryPlan:
    """Plan with fast routing logic; swap rules for a small classifier in production."""
    text = user_query.lower()

    if any(term in text for term in ("build-", "quota", "runner status")):
        return QueryPlan(
            intent="factual",
            route="structured_api",
            filters={"requires_live_data": True},
            needs_fresh_data=True,
        )

    if "parser-v2" in text and "service account" in text:
        return QueryPlan(
            intent="comparison",
            route="hybrid_search",
            sub_queries=[
                "parser-v2 deprecation policy",
                "service-account SDK migration exceptions",
            ],
            filters={"doc_type": "policy", "tenant_scoped": True},
            complexity="moderate",
        )

    return QueryPlan(
        intent="exploratory",
        route="hybrid_search",
        sub_queries=[user_query],
        complexity="moderate",
    )

plan = understand_query("Which deprecation policy applies to parser-v2 SDK migration for service accounts?")
print(json.dumps(asdict(plan), indent=2))

Output

{
  "intent": "comparison",
  "route": "hybrid_search",
  "sub_queries": [
    "parser-v2 deprecation policy",
    "service-account SDK migration exceptions"
  ],
  "filters": {
    "doc_type": "policy",
    "tenant_scoped": true
  },
  "needs_fresh_data": false,
  "complexity": "moderate"
}

2. Multi-stage retrieval

We use a hybrid retrieval strategy to broaden candidate recall before refining for precision. A single retrieval method is rarely sufficient for complex queries. Relying only on keyword search might miss documents that use synonyms, while relying entirely on dense vector embeddings (encoding data as points in a high-dimensional space so related concepts are near each other) can overlook exact matches for specific names, product IDs, or acronyms.

By designing a multi-stage pipeline, we capture the best of both approaches. The first stage runs sparse retrieval and dense retrieval in parallel, then fuses the ranked lists with Reciprocal Rank Fusion (RRF) [3] (Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, https://dl.acm.org/doi/10.1145/1571941.1572114). The second stage applies a more expensive reranker, often a cross-encoder or a late-interaction model [4] (ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, https://arxiv.org/abs/2004.12832), to a much smaller candidate set, passing only the most relevant chunks or parent passages to the LLM.

This visual outlines the flow of this multi-stage retrieval process.

Figure: Hybrid retrieval funnel. BM25, dense, and fresh or API routes fan out for recall, access control runs before fusion, a reranker cuts to top 50, and packing keeps 8 cited chunks. Caption: Retrieve wide, filter access before fusion, rerank a small set, then pack only cited evidence.

Why hybrid retrieval?

BM25 (Best Match 25, a reliable keyword search algorithm) excels at exact keyword matching; dense retrieval captures semantic similarity. Dense retrieval typically uses bi-encoders (which encode queries and documents separately for fast lookup), though more advanced systems may use architectures like ColBERT (Contextualized Late Interaction over BERT) for late interaction between query and document tokens [4]. On a mixed workload, combining retrievers can surface candidates that either one would miss alone.

In practice, hybrid search isn't just "run two retrievers." You also need rank fusion, chunking, and evidence assembly. A common pattern is to embed short child chunks for recall, then retrieve the larger parent passage for synthesis so the model sees enough surrounding context to answer cleanly.
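The child-to-parent pattern described above can be sketched as follows. This is an illustration under simplifying assumptions: sentences act as child chunks, a toy term-overlap `score` stands in for embedding similarity, and the `parents` store and its contents are hypothetical.

```python
# Hypothetical parent passages keyed by document ID.
parents = {
    "deprecation-policy": "Parser-v2 is deprecated. SDK migrations must start within 30 days. Extensions need approval.",
    "rotation-faq": "Service accounts rotate OAuth tokens. Exceptions require security approval.",
}

def split_children(parent_id: str, text: str) -> list[tuple[str, str]]:
    """Split a parent passage into sentence-sized child chunks that remember their parent."""
    return [(parent_id, sentence.strip()) for sentence in text.split(".") if sentence.strip()]

children = [child for parent_id, text in parents.items() for child in split_children(parent_id, text)]

def score(query: str, chunk: str) -> int:
    """Toy term-overlap score; an assumption standing in for dense similarity."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve_parents(query: str, top_children: int = 2) -> list[str]:
    # Rank the small child chunks for recall.
    ranked = sorted(children, key=lambda child: score(query, child[1]), reverse=True)[:top_children]
    # Return each matching parent once, in the order its best child ranked.
    parent_ids: list[str] = []
    for parent_id, _ in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

context = retrieve_parents("when must sdk migrations start")
print("parent_passages:", context)
```

The match happens on one short sentence, but the model receives the full policy passage, including the neighbouring sentence about extensions that the child chunk alone would have dropped.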

Why fuse on rank instead of raw score? BM25 and dense-similarity scores have different scales, so adding them without calibration is brittle. Reciprocal Rank Fusion scores documents from rank positions: score(d) = sum over lists of 1 / (k + rank(d)). Cormack et al. evaluated RRF with k = 60; treat that value as a starting point to validate, not a universal optimum [3]. Apply authorization filtering before fusion so a high-ranked forbidden document never becomes generation context.

authorized-rrf-fusion.py

from collections import defaultdict

tenant = "workspace-a"
acl = {"deprecation": {"workspace-a"}, "rotation": {"workspace-a"}, "private-roadmap": {"workspace-b"}}
sparse_ranked = ["deprecation", "private-roadmap", "rotation"]
dense_ranked = ["private-roadmap", "rotation", "deprecation"]

def allowed_ranked(document_ids: list[str]) -> list[str]:
    return [document_id for document_id in document_ids if tenant in acl[document_id]]

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, document_id in enumerate(ranking, start=1):
            scores[document_id] += 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([allowed_ranked(sparse_ranked), allowed_ranked(dense_ranked)])
assert "private-roadmap" not in fused
print("authorized_fused_docs:", fused)

Output

authorized_fused_docs: ['deprecation', 'rotation']

Freshness and index tiers

Real search products rarely use one monolithic index. A useful pattern is a hot-warm split:

Tier	Typical content	Update pattern	Query behavior
Hot	News, changelogs, live API snapshots	Seconds to minutes	Queried when freshness detector fires
Warm	Product docs, internal wikis, help center content	Minutes to hours	Default hybrid search tier
Cold	Stable archive and long-tail corpus	Batch compaction	Backfill recall when warm tier misses

Freshness should influence ranking, not replace relevance. A common pattern is to combine retrieval score with time decay, then let the reranker decide whether the newer document is better. If you rank purely by recency, a low-quality post from five minutes ago can outrank the canonical source.
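One way to let freshness influence a score without replacing relevance is to scale relevance by a bounded decay factor. The sketch below is an assumption-laden illustration: `freshness_adjusted`, the 72-hour half-life, and the 0.3 weight are hypothetical choices to tune against your own evaluation set, not values from this design.

```python
def freshness_adjusted(
    relevance: float,
    age_hours: float,
    half_life_hours: float = 72.0,  # hypothetical half-life; tune per corpus
    weight: float = 0.3,            # hypothetical share of the score that freshness may affect
) -> float:
    """Scale relevance by a factor in [1 - weight, 1] that decays with document age."""
    decay = 0.5 ** (age_hours / half_life_hours)
    return relevance * ((1 - weight) + weight * decay)

# A canonical policy page that is 30 days old versus a weak post from five minutes ago.
canonical = freshness_adjusted(relevance=0.9, age_hours=30 * 24)
fresh_post = freshness_adjusted(relevance=0.4, age_hours=5 / 60)

print("canonical:", round(canonical, 3))
print("fresh_post:", round(fresh_post, 3))
```

Because age can remove at most `weight` of a document's score, the stale but highly relevant page still outranks the fresh, weak post. Between two equally relevant documents, the newer one wins, and the reranker then makes the final call.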

Latency budget

TTFT is a critical constraint of this system. The values below are an illustrative budget; validate them against product objectives and measured stage timings.

Stage	Example latency	Method	Parallelism
Query planning	10-50ms	Small router / rewriter	Can overlap with initial retrieval
Retrieval and fetch	150-500ms	Sparse, dense, API, or web calls	Fan-out within allowed routes
Reranking	50-150ms	Cross-encoder GPU batch	Batch allowed candidates together
Evidence packing	10-30ms	Dedupe, trim, retain source IDs	Sequential after ranking
LLM prefill to first token	300-800ms	Streaming generation	Sequential after context exists
Total to first token	~0.8-1.5 seconds	Example product objective	Depends on overlap and tail latency

Figure: Time-to-first-token timeline where planning overlaps initial retrieval, then reranking, evidence packing, and model prefill run in sequence until the first token at 800 milliseconds. Caption: Budget TTFT on one timeline. Overlap what you can, then measure the remaining serial path by route.

Note: If you support speculative retrieval, cheap lexical or dense search can begin before rewrite finishes. If you don't, planning and retrieval stack sequentially. Make that decision explicit when you budget TTFT.

ttft-critical-path.py

planner_ms = 40
initial_retrieval_ms = 260
rerank_ms = 90
pack_ms = 20
prefill_to_first_token_ms = 430

sequential_ttft_ms = planner_ms + initial_retrieval_ms + rerank_ms + pack_ms + prefill_to_first_token_ms
overlapped_ttft_ms = max(planner_ms, initial_retrieval_ms) + rerank_ms + pack_ms + prefill_to_first_token_ms

assert sequential_ttft_ms == 840
assert overlapped_ttft_ms == 800
print("sequential_ttft_ms:", sequential_ttft_ms)
print("overlapped_ttft_ms:", overlapped_ttft_ms)

Output

sequential_ttft_ms: 840
overlapped_ttft_ms: 800
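The overlapped path in that arithmetic corresponds to starting retrieval before the plan is ready. A minimal `asyncio` sketch of the idea follows; `plan_query` and `speculative_retrieve` are hypothetical coroutines, and the short sleeps are placeholders for real planner and retriever calls, not measured timings.

```python
import asyncio

async def plan_query(query: str) -> dict[str, str]:
    await asyncio.sleep(0.004)  # placeholder for the planner call
    return {"route": "hybrid_search"}

async def speculative_retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.026)  # placeholder for cheap lexical or dense retrieval
    return ["deprecation", "rotation"]

async def first_stage(query: str) -> tuple[dict[str, str], list[str]]:
    # Run both concurrently, so the stage costs max(planner, retrieval) rather than their sum.
    plan, candidates = await asyncio.gather(plan_query(query), speculative_retrieve(query))
    if plan["route"] != "hybrid_search":
        candidates = []  # discard speculative work when the plan picks another route
    return plan, candidates

plan, candidates = asyncio.run(first_stage("parser-v2 deprecation policy"))
print("plan:", plan)
print("candidates:", candidates)
```

Speculative retrieval spends some backend capacity on results that may be thrown away, so it is worth tracking how often the plan discards them before enabling it on every route.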

3. LLM synthesis with grounded citations

The synthesis prompt can require citations and abstention, but prompting can't prove grounding. The product must keep source IDs attached to evidence, check generated claims, and decide what may be shown while support is pending or missing.

Grounding isn't just a prompt problem. Evidence packing matters too. Put the strongest chunks first, leave room for the model to answer, and avoid stuffing dozens of mediocre passages into the middle of a huge prompt. Long-context models still show "lost in the middle" behavior, where evidence in the center of the context window gets used less reliably [5] (Lost in the Middle: How Language Models Use Long Contexts, https://arxiv.org/abs/2307.03172). That's one more reason reranking and tight top-K beat dumping everything into a large context window.

The citations a user sees can be a subset of retrieved evidence, but each visible citation must bind to a specific claim and its source span. Don't treat a source list as proof that every generated sentence is supported.
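A first step toward claim-level binding is to split a drafted answer into sentences and record which source IDs each sentence cites. The sketch below is a simplified assumption: it treats every sentence as one claim, parses `[n]` markers with a regular expression, and uses hypothetical state names. Binding to a specific source span is left to the verifier.

```python
import re

def bind_claims(answer: str, allowed_source_ids: set[int]) -> list[dict[str, object]]:
    """Attach cited source IDs to each sentence and flag uncited or unknown citations."""
    claims: list[dict[str, object]] = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(number) for number in re.findall(r"\[(\d+)\]", sentence)]
        text = re.sub(r"\s*\[\d+\]", "", sentence).strip()
        if not cited:
            state = "uncited"         # no source list can vouch for this sentence
        elif all(source_id in allowed_source_ids for source_id in cited):
            state = "pending"         # citation exists; entailment still has to be checked
        else:
            state = "unknown_source"  # cites an ID that was never packed into context
        claims.append({"claim": text, "source_ids": cited, "state": state})
    return claims

draft = (
    "Start the parser-v2 SDK migration within 30 days [1]. "
    "Service-account exceptions require security approval [4]. "
    "Migration is usually painless."
)
bound = bind_claims(draft, allowed_source_ids={1, 2})
for claim in bound:
    print(claim["state"], "|", claim["claim"], "|", claim["source_ids"])
```

The second sentence cites an ID outside the packed evidence and the third cites nothing, so neither can be displayed as supported even though the answer carries a source list.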

Before generation, pack allowed passages into a bounded context while preserving source IDs:

evidence-packing-boundary.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    source_id: int
    tenant: str
    text: str
    tokens: int
    rerank_score: float

def pack_allowed(passages: list[Passage], tenant: str, budget: int) -> list[Passage]:
    selected: list[Passage] = []
    used = 0
    for passage in sorted(passages, key=lambda item: item.rerank_score, reverse=True):
        if passage.tenant != tenant or used + passage.tokens > budget:
            continue
        selected.append(passage)
        used += passage.tokens
    return selected

candidates = [
    Passage(1, "workspace-a", "Parser-v2 migration starts within 30 days.", 7, 0.95),
    Passage(2, "workspace-b", "Private roadmap milestone.", 4, 0.99),
    Passage(3, "workspace-a", "Service-account exceptions require security approval.", 5, 0.82),
]
packed = pack_allowed(candidates, tenant="workspace-a", budget=12)
assert [passage.source_id for passage in packed] == [1, 3]
print("packed_source_ids:", [passage.source_id for passage in packed])

Output

packed_source_ids: [1, 3]

Keeping temperature low reduces answer variance, but it doesn't create factuality. The next deterministic example represents a support gate around generated claims: a real synthesizer can draft text, while publication depends on evidence checks.

grounded-answer-contract.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    source_id: int
    title: str
    content: str

def contains_all(source: Source | None, required_terms: tuple[str, ...]) -> bool:
    if source is None:
        return False
    text = source.content.lower()
    return all(term in text for term in required_terms)

def grounded_answer(sources: list[Source]) -> str:
    by_id = {source.source_id: source for source in sources}
    if not contains_all(by_id.get(1), ("parser-v2", "within 30 days")):
        return "Insufficient cited evidence to answer."
    if not contains_all(by_id.get(2), ("security approval", "service-account")):
        return "Insufficient cited evidence to answer."
    return (
        "Start the parser-v2 SDK migration within 30 days [1]. "
        "Service-account OAuth rotation exceptions require security approval [2]."
    )

sources = [
    Source(
        source_id=1,
        title="Parser v2 Deprecation Policy",
        content="Parser-v2 SDK migrations must start within 30 days.",
    ),
    Source(
        source_id=2,
        title="Service Account Rotation Exceptions",
        content="Service-account OAuth rotation exceptions require security approval.",
    ),
]

answer = grounded_answer(sources)
assert grounded_answer(sources[:1]) == "Insufficient cited evidence to answer."
assert grounded_answer(sources[:1] + [Source(2, "Service Account Rotation Exceptions", "No approval rule.")]) == "Insufficient cited evidence to answer."
print(answer)

Output

Start the parser-v2 SDK migration within 30 days [1]. Service-account OAuth rotation exceptions require security approval [2].

4. Streaming architecture

For one-way progressive delivery, Server-Sent Events (SSE) work through the browser EventSource API over HTTP and are straightforward to render [6] (Using server-sent events, https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events). Streaming creates a product-policy choice: either hold factual claims until verification completes, or label streamed text as provisional and update or retract unsupported claims.

Figure: Streaming answer flow where citations and provisional text render first, a parallel claim-check branch verifies support, and the UI later attaches supported or blocked status. Caption: Stream text early only when it's marked provisional. Support checks run in parallel and attach final status after verification.

Streaming endpoint

This event contract fits a low-risk answer path that permits provisional rendering. It emits the plan and sources, declares pending support, streams text, then attaches the result of claim checks. A higher-risk workflow can omit text frames until support becomes supported.

streaming-endpoint.py

import json

def sse(event_type: str, data: object) -> str:
    payload = {"type": event_type, "data": data}
    return f"data: {json.dumps(payload, separators=(',', ':'))}\n\n"

def event_stream() -> list[str]:
    return [
        sse("plan", {"intent": "comparison", "route": "hybrid_search"}),
        sse("sources", [{"title": "Parser v2 Deprecation Policy", "source_id": 1}]),
        sse("support_state", {"state": "pending", "display": "provisional"}),
        sse("text", "For parser-v2 SDK migration"),
        sse("text", ", start within 30 days [1]."),
        sse(
            "support_state",
            {"state": "supported", "claim_ids": ["migration-window"]},
        ),
        "data: [DONE]\n\n",
    ]

for frame in event_stream():
    print(frame, end="")

Output

data: {"type":"plan","data":{"intent":"comparison","route":"hybrid_search"}}

data: {"type":"sources","data":[{"title":"Parser v2 Deprecation Policy","source_id":1}]}

data: {"type":"support_state","data":{"state":"pending","display":"provisional"}}

data: {"type":"text","data":"For parser-v2 SDK migration"}

data: {"type":"text","data":", start within 30 days [1]."}

data: {"type":"support_state","data":{"state":"supported","claim_ids":["migration-window"]}}

data: [DONE]

In production, add periodic heartbeat events and configure upstream proxies not to buffer the stream. Also log state transitions so provisional output that becomes blocked can be audited.
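Heartbeats can be sent as SSE comment lines, which begin with a colon and are ignored by EventSource clients. The sketch below simulates the timing with explicit timestamps instead of a real timer, so it stays deterministic. `with_heartbeats`, the 15-second interval, and the header set are illustrative assumptions, and `X-Accel-Buffering` is specific to nginx-style proxies.

```python
from collections.abc import Iterable, Iterator

# Assumed response headers; other proxies may need different buffering settings.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}

HEARTBEAT = ": keepalive\n\n"  # SSE comment frame; carries no event data

def with_heartbeats(timed_frames: Iterable[tuple[float, str]], interval_s: float = 15.0) -> Iterator[str]:
    """Yield frames in order, inserting one comment frame per full interval of silence."""
    last_sent = 0.0
    for ready_at, frame in timed_frames:
        while ready_at - last_sent >= interval_s:
            last_sent += interval_s
            yield HEARTBEAT
        yield frame
        last_sent = ready_at

# (seconds since stream start, frame): a long gap while verification runs.
frames = [
    (0.5, 'data: {"type":"support_state","data":{"state":"pending"}}\n\n'),
    (40.0, 'data: {"type":"support_state","data":{"state":"supported"}}\n\n'),
]
stream = list(with_heartbeats(frames))
print("heartbeats_sent:", stream.count(HEARTBEAT))
```

A real server would drive the same logic from a timer or an awaitable with a timeout. The point is that the connection stays active during slow verification without emitting anything the UI could mistake for answer text.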

Why answers still hallucinate and how to catch them

One of the most critical challenges in generative search is hallucination: the LLM generates plausible but incorrect facts. Even with RAG, models can misinterpret complex details. Strong systems usually stack multiple checks rather than betting on one verifier:

Method	What it checks	Relative cost	Main weakness
Quote / span matching	Does the cited span appear in the source?	Very low	Misses paraphrases and logical contradictions
Natural Language Inference (NLI)	Does the cited passage entail the claim?	Medium	Can struggle on ambiguous or very long passages
LLM-as-judge / fallback pass	Should unsupported claims be flagged or regenerated?	High	Requires measured latency, cost, and calibration

NLI-based verification estimates whether a cited passage entails a claim rather than merely sharing words. TRUE [7] (TRUE: Re-evaluating Factual Consistency Evaluation, https://arxiv.org/abs/2204.04991) found that large-scale NLI methods and question-generation-plus-answering methods were strong automatic factual-consistency checks on its benchmarks, so NLI is a useful measured layer, not proof of correctness.

This function is deliberately only a lexical precheck. It can cheaply reject a missing phrase; it can't label entailment. Claims that pass it still need a calibrated verifier or a policy that keeps them provisional.

lexical-citation-precheck.py

import json
import re

def lexical_precheck(claim: str, source_text: str) -> dict[str, str | bool]:
    claim_terms = set(re.findall(r"[a-z0-9]+", claim.lower()))
    source_terms = set(re.findall(r"[a-z0-9]+", source_text.lower()))
    term_match = claim_terms.issubset(source_terms)

    return {
        "claim": claim,
        "term_match": term_match,
        "next_step": "entailment_check" if term_match else "block_claim",
    }

result = lexical_precheck(
    claim="Parser-v2 migrations must start within 30 days",
    source_text="Parser-v2 SDK migrations must start within 30 days.",
)

print(json.dumps(result, indent=2))

Output

{
  "claim": "Parser-v2 migrations must start within 30 days",
  "term_match": true,
  "next_step": "entailment_check"
}

Skills to defend in review

Skill	What good looks like
Offline vs online split	You separate ingestion, cleanup, chunking, and index refresh from the latency-sensitive query path.
Multi-stage query path	You can explain planning, hybrid retrieval, reranking, evidence packing, synthesis, and verification as distinct stages.
Recall improvements	You know when to use query rewriting, decomposition, routing, and HyDE-style hypothetical documents.
Citation verification	You check claims against cited sources with span checks, NLI, and selective fallback regeneration.
Streaming UX	You design progressive rendering with SSE events for plan, sources, text, and verification state.
Latency budget	You allocate TTFT across planning, retrieval, reranking, evidence packing, and prefill instead of treating "LLM latency" as one number.
Serving levers	You know where PagedAttention and speculative decoding help, and why they don't replace retrieval discipline.
Evaluation loop	You combine RAGAS-style reference-free metrics, labeled retrieval metrics, citation metrics, and online feedback.

Production questions to answer

How do you handle queries that need real-time data? Real-time data needs its own freshness pipeline. Use a hot index for newly crawled content plus direct routes to live APIs or structured stores for exact numeric fields. When the planner marks a query as freshness-sensitive, search the hot tier first, apply recency-aware scoring, then merge with the warmer corpus so the LLM sees both new evidence and canonical background context.

How would you detect and handle hallucinated citations? Decompose the generated answer into atomic claims, attach each claim to its cited source, and verify entailment with NLI or a calibrated lighter check. If support is weak or contradictory, flag the statement, remove it, or regenerate with stricter grounding and a smaller evidence set.

What's your latency budget for each pipeline stage? For an interactive developer-facing experience like this one, a reasonable example SLO is roughly 0.8-1.5 seconds to first token. Query planning often fits in 10-50ms, retrieval fan-out in 150-500ms, reranking in 50-150ms, evidence packing in 10-30ms, and synthesis in roughly 300-800ms to first token. Split synthesis into prefill and decode because prompt length drives prefill cost while answer length drives decode cost.

How do you evaluate answer quality at scale? Start with reference-free metrics such as faithfulness, answer relevance, and context relevance. Add labeled retrieval metrics such as Recall@k or nDCG on benchmark sets, plus citation precision and unsupported-claim rate. In production, track citation click-through, reformulation rate, dwell time, abandonment, and explicit user feedback.
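The labeled retrieval metrics in that last answer are short enough to compute directly. This sketch assumes graded relevance labels in a hypothetical `gains` dictionary and uses the linear-gain form of DCG; other formulations, such as exponential gain, give different numbers.

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of labeled relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def dcg(gain_values: list[float]) -> float:
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 is undiscounted.
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gain_values))

def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ordering."""
    actual = dcg([gains.get(document_id, 0.0) for document_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

# Hypothetical labels for one benchmark query.
gains = {"deprecation": 2.0, "rotation": 1.0}
ranked = ["rotation", "changelog", "deprecation"]

recall = recall_at_k(ranked, set(gains), k=2)
ndcg = ndcg_at_k(ranked, gains, k=3)
print("recall@2:", recall)
print("ndcg@3:", round(ndcg, 4))
```

Recall@2 is 0.5 because the most relevant page sits at rank 3, and nDCG@3 penalizes the same ordering for placing the highest-gain document last.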

Common design failures

Designing everything around a single LLM call instead of specialized routers, retrievers, rerankers, and verifiers.
Treating structured or freshness-critical queries like generic vector search.
Ignoring latency requirements for interactive search experiences.
Skipping rigorous citation verification and trusting prompt instructions alone.
Treating long context windows as a replacement for reranking and evidence packing.
Overlooking inference cost per query when scaling to millions of users.

Common mistakes when building generative search

Symptom	Cause	Fix
TTFT is consistently >2 seconds	Running a frontier LLM on every query, even simple lookups	Add a lightweight router that sends factual queries to a small model or cache
Answers cite the wrong source	Evidence packing stuffs too many mediocre passages into the prompt	Reduce top-K after reranking and place the strongest chunks at the start and end of context
Hallucinations persist despite RAG	Missing claim-level verification; relying only on prompt instructions	Stack span matching, NLI, and selective LLM-as-judge checks
Cost per query spikes unpredictably	No freshness tiering; every query hits real-time web retrieval	Route only freshness-marked queries to the hot tier and cache stable answers
Retrieval misses exact API versions or build IDs	Using only dense semantic search without keyword fallback	Hybrid sparse + dense retrieval with BM25 for exact matches

What running this costs

Building a search engine at scale requires careful cost management. Exact dollar figures drift with vendor pricing, model choice, and answer length, so reason from units of work rather than one fixed price table.

Figure: Cost routing ladder where requests escalate from cache to small model to mid-tier synthesis to frontier reasoning only when freshness, evidence burden, or ambiguity require it. Caption: Promote queries only when requirements force it. Freshness, retrieval fan-out, prompt length, and verification policy change route cost.

Cost center	Billable unit	What makes it expensive
Web retrieval	API calls and fetched documents	More sub-queries, more URLs fetched
Reranking	Query-document pairs	Larger candidate set or heavier reranker
Synthesis	Prompt + completion tokens	Long context windows and long answers
Verification	Claims checked against passages	Dense citation requirements, many claims

Before writing the formula, use scenario inputs rather than presenting provider pricing as fixed:

Web retrieval: ~$0.002
Reranking 200 query-document pairs: ~$0.001
1,500 prompt tokens + 400 output tokens at $0.003 per 1K tokens: ~$0.006
Verifying 5 claims: ~$0.001

For these inputs, total cost is roughly $0.01 per query. At 1,000 QPS, that scenario costs about $580 per minute, which is why routing and caching need their own budgets.

query-cost-envelope.py

retrieval_usd = 0.002
rerank_usd = 0.001
prompt_tokens = 1500
output_tokens = 400
token_usd_per_thousand = 0.003
verification_usd = 0.001
queries_per_second = 1000

query_usd = (
    retrieval_usd
    + rerank_usd
    + (prompt_tokens + output_tokens) / 1000 * token_usd_per_thousand
    + verification_usd
)
per_minute_usd = query_usd * queries_per_second * 60

assert round(query_usd, 4) == 0.0097
print("scenario_query_usd:", round(query_usd, 4))
print("scenario_per_minute_usd:", round(per_minute_usd, 2))

Output

scenario_query_usd: 0.0097
scenario_per_minute_usd: 582.0

\begin{aligned} \text{cost/query} \approx\;& C_{\text{search}} + N_{\text{pairs}} \cdot C_{\text{rerank}} \\ &+ (T_{\text{prompt}} + T_{\text{output}}) \cdot C_{\text{token}} \\ &+ N_{\text{claims}} \cdot C_{\text{verify}} \end{aligned}

Which stage dominates variable cost depends on provider rates, fan-out, prompt length, and verification policy. Measure the breakdown, then route cacheable or structured queries away from unnecessary synthesis and trim top-K where retrieval evaluation allows it.
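To make "measure the breakdown" concrete, the sketch below reuses the scenario inputs from the worked example and reports each stage's share of per-query cost. The 30% cache hit rate, and the simplification that a cache hit costs roughly nothing, are assumptions added for illustration.

```python
# Scenario inputs copied from the worked example; none are provider prices.
stage_usd = {
    "retrieval": 0.002,
    "reranking": 0.001,
    "synthesis": (1500 + 400) / 1000 * 0.003,
    "verification": 0.001,
}
total_usd = sum(stage_usd.values())

# Share of per-query cost by stage, largest first.
shares = {
    stage: round(cost / total_usd, 2)
    for stage, cost in sorted(stage_usd.items(), key=lambda item: -item[1])
}
dominant_stage = max(stage_usd, key=stage_usd.get)

# Hypothetical: 30% of queries are served from an eligible cache at roughly
# zero marginal cost, so only the remaining 70% pay for the full pipeline.
cache_hit_rate = 0.30
blended_usd = (1 - cache_hit_rate) * total_usd

print("shares:", shares)
print("dominant_stage:", dominant_stage)                 # synthesis
print("blended_query_usd:", round(blended_usd, 4))       # 0.0068
```

With these inputs synthesis is the largest share, so routing eligible queries away from synthesis moves the total more than trimming rerank pairs would. Different provider rates or fan-out can change which stage dominates.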

Serving-side latency levers

TTFT can be dominated by retrieval, prompt assembly, or prefill; long answers can shift pressure toward decode. Profile the critical path before choosing an optimization.

PagedAttention improves KV-cache memory management and can increase concurrency when KV allocation is the binding constraint [8]. If decode is the bottleneck, speculative decoding can reduce latency when a draft model is accepted often enough; benchmark it for the answer shapes and target model in this service [9].

These serving optimizations matter only after retrieval depth and prompt length are under control. If you keep sending 40 mediocre passages to the model, no amount of serving cleverness will rescue TTFT.
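For speculative decoding, "accepted often enough" can be estimated before benchmarking. The sketch below follows the analysis in the speculative decoding paper, which assumes each drafted token is accepted independently with rate `acceptance_rate`. The function names and the draft-to-target cost ratio of 0.05 are illustrative assumptions; treat the result as a planning estimate, not a measured speedup.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per target-model pass: (1 - a^(g+1)) / (1 - a)."""
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    return (1 - acceptance_rate ** (draft_tokens + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, draft_tokens: int, cost_ratio: float) -> float:
    """Wall-clock estimate that also charges for running the draft model.

    cost_ratio is draft-step time divided by target-step time (assumed here).
    """
    tokens = expected_tokens_per_target_pass(acceptance_rate, draft_tokens)
    return tokens / (draft_tokens * cost_ratio + 1)


for rate in (0.5, 0.8):
    tokens = expected_tokens_per_target_pass(rate, draft_tokens=4)
    speedup = estimated_speedup(rate, draft_tokens=4, cost_ratio=0.05)
    print(rate, round(tokens, 2), round(speedup, 2))
# 0.5 1.94 1.61
# 0.8 3.36 2.8
```

The gap between the two rows is the point: the same draft length yields a much smaller gain when the draft model is accepted only half the time, which is why acceptance rate should be measured on this service's real answer shapes.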

GPU capacity planning

For a scenario with 1K QPS sustained and 1.5 seconds average latency per request:

\text{Concurrent requests} = \text{QPS} \times \text{avg latency} = 1000 \times 1.5 = 1500

This is Little's Law (average items in system = arrival rate × average time in system) in practice: at 1000 requests/second with a 1.5-second average time in system, around 1500 requests are in flight simultaneously.

capacity-envelope.py

from math import ceil

queries_per_second = 1000
average_latency_seconds = 1.5
average_output_tokens = 120
effective_decode_tokens_per_second_per_gpu = 6000

in_flight_requests = queries_per_second * average_latency_seconds
decode_gpus = ceil(
    queries_per_second
    * average_output_tokens
    / effective_decode_tokens_per_second_per_gpu
)

assert in_flight_requests == 1500
print("in_flight_requests:", int(in_flight_requests))
print("decode_gpus_before_headroom:", decode_gpus)

Output

in_flight_requests: 1500
decode_gpus_before_headroom: 20

The next step is to translate concurrency into throughput and memory requirements for your actual serving stack. Don't assume a universal "requests per GPU" number. Measure your own model, context length, batching strategy, and quantization settings. PagedAttention improves packing efficiency and reduces KV-cache fragmentation under concurrent load [8], but capacity still depends on empirical tokens-per-second benchmarks.

\text{GPUs needed} \approx \left\lceil \frac{\text{QPS} \times \text{avg output tokens}}{\text{effective decode tok/s/GPU}} \right\rceil

For long prompts, also budget prefill separately:

\text{Prefill GPUs needed} \approx \left\lceil \frac{\text{QPS} \times \text{avg prompt tokens}}{\text{effective prefill tok/s/GPU}} \right\rceil

In practice, size the system against whichever pool saturates first: retriever GPUs, reranker GPUs, prefill workers, or decode workers. The answer model may be an expensive tier, but measured retrieval or reranking can still be the first bottleneck.
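A rough way to find the pool that saturates first is to compute utilization per pool from demand, per-GPU throughput, and provisioned GPUs. In this sketch the per-query work figures reuse numbers from the earlier scenarios, while every throughput and provisioning value is a hypothetical placeholder to replace with benchmarks from your own stack.

```python
from math import ceil

queries_per_second = 1000

# Work generated per query. 200 rerank pairs and 1,500 prompt tokens come from
# the cost scenario; 120 output tokens comes from the capacity script.
work_per_query = {"reranker": 200, "prefill": 1500, "decode": 120}

# Hypothetical measured throughput per GPU (pairs/s for the reranker,
# tokens/s for prefill and decode).
throughput_per_gpu = {"reranker": 8_000, "prefill": 50_000, "decode": 6_000}

# Hypothetical GPUs currently provisioned in each pool.
provisioned_gpus = {"reranker": 28, "prefill": 40, "decode": 30}

# Minimum GPUs per pool before any headroom is added.
min_gpus = {
    pool: ceil(queries_per_second * work / throughput_per_gpu[pool])
    for pool, work in work_per_query.items()
}

# Fraction of each pool's provisioned capacity consumed at this load.
utilization = {
    pool: queries_per_second * work / (throughput_per_gpu[pool] * provisioned_gpus[pool])
    for pool, work in work_per_query.items()
}
first_to_saturate = max(utilization, key=utilization.get)

print("min_gpus:", min_gpus)  # {'reranker': 25, 'prefill': 30, 'decode': 20}
print("utilization:", {pool: round(value, 2) for pool, value in utilization.items()})
print("first_to_saturate:", first_to_saturate)  # reranker
```

With these assumed numbers the reranker pool is the tightest even though it needs fewer GPUs than prefill, because it was provisioned with the least headroom.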

How to know it's working

RAGAS [10] introduced reference-free evaluation around faithfulness, answer relevance, and context relevance. It's one offline layer when gold answers are scarce, not a replacement for labeled retrieval metrics, citation checks, or online product metrics: a response can be consistent with mediocre evidence and still fail the user.

Layer	Metric	What it catches	Typical requirement
Reference-free offline	Faithfulness	Unsupported claims	LLM/NLI judge over answer vs context
Reference-free offline	Answer relevance	Missed user intent	Generated-question similarity or judge
Reference-free offline	Context relevance	Noisy, redundant retrieved context	Judge over question vs retrieved passages
Labeled offline	Recall@k / nDCG / MRR	Retriever missed or buried the right source	Gold passages or reliable click labels
Citation quality	Citation precision / unsupported-claim rate	Wrong citations or uncited claims	Claim extraction + source verification
Online	Reformulation rate / citation CTR / abandonment	Real user dissatisfaction	Production logging and experiments

This example wires a scorecard together with simple token-overlap proxies so it stays runnable. Its citation-resolution metric only checks whether a visible citation ID maps to an available passage; it doesn't prove claim support. These proxies aren't RAGAS or calibrated faithfulness judges. In a real evaluation service, replace them with calibrated judges and keep the labeled retrieval metrics separate.

how-to-know-its-working.py

import json
import re
from dataclasses import dataclass
from collections.abc import Mapping
from math import log2

@dataclass(frozen=True)
class Passage:
    id: str
    text: str

def tokens(text: str) -> set[str]:
    return set(text.lower().replace(",", "").replace(".", "").split())

def overlap(a: str, b: str) -> float:
    left = tokens(a)
    right = tokens(b)
    return len(left & right) / max(1, len(left))

def citation_resolution_precision(answer: str, cited_passages: list[Passage]) -> float:
    available_ids = {passage.id for passage in cited_passages}
    cited_ids = re.findall(r"\[([^\[\]]+)\]", answer)
    return sum(cited_id in available_ids for cited_id in cited_ids) / max(1, len(cited_ids))

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    return len(set(retrieved_ids[:k]) & gold_ids) / max(1, len(gold_ids))

def ndcg_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    gains = [1.0 if item in gold_ids else 0.0 for item in retrieved_ids[:k]]
    dcg = sum(gain / log2(rank + 2) for rank, gain in enumerate(gains))
    ideal = sum(1.0 / log2(rank + 2) for rank in range(min(len(gold_ids), k)))
    return dcg / ideal if ideal else 0.0

def score_search_response(
    query: str,
    answer: str,
    retrieved_passages: list[Passage],
    cited_passages: list[Passage],
    gold_passage_ids: set[str] | None = None,
) -> Mapping[str, float]:
    context = " ".join(passage.text for passage in retrieved_passages)
    scorecard = {
        "answer_context_token_overlap": overlap(answer, context),
        "query_answer_token_overlap": overlap(query, answer),
        "query_context_token_overlap": overlap(query, context),
        "citation_resolution_precision": citation_resolution_precision(answer, cited_passages),
    }

    if gold_passage_ids is not None:
        retrieved_ids = [p.id for p in retrieved_passages]
        scorecard["recall_at_20"] = recall_at_k(retrieved_ids, gold_passage_ids, k=20)
        scorecard["ndcg_at_10"] = ndcg_at_k(retrieved_ids, gold_passage_ids, k=10)

    return {metric: round(value, 2) for metric, value in scorecard.items()}

retrieved = [
    Passage("1", "Parser-v2 SDK migrations must start within 30 days."),
    Passage("2", "Service-account OAuth rotation exceptions require security approval."),
    Passage("3", "Legacy parser-v1 migrations follow a separate workflow."),
]
answer = "Parser-v2 SDK migrations must start within 30 days [1]."

print(json.dumps(
    score_search_response(
        query="Which deprecation policy applies to parser-v2 SDK migration for service accounts?",
        answer=answer,
        retrieved_passages=retrieved,
        cited_passages=retrieved[:1],
        gold_passage_ids={"1", "2"},
    ),
    indent=2,
))

Output

{
  "answer_context_token_overlap": 0.89,
  "query_answer_token_overlap": 0.18,
  "query_context_token_overlap": 0.18,
  "citation_resolution_precision": 1.0,
  "recall_at_20": 1.0,
  "ndcg_at_10": 1.0
}
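The metrics table also lists MRR, which the scorecard does not compute. A minimal sketch of mean reciprocal rank over several labeled queries follows; the three `runs` are made-up examples, and the function names are illustrative.

```python
def reciprocal_rank(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 when none is retrieved."""
    for rank, item in enumerate(retrieved_ids, start=1):
        if item in gold_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank across (retrieved_ids, gold_ids) pairs."""
    return sum(reciprocal_rank(ids, gold) for ids, gold in runs) / max(1, len(runs))


runs = [
    (["1", "2", "3"], {"1", "2"}),  # first relevant result at rank 1
    (["3", "2", "1"], {"1"}),       # first relevant result at rank 3
    (["3", "4"], {"9"}),            # relevant passage never retrieved
]
print("mrr:", round(mean_reciprocal_rank(runs), 2))  # mrr: 0.44
```

MRR rewards putting one correct source near the top, so it complements Recall@k, which only asks whether the gold passages appear anywhere in the top k.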

Scaling without blowing the budget

As traffic grows, evaluate each latency or cost lever against quality and authorization tests:

Strategy	Implementation Details	Primary Benefit
Semantic response cache	Key answers by authorization scope and evidence version; reuse only after similarity and freshness thresholds pass.	Can avoid synthesis work for eligible repeat queries.
Model routing	Route exact live facts to structured backends and measured simple paths to smaller models.	Can lower cost without weakening answer routes that require synthesis.
Speculative retrieval	Start an allowed cheap retriever while rewrite or decomposition runs, then merge or cancel.	Can shorten measured critical path when early retrieval is useful.
Asynchronous verification	Mark streamed claims pending, or hold high-risk facts until checks return.	Trades display latency against unsupported-claim exposure explicitly.
Geographic distribution	Place permitted retrieval and caches near users when data residency policy allows it.	Can reduce network latency for eligible data paths.
Feedback integration	Log interactions with privacy controls and use labeled review before training rerankers.	Supplies evidence for retriever or reranker changes.

Mastery check

Key concepts

Offline ingest vs online query path
Query planning, routing, and freshness detection
Hybrid retrieval with BM25, dense retrieval, and rank fusion
Reranking, evidence packing, and source-ID preservation
Grounded synthesis with citations
Claim-level verification with span checks and NLI
Streaming answer delivery with SSE
Latency budgeting, routing, and cost control
Offline and online evaluation for search quality

What strong answers show

Foundational: Explains why generative search needs both offline indexing and an online retrieval plus synthesis path
Intermediate: Chooses when to route a query to structured APIs, hybrid retrieval, or fresh-web lookup
Intermediate: Explains why hybrid sparse plus dense retrieval beats either retriever alone for developer-doc search
Intermediate: Designs a TTFT budget that includes planning, retrieval, reranking, packing, and synthesis rather than treating latency as one LLM number
Advanced: Designs claim-level verification that distinguishes quote matching, NLI, and selective fallback regeneration
Advanced: Defends routing and caching rules that keep cost and latency under control at high QPS
Advanced: Separates retrieval metrics, citation metrics, and online product metrics in one search evaluation loop

Common pitfalls

"One LLM call can replace search architecture"

Symptom: Answers sound polished but use stale, missing, or unauthorized evidence.
Cause: Retrieval, routing, and verification were collapsed into one generation step.
Fix: Split the system into planner, retriever, reranker, evidence packer, synthesizer, and verifier. Keep trust boundaries outside the model.

"Dense retrieval is enough"

Symptom: Search misses exact API versions, build IDs, or policy titles even though semantically related passages appear.
Cause: Dense retrieval is good at paraphrases, not exact lexical anchors.
Fix: Use hybrid sparse plus dense retrieval, then fuse ranks before reranking.
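One common way to fuse the sparse and dense rankings is reciprocal rank fusion (RRF), sketched below. The document IDs are invented for illustration, and `k = 60` follows the constant used in the RRF paper by Cormack et al.; whether RRF or another fusion method suits a given corpus is something to confirm with labeled retrieval metrics.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists by summing 1 / (k + rank) for each document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: BM25 finds the exact version string first, while the
# dense retriever prefers a semantically related guide.
sparse = ["api-v2.3.1", "policy-a", "guide-b"]
dense = ["guide-b", "api-v2.3.1", "overview-c"]

print(reciprocal_rank_fusion([sparse, dense]))
# ['api-v2.3.1', 'guide-b', 'policy-a', 'overview-c']
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against cosine similarities, which live on different scales.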

"Long context makes reranking optional"

Symptom: The right document was retrieved, but the final answer still ignores or misstates it.
Cause: Dumping many mediocre passages into context creates prompt bloat and lost-in-the-middle effects.
Fix: Rerank aggressively, keep source IDs attached, and pack only the strongest evidence the answer model needs.
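A minimal sketch of evidence packing under a token budget is shown below. The `ScoredPassage` type, the 0.5 score floor, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions made for the example; a real packer would use the answer model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPassage:
    source_id: str
    text: str
    rerank_score: float


def pack_evidence(passages: list[ScoredPassage], token_budget: int,
                  min_score: float = 0.5) -> str:
    """Keep the highest-scoring passages that fit, each tagged with its source ID."""
    packed: list[str] = []
    used = 0
    for passage in sorted(passages, key=lambda p: p.rerank_score, reverse=True):
        if passage.rerank_score < min_score:
            break  # everything after this point is weaker still
        cost = len(passage.text.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # skip passages that would overflow the budget
        packed.append(f"[{passage.source_id}] {passage.text}")
        used += cost
    return "\n".join(packed)


candidates = [
    ScoredPassage("3", "Legacy parser-v1 migrations follow a separate workflow.", 0.34),
    ScoredPassage("1", "Parser-v2 SDK migrations must start within 30 days.", 0.92),
    ScoredPassage("2", "Service-account OAuth rotation exceptions require security approval.", 0.81),
]
print(pack_evidence(candidates, token_budget=16))
# [1] Parser-v2 SDK migrations must start within 30 days.
# [2] Service-account OAuth rotation exceptions require security approval.
```

Prefixing each passage with its source ID is what lets the synthesizer cite `[1]` or `[2]` and lets a later verifier map each citation back to the exact text.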

"Citations alone prove factuality"

Symptom: Answers show footnotes but still contain unsupported or misleading claims.
Cause: The system attached sources without checking whether each claim is entailed by the cited text.
Fix: Add claim extraction plus span checks, NLI, and selective fallback regeneration for weak support.
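The claim-level check can be sketched as a small state function. Everything here is illustrative: the `Claim` type and state names are hypothetical, and `toy_entails` is a deliberately naive word-containment stand-in so the example runs without a model. It is not NLI and should be replaced by a calibrated entailment model or judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Claim:
    text: str          # atomic claim extracted from the answer
    quoted_span: str   # span the synthesizer says supports the claim
    source_id: str     # citation ID attached to the claim


def support_state(claim: Claim, passages: dict[str, str],
                  entails: Callable[[str, str], bool]) -> str:
    passage = passages.get(claim.source_id)
    if passage is None:
        return "unresolved_citation"   # citation points at nothing we packed
    if claim.quoted_span not in passage:
        return "span_missing"          # quoted evidence is not in the source
    if not entails(passage, claim.text):
        return "not_entailed"          # source exists but does not support the claim
    return "supported"


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Naive placeholder: every hypothesis word must appear in the premise."""
    words = hypothesis.lower().rstrip(".").split()
    return all(word in premise.lower() for word in words)


passages = {
    "1": "Parser-v2 SDK migrations must start within 30 days.",
    "2": "Service-account OAuth rotation exceptions require security approval.",
}
claims = [
    Claim("Parser-v2 SDK migrations must start within 30 days.", "must start within 30 days", "1"),
    Claim("Parser-v2 SDK migrations must start within 90 days.", "must start within 30 days", "1"),
    Claim("Exceptions require security approval.", "require legal approval", "2"),
    Claim("Exceptions require security approval.", "require security approval", "7"),
]
states = [support_state(claim, passages, toy_entails) for claim in claims]
print(states)
# ['supported', 'not_entailed', 'span_missing', 'unresolved_citation']
```

Only claims that end in a state other than `supported` need selective regeneration or removal, which keeps the fallback cost proportional to the number of weak claims rather than the length of the answer.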

"Cost tuning starts at the model endpoint"

Symptom: Query cost spikes even after switching to a cheaper model.
Cause: Retrieval fan-out, rerank pairs, long prompts, and verification work were left uncontrolled.
Fix: Route simple queries earlier, trim top-K harder, cache safely, and reserve frontier synthesis for genuinely multi-hop requests.

Practice drill

Create a search-answer trace for one developer-doc query that mixes live facts and cited policy text:

Split the query into structured API calls, sparse retrieval, dense retrieval, reranking, evidence packing, synthesis, and claim verification.
Log which documents were removed by ACL filtering before fusion or reranking.
Add a claim-support table with atomic claim, cited span, entailment result, and final support state.
Mark latency budget by stage and choose one optimization that doesn't weaken authorization or citation support.

The trace should prove the answer was authorized, fresh, cited, and verifiable, not merely fluent.
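For the ACL step of the drill, one possible starting point is a filter that returns both the allowed candidates and a log of what it removed. The group-based permission model, the `Candidate` type, and the document IDs are assumptions for this sketch; real systems may use per-user grants, attribute policies, or an external authorization service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    allowed_groups: frozenset[str]


def filter_by_acl(candidates: list[Candidate],
                  user_groups: frozenset[str]) -> tuple[list[Candidate], list[str]]:
    """Split candidates into those the user may see and the IDs that were removed."""
    allowed: list[Candidate] = []
    removed_ids: list[str] = []
    for candidate in candidates:
        if candidate.allowed_groups & user_groups:
            allowed.append(candidate)
        else:
            removed_ids.append(candidate.doc_id)
    return allowed, removed_ids


candidates = [
    Candidate("api-docs-oauth", frozenset({"everyone"})),
    Candidate("private-roadmap-q3", frozenset({"product-leads"})),
    Candidate("incident-notes-114", frozenset({"sre"})),
]
allowed, removed_ids = filter_by_acl(candidates, frozenset({"everyone", "sre"}))

print([candidate.doc_id for candidate in allowed])  # ['api-docs-oauth', 'incident-notes-114']
print(removed_ids)                                  # ['private-roadmap-q3']
```

Running this before fusion and reranking means removed documents never influence scores or reach the prompt, and the removal log gives the trace its authorization evidence. Treat that log as sensitive, since it names documents the user is not allowed to see.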

Next Step

Continue to Vision-Language Models & CLIP

Search turned retrieved text evidence into cited answers. Now you'll make images searchable too: learn how CLIP aligns text and images, then see how modern vision-language models bring visual evidence into multimodal systems.


References

[1] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis, P., et al. · 2020 · NeurIPS 2020

[2] Precise Zero-Shot Dense Retrieval without Relevance Labels. Gao, L., Ma, X., Lin, J., & Callan, J. · 2022 · arXiv preprint

[3] Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09

[4] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Khattab, O., & Zaharia, M. · 2020 · SIGIR 2020

[5] Lost in the Middle: How Language Models Use Long Contexts. Liu, N.F., et al. · 2023 · TACL 2023

[6] Using server-sent events. MDN Web Docs · 2026

[7] TRUE: Re-evaluating Factual Consistency Evaluation. Honovich, O., et al. · 2022 · NAACL 2022

[8] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023 · https://arxiv.org/abs/2309.06180

[9] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023 · https://arxiv.org/abs/2211.17192

[10] RAGAS: Automated Evaluation of Retrieval Augmented Generation. Es, S., et al. · 2023 · arXiv preprint · https://arxiv.org/abs/2309.15217
