Chunking Strategies

Turn clean documents into retrieval units that preserve answers, citations, and measurable search quality.


The previous lesson produced a clean record from the ShopFlow returns policy:

returns-policy-v3.pdf, page 7: Damaged electronics: report within 48 hours with photos.

That sentence is trustworthy evidence only while its pieces stay together. Split it into one chunk containing "Damaged electronics" and another containing "48 hours with photos," and a search result can retrieve the condition without the deadline or the deadline without its condition.

Chunking turns normalized records into searchable evidence units. In a retrieval-augmented generation (RAG) system, a retriever selects passages from an external index and the generator answers from those passages.[1] Your chunk boundaries decide what a retrieved passage can prove.

[Figure: A clean policy record split into answerable chunks with locator metadata, then indexed and checked against labeled customer questions.]
Ingestion made text faithful. Chunking now decides which complete evidence units can be searched and cited.

What a useful chunk must preserve

Suppose a customer asks, "My headphones arrived damaged. When must I report it, and what do I send?"

| Candidate retrieval unit | Searchable? | Answerable? | Problem |
| --- | --- | --- | --- |
| Damaged electronics: report within | yes | no | deadline and proof are missing |
| 48 hours with photos. Unopened electronics: return within 14 days. | yes | no | condition is missing and a competing rule is present |
| Damaged electronics: report within 48 hours with photos. | yes | yes | preserves condition, deadline, and proof |

The first engineering requirement isn't "make chunks small." It's "make each retrieved chunk a defensible piece of evidence."

check-answerable-chunks.py

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Chunk:
        chunk_id: str
        text: str
        source_id: str
        locator: str

    def is_answerable(chunk: Chunk) -> bool:
        required = ["damaged electronics", "48 hours", "photos"]
        lowered = chunk.text.lower()
        return all(phrase in lowered for phrase in required)

    chunks = [
        Chunk("broken-condition", "Damaged electronics: report within", "returns-policy-v3.pdf", "page=7"),
        Chunk("broken-window", "48 hours with photos. Unopened electronics return within 14 days.", "returns-policy-v3.pdf", "page=7"),
        Chunk("complete", "Damaged electronics: report within 48 hours with photos.", "returns-policy-v3.pdf", "page=7"),
    ]

    for chunk in chunks:
        print(f"{chunk.chunk_id}: answerable={is_answerable(chunk)}")

Output

    broken-condition: answerable=False
    broken-window: answerable=False
    complete: answerable=True

Fixed windows show the boundary failure

The simplest splitter takes windows of tokens. If a window has size $N$ and overlap $O$, each window starts $N - O$ tokens after the previous one. Overlap repeats boundary text, which can help continuity, but it can't guarantee that a complete policy rule survives.

The lab uses whitespace-separated words as visible token stand-ins. A production pipeline should measure with the tokenizer used by its embedding model.

split-fixed-windows.py

    def fixed_windows(text: str, size: int, overlap: int) -> list[str]:
        if size <= 0 or overlap < 0 or overlap >= size:
            raise ValueError("require size > overlap >= 0")
        words = text.split()
        step = size - overlap
        return [
            " ".join(words[start : start + size])
            for start in range(0, len(words), step)
            if words[start : start + size]
        ]

    policy = (
        "Damaged electronics report within 48 hours with photos. "
        "Unopened electronics return within 14 days."
    )

    for index, chunk in enumerate(fixed_windows(policy, size=6, overlap=2)):
        print(f"{index}: {chunk}")

Output

    0: Damaged electronics report within 48 hours
    1: 48 hours with photos. Unopened electronics
    2: Unopened electronics return within 14 days.
    3: 14 days.
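The last window in that output, "14 days.", contains only words that window 2 already holds. The sketch below reproduces the start offsets the lab produces and shows one way to drop a tail window that adds nothing new. The helper names `window_starts` and `drop_covered_tail` are hypothetical, and whitespace words are still assumed as token stand-ins.

```python
import math

def window_starts(n_words: int, size: int, overlap: int) -> list[int]:
    """Start offsets used by the lab's fixed_windows: every multiple of the step."""
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("require size > overlap >= 0")
    step = size - overlap
    return list(range(0, n_words, step))

def drop_covered_tail(starts: list[int], n_words: int, size: int) -> list[int]:
    """Keep a start only if its window reaches words no earlier window indexed."""
    kept: list[int] = []
    covered_until = 0  # exclusive index of the last word already indexed
    for start in starts:
        end = min(start + size, n_words)
        if end > covered_until:
            kept.append(start)
            covered_until = end
    return kept

n_words, size, overlap = 14, 6, 2
starts = window_starts(n_words, size, overlap)
print(starts)                                                # [0, 4, 8, 12]
print(len(starts) == math.ceil(n_words / (size - overlap)))  # True
print(drop_covered_tail(starts, n_words, size))              # [0, 4, 8]
```

Dropping the covered tail shrinks the index slightly without removing any text. It doesn't repair a rule that was already cut at an earlier boundary.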

Now measure what overlap bought you. More repeated words increase indexed text, but the rule becomes useful only when a window carries all three required pieces.

measure-overlap-cost-and-coverage.py

    def fixed_windows(text: str, size: int, overlap: int) -> list[str]:
        words = text.split()
        step = size - overlap
        return [
            " ".join(words[start : start + size])
            for start in range(0, len(words), step)
            if words[start : start + size]
        ]

    def carries_damage_rule(text: str) -> bool:
        lowered = text.lower()
        return all(term in lowered for term in ["damaged electronics", "48 hours", "photos"])

    policy = "Damaged electronics report within 48 hours with photos. Unopened electronics return within 14 days."
    for overlap in [0, 2, 4]:
        chunks = fixed_windows(policy, size=7, overlap=overlap)
        complete = sum(carries_damage_rule(chunk) for chunk in chunks)
        indexed_words = sum(len(chunk.split()) for chunk in chunks)
        print(f"overlap={overlap}: chunks={len(chunks)} indexed_words={indexed_words} complete={complete}")

Output

    overlap=0: chunks=2 indexed_words=14 complete=0
    overlap=2: chunks=3 indexed_words=18 complete=0
    overlap=4: chunks=5 indexed_words=28 complete=0

This is the right posture for overlap: a configuration to evaluate, not a ritual to apply to every document.
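The cost side of that measurement can be expressed as a ratio of indexed words to source words. This is a small sketch using the numbers printed above. The name `index_expansion` is hypothetical and not part of the lab.

```python
def index_expansion(source_words: int, indexed_words: int) -> float:
    """Ratio of words stored in the index to words in the source text."""
    if source_words <= 0:
        raise ValueError("source_words must be positive")
    return indexed_words / source_words

SOURCE_WORDS = 14  # length of the two-sentence policy in whitespace words
# overlap -> (indexed_words, complete), copied from the measurement above
measured = {0: (14, 0), 2: (18, 0), 4: (28, 0)}

for overlap, (indexed_words, complete) in measured.items():
    ratio = index_expansion(SOURCE_WORDS, indexed_words)
    print(f"overlap={overlap}: expansion={ratio:.2f}x complete={complete}")
```

At overlap 4 the index stores every word twice on average, and the damage rule is still never complete in a single window.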

Prefer policy structure when you have it

File ingestion preserved headings and source locations. Use them. A heading boundary is more meaningful than an arbitrary word count when the source already says which rule belongs together.

LangChain's RecursiveCharacterTextSplitter is a common generic-text baseline. Its documentation describes trying separators in order, with the default sequence ["\n\n", "\n", " ", ""], so paragraphs are preferred before smaller cuts.[2] Start with explicit heading boundaries when the source provides them. Add the smaller-cut fallback only for sections that still exceed your size limit.
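To make the separator fallback concrete, here is a minimal plain-Python sketch of the idea: try the coarsest separator first and recurse into finer ones only for pieces that are still too large. It is an illustration under simplifying assumptions (character counts, no overlap, separators dropped at chunk boundaries) and is not LangChain's implementation. The name `recursive_split` is hypothetical.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Greedy fallback splitter: try coarse separators before finer ones."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut by character count.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    separator, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the same chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too large, so retry it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks

policy = (
    "Damaged electronics: report within 48 hours with photos.\n\n"
    "Unopened electronics: return within 14 days."
)
for chunk in recursive_split(policy, max_chars=60):
    print(repr(chunk))
```

With `max_chars=60` the two paragraphs come back as two chunks, because each fits under the limit. Lowering the limit forces the word-level fallback, which is exactly where a rule can be cut in half.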

split-markdown-policy-sections.py

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PolicyChunk:
        heading: str
        body: str
        indexed_text: str
        locator: str

    def headed_chunks(markdown: str, source_locator: str) -> list[PolicyChunk]:
        chunks: list[PolicyChunk] = []
        heading = "Document"
        body: list[str] = []

        def flush() -> None:
            if body:
                body_text = " ".join(body)
                chunks.append(
                    PolicyChunk(
                        heading,
                        body_text,
                        f"{heading}\n{body_text}",
                        f"{source_locator}#{heading.lower().replace(' ', '-')}",
                    )
                )
                body.clear()

        for line in markdown.strip().splitlines():
            if line.startswith("## "):
                flush()
                heading = line[3:]
            elif line.strip():
                body.append(line.strip())
        flush()
        return chunks

    policy = """## Damaged electronics
    Report within 48 hours with photos.
    ## Unopened electronics
    Return within 14 days in original packaging."""

    for chunk in headed_chunks(policy, "page=7"):
        print(chunk.indexed_text.replace("\n", " | "), "|", chunk.locator)

Output

    Damaged electronics | Report within 48 hours with photos. | page=7#damaged-electronics
    Unopened electronics | Return within 14 days in original packaging. | page=7#unopened-electronics

Each chunk's indexed_text includes its heading, so the searchable text keeps the condition attached to the deadline. The locator separately preserves the path back to original evidence.

A policy table needs an equally deliberate rule. Splitting a row away from its column names turns exact information into ambiguous fragments.

keep-policy-table-with-header.py

    table = [
        "| Condition | Window | Evidence |",
        "| --- | --- | --- |",
        "| Damaged electronics | 48 hours | Photos |",
        "| Unopened electronics | 14 days | Original packaging |",
    ]

    def table_chunk(lines: list[str], heading: str) -> dict[str, str]:
        return {
            "heading": heading,
            "text": "\n".join(lines),
            "quality_check": "header_present" if lines[0].startswith("| Condition |") else "review",
        }

    chunk = table_chunk(table, "Return windows")
    print(chunk["heading"], chunk["quality_check"], f"rows={len(table) - 2}")
    print("48 hours" in chunk["text"] and "Photos" in chunk["text"])

Output

    Return windows header_present rows=2
    True
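When a table is too long for one chunk, the same rule suggests repeating the header in every piece. The sketch below assumes a Markdown-style table whose first two lines are the header and the divider. `table_row_chunks` is a hypothetical helper, not part of the lab.

```python
def table_row_chunks(lines: list[str], rows_per_chunk: int) -> list[str]:
    """Split a Markdown-style table into row groups, repeating header and divider."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    if len(lines) < 2:
        raise ValueError("expected a header line and a divider line")
    header, divider, rows = lines[0], lines[1], lines[2:]
    return [
        "\n".join([header, divider, *rows[start : start + rows_per_chunk]])
        for start in range(0, len(rows), rows_per_chunk)
    ]

table = [
    "| Condition | Window | Evidence |",
    "| --- | --- | --- |",
    "| Damaged electronics | 48 hours | Photos |",
    "| Unopened electronics | 14 days | Original packaging |",
]

for piece in table_row_chunks(table, rows_per_chunk=1):
    print(piece)
    print("---")
```

Each piece stays readable on its own because "48 hours" still sits under the "Window" column name. The repeated header costs some index size.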

For a long section, split inside that section while carrying its heading and original locator into every child chunk.

split-long-section-with-provenance.py

    def section_windows(text: str, heading: str, source: str, size: int) -> list[dict[str, str]]:
        words = text.split()
        return [
            {
                "text": " ".join(words[start : start + size]),
                "heading": heading,
                "source": source,
                "chunk_id": f"{source}#{heading.lower().replace(' ', '-')}-{start // size}",
            }
            for start in range(0, len(words), size)
        ]

    children = section_windows(
        "Report within 48 hours with photos. Include order number and a clear image of damage.",
        heading="Damaged electronics",
        source="returns-policy-v3.pdf:page=7",
        size=7,
    )

    for child in children:
        print(child["chunk_id"], "|", child["heading"], "|", child["text"])

Output

    returns-policy-v3.pdf:page=7#damaged-electronics-0 | Damaged electronics | Report within 48 hours with photos. Include
    returns-policy-v3.pdf:page=7#damaged-electronics-1 | Damaged electronics | order number and a clear image of
    returns-policy-v3.pdf:page=7#damaged-electronics-2 | Damaged electronics | damage.
[Figure: Comparison of broken fixed windows, a complete section-aware policy chunk, and parent expansion for the damaged-electronics answer.]
Only the complete retrieval unit preserves condition, deadline, and evidence requirement together.

Evaluate boundaries with labeled questions

A chunking strategy isn't good because it sounds sophisticated. It works when labeled questions retrieve complete supporting evidence.

Start with a tiny, transparent score before involving a vector database. For the query terms {damaged, electronics, photos, hours}, the complete damage-policy chunk matches all four. A fragment that contains only {damaged, electronics} may rank, but it can't answer the question.

| Chunk | Matching query terms | Contains answer? |
| --- | --- | --- |
| complete damage rule | 4 / 4 | yes |
| damaged-condition fragment | 2 / 4 | no |
| unopened-item rule | 1 / 4 | no |

The following lexical scorer is deliberately simple. It isolates the effect of boundaries; replace its score with real embeddings after the fixture and expected evidence are stable.

retrieve-complete-evidence.py

    import re

    def terms(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def retrieve(query: str, chunks: list[dict[str, str]]) -> dict[str, str]:
        query_terms = terms(query)
        return max(chunks, key=lambda chunk: len(query_terms & terms(chunk["text"])))

    query = "damaged electronics photos hours"
    chunks = [
        {"id": "damage-rule", "text": "Damaged electronics: report within 48 hours with photos."},
        {"id": "general-rule", "text": "Unopened electronics: return within 14 days."},
    ]

    hit = retrieve(query, chunks)
    print(hit["id"], hit["text"])

Output

    damage-rule Damaged electronics: report within 48 hours with photos.

Now compare a broken fixed-window configuration against a section-aware configuration using the same question and the same expected evidence phrase.

compare-chunking-configurations.py

    import re

    def terms(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def top_chunk(query: str, chunks: list[str]) -> str:
        query_terms = terms(query)
        return max(chunks, key=lambda text: len(query_terms & terms(text)))

    query = "damaged electronics photos"
    expected = "Damaged electronics: report within 48 hours with photos."
    configs = {
        "broken-fixed": [
            "Damaged electronics: report within",
            "48 hours with photos. Unopened electronics: return within 14 days.",
        ],
        "section-aware": [
            "Damaged electronics: report within 48 hours with photos.",
            "Unopened electronics: return within 14 days.",
        ],
    }

    for name, chunks in configs.items():
        hit = top_chunk(query, chunks)
        print(f"{name}: complete={expected in hit}")

Output

    broken-fixed: complete=False
    section-aware: complete=True

One question proves little. Ship a small labeled set containing policy exceptions, tables, and boundary failures, then measure each candidate splitter on exactly that set.

run-chunk-regression-suite.py

    import re

    def terms(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def retrieve(query: str, chunks: list[dict[str, str]]) -> dict[str, str]:
        query_terms = terms(query)
        return max(chunks, key=lambda chunk: len(query_terms & terms(chunk["text"])))

    chunks = [
        {"id": "damage", "text": "Damaged electronics: report within 48 hours with photos."},
        {"id": "unopened", "text": "Unopened electronics: return within 14 days."},
        {"id": "refund", "text": "Approved refunds return to the original payment method."},
    ]
    cases = [
        ("damaged electronics photos", "damage", "48 hours"),
        ("unopened electronics return", "unopened", "14 days"),
        ("refund payment method", "refund", "original payment"),
    ]

    passed = 0
    for query, expected_id, evidence in cases:
        hit = retrieve(query, chunks)
        ok = hit["id"] == expected_id and evidence in hit["text"]
        passed += int(ok)
        print(query, "PASS" if ok else "FAIL")
    print(f"summary={passed}/{len(cases)}")

Output

    damaged electronics photos PASS
    unopened electronics return PASS
    refund payment method PASS
    summary=3/3

When you replace this lexical baseline with embeddings, the assertions stay useful: retrieve the correct source and retain text sufficient to answer.
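One way to keep those assertions stable is to pass the scoring function into the suite, so that only the scorer changes when a real model arrives. In this sketch, `toy_cosine_score` is a stand-in that compares term-count vectors. It is an assumption for illustration, not an embedding model, and `run_suite` is a hypothetical helper.

```python
import math
import re
from collections import Counter
from typing import Callable

Scorer = Callable[[str, str], float]

def terms(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query: str, text: str) -> float:
    """The lesson's lexical baseline: count of shared terms."""
    return float(len(set(terms(query)) & set(terms(text))))

def toy_cosine_score(query: str, text: str) -> float:
    """Stand-in for an embedding model: cosine over term-count vectors."""
    q, t = Counter(terms(query)), Counter(terms(text))
    dot = sum(q[word] * t[word] for word in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in t.values()))
    return dot / norm if norm else 0.0

def run_suite(score: Scorer, chunks: list[dict[str, str]], cases: list[tuple[str, str, str]]) -> int:
    """Count cases where the top chunk has the expected id and the evidence text."""
    passed = 0
    for query, expected_id, evidence in cases:
        hit = max(chunks, key=lambda chunk: score(query, chunk["text"]))
        passed += int(hit["id"] == expected_id and evidence in hit["text"])
    return passed

chunks = [
    {"id": "damage", "text": "Damaged electronics: report within 48 hours with photos."},
    {"id": "unopened", "text": "Unopened electronics: return within 14 days."},
    {"id": "refund", "text": "Approved refunds return to the original payment method."},
]
cases = [
    ("damaged electronics photos", "damage", "48 hours"),
    ("unopened electronics return", "unopened", "14 days"),
    ("refund payment method", "refund", "original payment"),
]

for name, scorer in [("overlap", overlap_score), ("toy-cosine", toy_cosine_score)]:
    print(f"{name}: {run_suite(scorer, chunks, cases)}/{len(cases)}")
```

Swapping in a real embedding scorer would mean replacing only the function passed to `run_suite`. The labeled cases and the pass criteria stay fixed.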

Search small, answer with enough context

Some questions match a narrow sentence, while a faithful answer needs its surrounding section. A parent-child design indexes small children for search and stores a pointer to the larger source section returned for generation.

[Figure: Diagram showing Clean page record, Damage policy section, Small indexed child, and Customer query.]
expand-child-match-to-parent-context.py

    parents = {
        "damage": "Damaged electronics: report within 48 hours with photos. Agents must attach the order number.",
        "unopened": "Unopened electronics: return within 14 days in original packaging.",
    }
    children = [
        {"text": "48 hours with photos", "parent_id": "damage"},
        {"text": "14 days original packaging", "parent_id": "unopened"},
    ]

    match = next(child for child in children if "photos" in child["text"])
    print(match["text"])
    print(parents[match["parent_id"]])

Output

    48 hours with photos
    Damaged electronics: report within 48 hours with photos. Agents must attach the order number.
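Parent expansion has two practical hazards: several children can point at the same parent, and the expanded text can exceed what the generator should receive. The sketch below handles both, assuming a whitespace word count as a stand-in for a token budget. `expand_to_parents` and the second child for the damage section are hypothetical additions for illustration.

```python
def expand_to_parents(
    hits: list[dict[str, str]],
    parents: dict[str, str],
    max_words: int,
) -> list[str]:
    """Return each matched parent at most once, in hit order, within a word budget."""
    seen: set[str] = set()
    context: list[str] = []
    used = 0
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id in seen:
            continue  # another child already pulled in this section
        seen.add(parent_id)
        cost = len(parents[parent_id].split())
        if used + cost > max_words:
            continue  # skip sections that would overflow the budget
        context.append(parents[parent_id])
        used += cost
    return context

parents = {
    "damage": "Damaged electronics: report within 48 hours with photos. Agents must attach the order number.",
    "unopened": "Unopened electronics: return within 14 days in original packaging.",
}
# Ranked child hits; the first two share a parent.
hits = [
    {"text": "48 hours with photos", "parent_id": "damage"},
    {"text": "attach the order number", "parent_id": "damage"},
    {"text": "14 days original packaging", "parent_id": "unopened"},
]

print(len(expand_to_parents(hits, parents, max_words=20)))  # 1
print(len(expand_to_parents(hits, parents, max_words=30)))  # 2
```

The damage section is 14 words and the unopened section is 9, so a 20-word budget admits only the higher-ranked parent. Whether to skip an oversized parent or truncate it is a design choice to test against the labeled set.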

Child windows inside a longer section also benefit from their section label. Embedding a child with a contextual header is a cheap hypothesis to test against your labeled set, not a promise of improvement.

prepend-section-context.py

    def indexed_text(heading: str, text: str, source: str) -> str:
        return f"Source: {source}\nSection: {heading}\n{text}"

    child = indexed_text(
        heading="Damaged electronics",
        text="Report within 48 hours with photos.",
        source="ShopFlow Returns Policy",
    )
    print(child)

Output

    Source: ShopFlow Returns Policy
    Section: Damaged electronics
    Report within 48 hours with photos.

Escalate only when the baseline exposes a gap

Structure-aware chunks handle many handbooks and policy pages. Some corpora force different choices:

| Failure after measuring baseline | Candidate experiment | What must still be checked |
| --- | --- | --- |
| one paragraph shifts between multiple topics | semantic boundary detection | hard size cap and labeled-query score |
| tiny match lacks surrounding explanation | parent-child expansion | deduplication and generation budget |
| short child is ambiguous without its section | contextual header | retrieval comparison against no-header baseline |
| meaning depends on far-away document context | late chunking | model support, latency, and measured retrieval gain |

Semantic chunking places candidate boundaries where neighboring sentence representations change sharply. The next small lab uses transparent topic vectors, so you can see the boundary without trusting an external embedding service.

find-semantic-boundary-candidates.py

    import math

    def vector(sentence: str) -> list[float]:
        lowered = sentence.lower()
        return [
            float(sum(word in lowered for word in ["damaged", "return", "photos", "hours"])),
            float(sum(word in lowered for word in ["refund", "payment", "approved"])),
        ]

    def cosine(left: list[float], right: list[float]) -> float:
        dot = sum(a * b for a, b in zip(left, right))
        left_norm = math.sqrt(sum(a * a for a in left))
        right_norm = math.sqrt(sum(b * b for b in right))
        return dot / (left_norm * right_norm) if left_norm and right_norm else 0.0

    sentences = [
        "Damaged electronics require photos.",
        "Report damaged items within 48 hours.",
        "Approved refunds return to the original payment method.",
    ]

    for left, right in zip(sentences, sentences[1:]):
        similarity = cosine(vector(left), vector(right))
        print(f"{similarity:.2f}", "boundary" if similarity < 0.50 else "keep together")

Output

    1.00 keep together
    0.32 boundary
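Boundary candidates become chunks only after they are combined with a size limit. The sketch below groups sentences using the similarities printed above and closes a chunk early when a hard word cap would be exceeded. `group_sentences`, the 0.50 threshold reuse, and the cap values are illustrative assumptions.

```python
def group_sentences(
    sentences: list[str],
    similarities: list[float],
    threshold: float,
    max_words: int,
) -> list[str]:
    """Merge neighboring sentences until a similarity drop or the word cap.

    similarities[i] compares sentences[i] with sentences[i + 1].
    """
    if len(similarities) != max(len(sentences) - 1, 0):
        raise ValueError("need one similarity per adjacent sentence pair")
    chunks: list[str] = []
    current: list[str] = []
    for index, sentence in enumerate(sentences):
        boundary = index > 0 and similarities[index - 1] < threshold
        current_words = sum(len(s.split()) for s in current)
        too_long = bool(current) and current_words + len(sentence.split()) > max_words
        if current and (boundary or too_long):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "Damaged electronics require photos.",
    "Report damaged items within 48 hours.",
    "Approved refunds return to the original payment method.",
]
similarities = [1.00, 0.32]  # values from the lab output above

for chunk in group_sentences(sentences, similarities, threshold=0.50, max_words=20):
    print(chunk)
```

With a 20-word cap the two damage sentences stay together and the refund sentence starts a new chunk. A single sentence longer than the cap is kept whole in this sketch, so a production splitter would still need a fallback for that case.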

Late chunking is a different escalation. Günther et al. encode the longer text first and apply chunk pooling after the transformer's contextual token representations have been computed.[3] That means a child representation can carry information from surrounding document text, but it requires a compatible long-context embedding stack and must earn its additional cost in evaluation.

[Figure: Comparison of early chunking versus late chunking: early splits the document before embedding, while late chunking embeds the full document first and pools token ranges afterward.]
Late chunking targets a specific gap: local chunks whose meaning depends on broader document context.
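The difference between the two orders of operation can be seen with toy numbers. In the sketch below, `contextualize` is a deliberately crude stand-in for a transformer: it blends each token vector with the mean of whatever text it can see. The vectors, the 50/50 blend, and all names are assumptions for illustration and do not reproduce the method or models in Günther et al.

```python
# Toy 2-D token vectors: dimension 0 = "damage" topic, dimension 1 = "deadline" topic.
TOY_VECTORS = {
    "damaged": [1.0, 0.0],
    "electronics": [1.0, 0.0],
    "48": [0.0, 1.0],
    "hours": [0.0, 1.0],
}

def mean(vectors: list[list[float]]) -> list[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def contextualize(tokens: list[str]) -> list[list[float]]:
    """Toy stand-in for attention: blend each token with the mean of its visible text."""
    static = [TOY_VECTORS[token] for token in tokens]
    context = mean(static)
    return [[0.5 * a + 0.5 * b for a, b in zip(vec, context)] for vec in static]

tokens = ["damaged", "electronics", "48", "hours"]
spans = [(0, 2), (2, 4)]  # chunk boundaries as token ranges

# Early chunking: split first, so each chunk only sees its own tokens.
early = [mean(contextualize(tokens[start:end])) for start, end in spans]

# Late chunking: contextualize the whole text, then pool each token range.
document = contextualize(tokens)
late = [mean(document[start:end]) for start, end in spans]

print("early:", early)  # early: [[1.0, 0.0], [0.0, 1.0]]
print("late: ", late)   # late:  [[0.75, 0.25], [0.25, 0.75]]
```

In the late version the "48 hours" chunk keeps a nonzero value on the damage dimension, which is the effect the technique is after. Whether a real model delivers a retrieval gain is something only the labeled set can show.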

Ship a chunking decision, not a guess

For the ShopFlow policy corpus, a credible first release is straightforward:

| Design choice | Initial decision | Evidence to collect |
| --- | --- | --- |
| Default boundary | heading-aware sections, recursive fallback | answerable-chunk rate and retrieval regression suite |
| Tables | keep header plus rows together | exact-value questions preserve correct row meaning |
| Overlap | off for complete policy sections; test for fallback prose | index size, duplicate hits, and labeled-query results |
| Metadata | source, locator, heading, checksum | cited answer can return to original record |
| Escalation | parent expansion before semantic or late chunking | failure examples that justify extra complexity |

Your release gate can be encoded as an executable manifest check.

gate-indexable-policy-chunks.py

    from hashlib import sha256

    chunks = [
        {
            "id": "damage",
            "text": "Damaged electronics: report within 48 hours with photos.",
            "source": "returns-policy-v3.pdf",
            "locator": "page=7#damaged-electronics",
            "heading": "Damaged electronics",
            "quality": "ready",
        },
        {
            "id": "broken",
            "text": "48 hours with photos.",
            "source": "returns-policy-v3.pdf",
            "locator": "page=7#fragment",
            "heading": "Fragment",
            "quality": "review",
        },
    ]

    for chunk in chunks:
        chunk["checksum"] = sha256(chunk["text"].encode()).hexdigest()

    required_metadata = ("source", "locator", "heading", "checksum")
    indexable = [
        chunk for chunk in chunks
        if chunk["quality"] == "ready"
        and all(chunk[field] for field in required_metadata)
    ]
    print(f"indexable={[chunk['id'] for chunk in indexable]}")
    print(f"blocked={len(chunks) - len(indexable)}")

Output

    indexable=['damage']
    blocked=1

Mastery check

You now know how to take the clean evidence records from ingestion and turn them into retrieval units that can be evaluated.

Key concepts

  • A chunk is a searchable evidence unit, not an arbitrary slice of text.
  • Fixed windows expose boundary and overlap costs clearly.
  • Structure-aware splitting is a strong baseline when headings or tables carry meaning.
  • Parent-child expansion separates precise search from sufficient answer context.
  • Semantic or late chunking should be escalations justified by measured failures.
  • A labeled retrieval set must check both correct selection and answerable evidence.

Evaluation rubric

  • Foundational: Identifies why a broken chunk can't answer the damaged-electronics question
  • Foundational: Implements fixed windows and explains what overlap repeats
  • Intermediate: Preserves headings, tables, source IDs, and locators in chunk records
  • Intermediate: Compares candidate boundaries on labeled retrieval cases
  • Intermediate: Uses parent-child expansion when narrow retrieval lacks sufficient context
  • Advanced: Chooses semantic or late chunking only after measuring a baseline failure

Common pitfalls

  • A deadline loses its condition: A window retrieves "48 hours" without "Damaged electronics". Fix: split at section boundaries and assert answerability.
  • A table row loses its header: A price or return window becomes ambiguous. Fix: keep column names with rows and test exact-value questions.
  • Overlap creates duplicates without fixing answers: Repeated chunks dominate top results. Fix: measure overlap against both complete evidence and index cost.
  • Chunks can't be cited: Search looks plausible but support can't defend the answer. Fix: retain source, locator, heading, and checksum metadata.
  • An advanced splitter is adopted by reputation: Complexity rises without better results. Fix: preserve a simple baseline and evaluate every escalation on the same labeled set.
Next Step
Continue to LLM Benchmarks & Limitations

You can now build evidence-level retrieval tests for a RAG system. Next you will learn why evaluation sets and scores need the same care before they can support a model-quality claim.

References

[1] Lewis, P., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

[2] LangChain. RecursiveCharacterTextSplitter. 2023.

[3] Günther, M., et al. Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint, 2024.