Medium · RAG & Retrieval

Chunking Strategies

Compare document chunking approaches for RAG: fixed-size, semantic, recursive, and their impact on retrieval quality.

30 min read

When building systems that answer questions using large documents, one of the most important decisions is how to break those documents into smaller pieces. This process, called chunking, directly impacts how accurately your system can find relevant information.

Imagine trying to find a specific recipe in a massive cookbook library. If you just search for "flour", you'll get thousands of pages. If you search for "chocolate cake", you want the specific page with the recipe, not the entire book it's in, nor just the line saying "2 cups flour". You need a chunk of text that is complete enough to be useful but specific enough to be relevant.

In Retrieval-Augmented Generation (RAG) systems, which combine information retrieval with large language models [1], chunking is the process of breaking down large documents into smaller, manageable pieces for retrieval. It is the most impactful design decision in your pipeline. Get it wrong, and even the best embedding model (a system that converts text into numerical vectors that capture its meaning) and Large Language Model (LLM) cannot recover the lost context.

[Diagram] Chunking strategy comparison: fixed-size (simple but breaks sentences), recursive (respects structure), semantic (groups by meaning), and document-aware (uses headings).

The Fundamental Tradeoff

  • Too small: chunks lack context, retrieval misses the point
  • Too large: chunks contain irrelevant noise, embedding quality degrades
  • Sweet spot: typically 256-1024 tokens, depending on the content

Chunking Strategies

1. Fixed-Size Chunking

This is the most common and simplest strategy. It ignores the document's structure and simply slices the text into windows of N tokens. An overlap window is crucial to ensure that sentences or concepts cut off at the boundary of one chunk are preserved in the next. Without overlap, relevant context at chunk boundaries is completely lost, which can critically impair downstream generation.

The following function demonstrates this approach. It takes a raw string, a tokenizer, and size parameters, and returns a list of string chunks. Notice how the loop steps forward by chunk_size - overlap to ensure continuity between adjacent chunks.

python
from typing import Any, List

def fixed_size_chunk(text: str, tokenizer: Any, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """
    Splits text into fixed-size chunks with overlap.
    Assumes tokenizer has .encode() and .decode() methods.
    """
    tokens = tokenizer.encode(text)
    chunks = []

    # Step forward by (chunk_size - overlap) so adjacent chunks share context
    step = chunk_size - overlap

    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))

    return chunks
| Pros | Cons |
| --- | --- |
| Simple, predictable | Breaks mid-sentence/paragraph |
| Uniform embedding quality | No semantic awareness |
| Easy to index | May split related content |

2. Recursive Text Splitting

Instead of slicing text at arbitrary token boundaries, recursive text splitting attempts to respect the natural linguistic structure of the document. This technique, popularized by libraries like LangChain[2], uses a hierarchy of separators to find the most logical place to break a chunk. By prioritizing larger semantic units like paragraphs over smaller ones like sentences, it preserves the author's original flow of thought.

When a document is passed into a recursive splitter, the algorithm first tries to divide the text using the highest-level separator (typically double newlines, which indicate paragraph breaks). If the resulting pieces are smaller than the target chunk size, they are kept as-is. However, if a paragraph is still too large, the algorithm recursively applies the next separator in the hierarchy (such as single newlines or period-space combinations) to that specific piece.

This approach gracefully degrades. It will only split a sentence in half if the sentence itself exceeds the maximum chunk size, which is rare. As a result, the chunks fed into your embedding model are much more likely to be complete, coherent thoughts, which directly improves the quality of the resulting vector and the accuracy of downstream retrieval.

Here is an example using LangChain's built-in splitter. You configure the target size, overlap, and the ordered list of separators. It returns an instantiated splitter object that you can run over your documents to produce well-formed text chunks.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Splits first by double newlines (paragraphs), then single newlines, then sentences
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
)
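If you want to see the mechanics without the dependency, the recursion can be sketched in plain Python. This is a simplified sketch: `max_size` counts characters rather than tokens, the split discards the separators themselves, and it skips the merge-back step a production splitter would perform.

```python
from typing import List

def recursive_split(text: str, separators: List[str], max_size: int = 512) -> List[str]:
    """Recursively split text, trying coarser separators before finer ones."""
    if len(text) <= max_size:
        return [text]
    if not separators:
        # No separators left: hard-split as a last resort
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)

    chunks = []
    for piece in pieces:
        if len(piece) <= max_size:
            chunks.append(piece)
        else:
            # Piece still too large: recurse with the next-finer separator
            chunks.extend(recursive_split(piece, rest, max_size))
    return [c for c in chunks if c.strip()]
```

Note how a paragraph is only ever broken at sentence or word boundaries when it genuinely exceeds the budget, which is the graceful degradation described above.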

3. Semantic Chunking

Group sentences by semantic similarity, a method explored extensively by Kamradt[3]. Standard text splitters are blind to topic transitions. They might split a chunk exactly where the author transitions from describing a problem to proposing the solution, effectively separating the context from the answer. Semantic chunking solves this by treating the document as a continuous stream of semantic meaning.

The algorithm works by moving a sliding window over the document, embedding each sentence individually using a lightweight embedding model. It then calculates the cosine similarity between adjacent sentences. When the similarity drops below a predefined threshold, the algorithm detects a "semantic shift" (a change in topic) and places a chunk boundary there.

While this requires more upfront compute during ingestion, the resulting chunks are highly cohesive. This significantly improves the precision of similarity searches, as the embedding accurately reflects a single, unified concept.

This code illustrates a basic semantic chunking loop. It takes input text and an embedding model, generating embeddings for each sentence. By comparing adjacent sentence embeddings, it groups them into coherent chunks and returns the final list of semantic blocks.

python
from typing import List

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunk(text: str, model: SentenceTransformer, threshold: float = 0.7) -> List[str]:
    """Split text into semantic chunks based on sentence embeddings."""
    # Simple split by punctuation (simplified for example)
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    if not sentences:
        return []

    # 1. Embed all sentences
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    # 2. Iterate through sentences and merge if similar
    for i in range(1, len(sentences)):
        # Similarity between this sentence and the previous one
        similarity = cosine_similarity(
            [embeddings[i - 1]],
            [embeddings[i]],
        )[0][0]

        if similarity > threshold:
            current_chunk.append(sentences[i])
        else:
            # Semantic shift detected; start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append(" ".join(current_chunk))
    return chunks
| Pros | Cons |
| --- | --- |
| Semantically coherent chunks | Slower (requires embedding each sentence) |
| Variable size matches content | Hard to control chunk sizes |
| Better retrieval quality | Threshold tuning required |

4. Document-Aware Chunking

Use document structure (headings, sections, tables) to create meaningful chunks. This approach is central to frameworks like LlamaIndex[4] and preprocessing tools like Unstructured.io[5].

When we process enterprise documents like PDFs or Word files, they contain rich structural metadata: H1 titles, H2 subtitles, bulleted lists, and tables. A naive text splitter destroys this structure, potentially mixing a section about 'API Authentication' with 'Billing Rates'. Document-aware chunking parses the file into an Abstract Syntax Tree (AST) first, identifying the exact boundaries of each section before any text splitting occurs.

If a specific section (like an H3 subsection) is smaller than the target chunk size, it is embedded as a single cohesive unit. If the section is too large, it can be recursively split, but the chunks retain the structural context (the heading hierarchy) as metadata. This ensures that when the LLM receives the chunk, it understands exactly where the information lived within the broader document architecture.

The snippet below simulates a document-aware process. It takes a parsed markdown document and iterates through its sections. It returns a list of chunk dictionaries, where each chunk preserves its structural metadata alongside the text.

python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Section:
    content: str
    heading: str
    level: int

# Mock helper functions for demonstration
def split_by_headings(markdown: str) -> List[Section]:
    """Parse markdown into sections."""
    return []

def count_tokens(text: str) -> int:
    """Return an approximate token count."""
    return len(text.split())

def recursive_split(text: str, max_size: int) -> List[str]:
    """Split text recursively up to max_size."""
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

def document_aware_chunk(markdown: str) -> List[Dict[str, Any]]:
    """Chunk a document while preserving section boundaries."""
    chunks = []
    sections = split_by_headings(markdown)

    for section in sections:
        if count_tokens(section.content) <= 1024:
            # Small sections become single chunks
            chunks.append({
                "text": section.content,
                "heading": section.heading,
                "level": section.level,
            })
        else:
            # Recursively split large sections, but keep the heading context
            sub_chunks = recursive_split(section.content, max_size=512)
            for sub in sub_chunks:
                chunks.append({
                    "text": sub,
                    "heading": section.heading,
                    "level": section.level,
                })
    return chunks

5. Agentic / Proposition Chunking

Instead of arbitrary text boundaries, Proposition Chunking extracts atomic, standalone statements (propositions) from the text using an LLM [6]. When humans write, we use pronouns, coreferences, and compound sentences to make the text flow naturally. However, embedding models struggle with this density. If a chunk says "It was deployed in 2023 and reduced latency by 40%", the vector lacks the crucial noun.

Proposition chunking uses an LLM to parse the raw text and generate a list of atomic facts where all pronouns are resolved to their proper entities. Because each proposition contains exactly one fact, the embedding vector is incredibly sharp and focused. When a user asks a specific factual question, the cosine similarity between the query and the relevant proposition is much higher than it would be with a dense, multi-topic paragraph chunk. The primary drawback is the significant cost and latency introduced by requiring an LLM inference step for every paragraph in your knowledge base during ingestion.

To implement this, you pass your raw text to an LLM along with a strict system prompt. The prompt instructs the model to decompose the text and resolve pronouns. The output is a clean list of atomic facts ready for embedding.

python
PROMPT = """
Decompose the following text into clear, simple propositions.
Each proposition must be a self-contained sentence that makes sense
without the surrounding context. Resolve pronouns (it, he, they) to specific nouns.

Text: "Paris, the capital of France, has 2.1M people and is known for the Eiffel Tower."
Propositions:
1. Paris is the capital of France.
2. Paris has a population of 2.1 million people.
3. Paris is known for the Eiffel Tower.
"""
| Input Text | Extracted Propositions (Chunks) |
| --- | --- |
| "The 2023 roadmap focuses on Q3 deliverables." | "The 2023 roadmap focuses on deliverables for the third quarter." |
| "It also mentions a hiring freeze." | "The 2023 roadmap mentions a hiring freeze." |

Comparison

Choosing the right chunking strategy requires balancing implementation complexity with retrieval quality. The table below summarizes how these methods compare across different dimensions, helping you select the best approach for your specific workload.

[Diagram] Document chunking pipeline: a long document is split into overlapping chunks, each chunk is embedded independently, and metadata like section headers are preserved.
| Strategy | Best For | Chunk Quality | Speed | Complexity |
| --- | --- | --- | --- | --- |
| Fixed-size | Uniform content (logs, code) | Medium | Fast ✅ | Low ✅ |
| Recursive | General text | Good ✅ | Fast | Low |
| Semantic | Varied topics in long docs | Very Good | Slow | Medium |
| Document-aware | Structured docs (manuals, wikis) | Excellent ✅ | Medium | High |
| Proposition | Fact-heavy content | Excellent | Very Slow | Very High |

Chunk Size Guidelines

There is no universally perfect chunk size, as the ideal configuration depends heavily on the structure and density of your documents. The following guidelines provide starting points based on common content types.

| Content Type | Recommended Size | Overlap |
| --- | --- | --- |
| Technical docs | 512-1024 tokens | 50-100 |
| Legal contracts | 256-512 tokens | 50 |
| FAQ/Knowledge base | 128-256 tokens | 0 |
| Code | Function/class level | 0 |
| Chat transcripts | Per-turn | 1-2 turns context |
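These starting points are easy to encode as a lookup your ingestion pipeline reads. The keys below are illustrative names and the chunk sizes are midpoints of the recommended ranges; treat them as defaults to tune against your own evaluation set, not fixed answers.

```python
from typing import Dict

# Starting points drawn from the guidelines above (chunk sizes are range midpoints)
CHUNKING_DEFAULTS: Dict[str, Dict[str, int]] = {
    "technical_docs":  {"chunk_size": 768, "overlap": 75},
    "legal_contracts": {"chunk_size": 384, "overlap": 50},
    "faq":             {"chunk_size": 192, "overlap": 0},
}

def get_chunking_config(content_type: str) -> Dict[str, int]:
    """Fall back to a generic mid-range default for unknown content types."""
    return CHUNKING_DEFAULTS.get(content_type, {"chunk_size": 512, "overlap": 50})
```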

Advanced: Parent-Child Retrieval

One of the fundamental tensions in RAG is that embedding models prefer small, highly specific chunks to achieve high retrieval precision, while generative LLMs need broad, expansive context to synthesize high-quality answers. Parent-child retrieval (also known as auto-merging retrieval or small-to-big retrieval) resolves this tension by decoupling the chunk used for search from the chunk used for generation.

In this architecture, a large "parent" chunk (e.g., an entire section of 2000 tokens) is divided into multiple smaller "child" chunks (e.g., 200 tokens each). Only the child chunks are embedded and indexed in the vector database. However, each child chunk maintains a pointer to its parent.

When a user submits a query, the vector search finds the most relevant child chunks with high precision. Before passing the context to the LLM, the system follows the pointer back to the parent chunk and retrieves the surrounding context. If multiple child chunks from the same parent are retrieved, the system can deduplicate and simply pass the parent. This gives the LLM the full narrative context it needs to generate a complete answer.
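A minimal in-memory version of this pointer scheme can be sketched as follows. The `score` callable is a hypothetical stand-in for vector similarity against an index; a real system would store the child embeddings in a vector database and keep the parent pointer in the payload.

```python
from typing import Callable, Dict, List, Tuple

def build_parent_child_index(
    parents: List[str],
    child_size: int = 200,
) -> Tuple[List[str], List[int]]:
    """Split each parent into word-window children; children[i] points to parent_ids[i]."""
    children, parent_ids = [], []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            children.append(" ".join(words[i:i + child_size]))
            parent_ids.append(pid)  # pointer back to the parent chunk
    return children, parent_ids

def retrieve_parents(
    query: str,
    parents: List[str],
    children: List[str],
    parent_ids: List[int],
    score: Callable[[str, str], float],
    k: int = 3,
) -> List[str]:
    """Search over the small children, but return their (deduplicated) parents."""
    ranked = sorted(range(len(children)), key=lambda i: score(query, children[i]), reverse=True)
    seen: Dict[int, None] = {}
    for idx in ranked[:k]:
        seen.setdefault(parent_ids[idx], None)  # dedupe children of the same parent
    return [parents[pid] for pid in seen]
```

The search step ranks only the precise children; the return step follows the pointer so the LLM sees the full parent context.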


Metadata Enrichment

When a document is shattered into hundreds of pieces, each individual chunk loses its connection to the broader work. A chunk containing the sentence "Restart the server to apply changes" is useless if the system does not know which server or which application the document refers to. Injecting metadata directly into the vector database payload ensures this context survives the chunking process.

Robust metadata schemas should capture source tracking (where did this come from?), structural context (where was this within the document?), and temporal data (when was this written?). This allows the retrieval system to perform hybrid search: filtering vectors by exact metadata matches before calculating semantic similarity. This drastically reduces the search space and eliminates hallucinated answers from outdated documentation.

This function enriches a raw text chunk with contextual metadata. It takes the text, source document metadata, and index position, returning a comprehensive dictionary payload. This payload is what you ultimately store in your vector database.

python
import re
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Document:
    id: str
    title: str
    url: str
    total_chunks: int
    created_at: str
    updated_at: str

    def get_page(self, index: int) -> int:
        """Return the page number for a given chunk index."""
        return 1

def enrich_chunk(chunk_text: str, source_doc: Document,
                 chunk_index: int) -> Dict[str, Any]:
    """Add contextual metadata to improve retrieval and attribution."""

    # Heuristic token count (roughly 1.3 tokens per word)
    token_count = len(chunk_text.split()) * 1.3

    return {
        "text": chunk_text,

        # Source tracking
        "document_id": source_doc.id,
        "document_title": source_doc.title,
        "source_url": source_doc.url,
        "page_number": source_doc.get_page(chunk_index),

        # Structural context
        "chunk_index": chunk_index,
        "total_chunks": source_doc.total_chunks,

        # Temporal metadata
        "created_at": source_doc.created_at,
        "updated_at": source_doc.updated_at,

        # Content properties
        "token_count": int(token_count),
        "has_code": bool(re.search(r'```', chunk_text)),
        "has_table": bool(re.search(r'\|.*\|', chunk_text)),
    }
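With payloads like this in place, the filter-then-search pattern reduces to a metadata predicate applied before any similarity scoring. The sketch below assumes an in-memory list of payloads and a hypothetical `score` callable standing in for vector similarity; a vector database would apply the same predicate natively as a pre-filter.

```python
from typing import Any, Callable, Dict, List

def hybrid_search(
    query: str,
    payloads: List[Dict[str, Any]],
    filters: Dict[str, Any],
    score: Callable[[str, str], float],
    k: int = 5,
) -> List[Dict[str, Any]]:
    """Apply exact metadata filters first, then rank the survivors semantically."""
    # 1. Hard filter: every key in `filters` must match the payload exactly
    candidates = [
        p for p in payloads
        if all(p.get(key) == value for key, value in filters.items())
    ]
    # 2. Semantic ranking over the (much smaller) candidate set
    return sorted(candidates, key=lambda p: score(query, p["text"]), reverse=True)[:k]
```

Because outdated or out-of-scope documents are excluded before ranking, they can never surface as top hits, which is how the hybrid approach suppresses answers from stale documentation.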

Contextual Header Prepending

While storing metadata in the vector database is essential for filtering, it does not help the embedding model understand the chunk itself unless that metadata is injected directly into the text payload before the vector is computed. Contextual header prepending is a lightweight strategy to achieve this. By concatenating the document title and the hierarchical section path (e.g., Title > H1 > H2) to the top of every chunk, you artificially ground the text.

This technique is remarkably powerful for short chunks that contain ambiguous terms. For example, a chunk that simply reads "The API rate limit is 50 requests per minute" might embed generically. But if it is prepended to read "Document: Stripe Billing API > Section: Throttling", the resulting vector will cluster tightly with queries specifically about Stripe's billing rate limits. This approach often yields significant improvements in retrieval recall with almost zero implementation complexity.

Here is a straightforward implementation of contextual prepending. It takes a chunk dictionary containing text and metadata, and returns a new string where the hierarchical context is concatenated directly above the main content.

python
from typing import Any, Dict

def contextualize_chunk(chunk: Dict[str, Any]) -> str:
    """Prepend hierarchical context for better embedding quality."""
    context_parts = []

    if chunk.get("document_title"):
        context_parts.append(f"Document: {chunk['document_title']}")
    if chunk.get("section_heading"):
        context_parts.append(f"Section: {chunk['section_heading']}")

    if not context_parts:
        return chunk["text"]

    context = " > ".join(context_parts)
    return f"{context}\n\n{chunk['text']}"

# Before: "The default port is 8080. Configure it in application.yaml."
# After:  "Document: Deployment Guide > Section: Configuration\n\n
#          The default port is 8080. Configure it in application.yaml."

Late Chunking

In a standard RAG pipeline, the text is split into chunks before being passed to the embedding model. This "early chunking" means the embedding model operates with a severely restricted context window. The attention mechanism within the Transformer (the foundational neural network architecture for most modern LLMs) can only see the tokens inside that specific chunk, leaving it blind to the broader narrative of the document.

Late chunking (introduced by Jina AI, 2024) reverses this order. It passes the entire document (up to the model's context limit, e.g., 8k or 32k tokens) through the embedding model first [7]. Because the Transformer applies bidirectional self-attention across the whole sequence, every token's contextualized representation is influenced by the entire document.

Only after this full-context forward pass does the system split the sequence into boundaries and apply mean pooling to generate the final chunk vectors. The resulting embeddings capture the specific details of the chunk while remaining deeply grounded in the document's overall semantic landscape.
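The pooling step itself is simple to sketch with plain arrays. Here `token_embeddings` stands in for the contextualized token representations a long-context encoder would produce in a single forward pass, and `boundaries` are the chunk spans chosen afterwards:

```python
from typing import List, Tuple

import numpy as np

def late_chunk_pool(
    token_embeddings: np.ndarray,        # shape (num_tokens, dim), full-document pass
    boundaries: List[Tuple[int, int]],   # [(start, end), ...] token spans per chunk
) -> np.ndarray:
    """Mean-pool contextualized token vectors inside each chunk span."""
    vectors = [token_embeddings[start:end].mean(axis=0) for start, end in boundaries]
    return np.stack(vectors)
```

The key difference from early chunking is entirely upstream of this function: each row of `token_embeddings` was already attended to by the whole document, so the pooled chunk vectors inherit document-level context for free.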

| Method | Context Window | Retrieval Quality | Speed |
| --- | --- | --- | --- |
| Traditional chunking + embedding | Chunk only | Baseline | Fast |
| Late chunking | Full document | Improved recall | Slower (long-context inference) |
| Contextual prepending | Chunk + header | +5-15% recall | Same as baseline |

Evaluating Chunking Quality

Without rigorous evaluation, tuning chunk sizes and overlap is entirely guesswork. A robust evaluation framework requires a dataset of synthetic or historical user queries mapped to the exact document IDs that contain the answers. By running these queries through your retrieval pipeline, you can measure the impact of different chunking strategies on metrics like Recall@K.

When evaluating, you will typically see a tradeoff curve. Smaller chunks might increase Recall@5 for highly specific fact-based queries but cause the LLM generation step to fail because the surrounding context was lost. Conversely, massive chunks might reduce your vector search recall but improve the LLM's synthesis quality when the correct chunk is found.

Production systems should use metrics that evaluate both the retrieval step and the final generation step (using frameworks like RAGAS[8]). By measuring end-to-end performance, you can empirically determine the chunk size and overlap that best serves your specific document corpus and query distribution.

The evaluation script below measures retrieval effectiveness. It takes your generated chunks, a set of test queries, and known relevant document IDs, then computes the Recall@5 score by checking if the top retrieved chunks belong to the correct documents.

python
from typing import Any, Dict, List

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunking(
    chunks: List[Dict[str, Any]],
    eval_queries: List[str],
    eval_labels: List[List[str]],
    embedding_model: Any,
) -> Dict[str, float]:
    """Measure how well chunking supports retrieval."""

    # 1. Embed all chunks
    chunk_texts = [c["text"] for c in chunks]
    chunk_embeddings = embedding_model.encode(chunk_texts)

    recall_at_5: List[float] = []

    for query, relevant_doc_ids in zip(eval_queries, eval_labels):
        query_emb = embedding_model.encode([query])  # ensure 2D shape (1, dim)

        # Similarity of the query against every chunk: shape (num_chunks,)
        similarities = cosine_similarity(query_emb, chunk_embeddings)[0]

        # Top-5 chunk indices by similarity
        top_5_indices = similarities.argsort()[-5:][::-1]

        # Which documents do the retrieved chunks come from?
        retrieved_docs = {chunks[i]["document_id"] for i in top_5_indices}

        # Strict recall: is at least one relevant doc retrieved?
        is_hit = len(retrieved_docs & set(relevant_doc_ids)) > 0
        recall_at_5.append(1.0 if is_hit else 0.0)

    avg_tokens = np.mean([len(c["text"].split()) for c in chunks])

    return {
        "recall@5": float(np.mean(recall_at_5)),
        "avg_chunk_tokens": float(avg_tokens),
    }

🎯 Production tip: The most impactful chunking improvement isn't changing the algorithm. It is adding metadata. Teams that add document title, section heading, and contextual prepending to their chunks see 5-15% recall improvement with zero latency cost. Do this before investing in slower semantic chunking or proposition extraction.


Common Pitfalls

  • Using fixed-size chunking for structured documents. You will split tables, code blocks, and lists mid-way.
  • Not including overlap. Without overlap, relevant context at chunk boundaries is lost.
  • Making chunks too small. Chunks under 100 tokens often lack sufficient context for meaningful embedding.
  • Ignoring metadata. Chunks without source information (document title, section heading, page number) are hard to trace and rank.
  • Not evaluating chunking quality. Teams guess at chunk sizes instead of measuring retrieval recall.

Key Takeaways

  1. The tradeoff: Too small vs too large; the sweet spot depends on content (typically 256-512 tokens).
  2. Strategies: Use Recursive for general text, Semantic for messy text, Document-aware for structured docs.
  3. Metadata is critical. Enriching chunks with document titles and headers is a low-hanging fruit for recall improvement.
  4. Advanced methods: Use Parent-Child retrieval for high precision combined with high context. Consider Late Chunking for context-heavy documents.
  5. Evaluation: Measure retrieval recall before optimizing chunking strategies.
Evaluation Rubric

  1. Explains fixed-size chunking with configurable overlap
  2. Describes semantic chunking using embedding similarity
  3. Discusses recursive splitting that respects document structure
  4. Analyzes chunk size tradeoffs: granularity vs context completeness
  5. Proposes metadata strategies for improved retrieval
Common Pitfalls
  • Using one-size-fits-all chunking
  • Ignoring document structure (headers, sections)
  • Not including metadata with chunks
  • Choosing chunk size without empirical testing
Key Concepts Tested

  • Fixed-size chunking with overlap
  • Recursive text splitting
  • Semantic chunking with embeddings
  • Document-aware splitting (Markdown/HTML)
  • Proposition chunking via LLMs
  • Parent-child retrieval architecture
  • Late chunking vs. early chunking
  • Contextual metadata enrichment
  • Retrieval recall evaluation
References

[1] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[2] LangChain (2023). RecursiveCharacterTextSplitter.
[3] Kamradt, G. (2023). 5 Levels of Text Splitting.
[4] Liu, J. (2024). LlamaIndex: A Data Framework for LLM Applications.
[5] Unstructured.io (2024). Unstructured: The Ultimate ETL for LLMs.
[6] Chen, T., et al. (2023). Dense X Retrieval: What Retrieval Granularity Should We Use? arXiv preprint.
[7] Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint.
[8] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.
