Compare document chunking approaches for RAG: fixed-size, semantic, recursive, and their impact on retrieval quality.
When building systems that answer questions using large documents, one of the most important decisions is how to break those documents into smaller pieces. This process, called chunking, directly impacts how accurately your system can find relevant information.
Imagine trying to find a specific recipe in a massive cookbook library. If you just search for "flour", you'll get thousands of pages. If you search for "chocolate cake", you want the specific page with the recipe, not the entire book it's in, nor just the line saying "2 cups flour". You need a chunk of text that is complete enough to be useful but specific enough to be relevant.
In Retrieval-Augmented Generation (RAG) systems, which combine information retrieval with large language models [1], chunking is the process of breaking down large documents into smaller, manageable pieces for retrieval. It is the most impactful design decision in your pipeline. Get it wrong, and even the best embedding model (a system that converts text into numerical vectors that capture its meaning) and Large Language Model (LLM) cannot recover the lost context.
Fixed-size chunking is the most common and simplest strategy. It ignores the document's structure and simply slices the text into windows of tokens. An overlap window is crucial: sentences or concepts cut off at the boundary of one chunk are preserved at the start of the next. Without overlap, relevant context at chunk boundaries is lost entirely, which can critically impair downstream generation.
The following function demonstrates this approach. It takes a raw string, a tokenizer, and size parameters, and returns a list of string chunks. Notice how the loop steps forward by chunk_size - overlap to ensure continuity between adjacent chunks.
```python
from typing import List, Any

def fixed_size_chunk(text: str, tokenizer: Any, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """
    Splits text into fixed-size chunks with overlap.
    Assumes tokenizer has .encode() and .decode() methods.
    """
    tokens = tokenizer.encode(text)
    chunks = []

    # Iterate with a step size that accounts for overlap
    step = chunk_size - overlap

    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))

    return chunks
```
| Pros | Cons |
|---|---|
| Simple, predictable | Breaks mid-sentence/paragraph |
| Uniform embedding quality | No semantic awareness |
| Easy to index | May split related content |
Instead of slicing text at arbitrary token boundaries, recursive text splitting attempts to respect the natural linguistic structure of the document. This technique, popularized by libraries like LangChain [2], uses a hierarchy of separators to find the most logical place to break a chunk. By prioritizing larger semantic units like paragraphs over smaller ones like sentences, it preserves the author's original flow of thought.
When a document is passed into a recursive splitter, the algorithm first tries to divide the text using the highest-level separator (typically double newlines, which indicate paragraph breaks). If the resulting pieces are smaller than the target chunk size, they are kept as-is. However, if a paragraph is still too large, the algorithm recursively applies the next separator in the hierarchy (such as single newlines or period-space combinations) to that specific piece.
This approach gracefully degrades. It will only split a sentence in half if the sentence itself exceeds the maximum chunk size, which is rare. As a result, the chunks fed into your embedding model are much more likely to be complete, coherent thoughts, which directly improves the quality of the resulting vector and the accuracy of downstream retrieval.
Here is an example using LangChain's built-in splitter. You configure the target size, overlap, and the ordered list of separators. It returns an instantiated splitter object that you can run over your documents to produce well-formed text chunks.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Splits first by double newlines (paragraphs), then single newlines, then sentences
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
)
```
Semantic chunking groups sentences by semantic similarity, a method explored extensively by Kamradt [3]. Standard text splitters are blind to topic transitions: they might split a chunk exactly where the author transitions from describing a problem to proposing the solution, effectively separating the context from the answer. Semantic chunking solves this by treating the document as a continuous stream of semantic meaning.
The algorithm works by moving a sliding window over the document, embedding each sentence individually using a lightweight embedding model. It then calculates the cosine similarity between adjacent sentences. When the similarity drops below a predefined threshold, the algorithm detects a "semantic shift" (a change in topic) and places a chunk boundary there.
While this requires more upfront compute during ingestion, the resulting chunks are highly cohesive. This significantly improves the precision of similarity searches, as the embedding accurately reflects a single, unified concept.
This code illustrates a basic semantic chunking loop. It takes input text and an embedding model, generating embeddings for each sentence. By comparing adjacent sentence embeddings, it groups them into coherent chunks and returns the final list of semantic blocks.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List

def semantic_chunk(text: str, model: SentenceTransformer, threshold: float = 0.7) -> List[str]:
    """Split text into semantic chunks based on sentence embeddings."""
    # Simple split by punctuation (simplified for example)
    sentences = [s.strip() for s in text.split('. ') if s]
    if not sentences:
        return []

    # 1. Embed all sentences
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    # 2. Iterate through sentences and merge if similar
    for i in range(1, len(sentences)):
        # Similarity between adjacent sentence embeddings
        similarity = cosine_similarity(
            [embeddings[i - 1]],
            [embeddings[i]]
        )[0][0]

        if similarity > threshold:
            current_chunk.append(sentences[i])
        else:
            # Semantic shift detected; start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append(" ".join(current_chunk))
    return chunks
```
| Pros | Cons |
|---|---|
| Semantically coherent chunks | Slower (requires embedding each sentence) |
| Variable size matches content | Hard to control chunk sizes |
| Better retrieval quality | Threshold tuning required |
Document-aware chunking uses document structure (headings, sections, tables) to create meaningful chunks. This approach is central to frameworks like LlamaIndex [4] and preprocessing tools like Unstructured.io [5].
When we process enterprise documents like PDFs or Word files, they contain rich structural metadata: H1 titles, H2 subtitles, bulleted lists, and tables. A naive text splitter destroys this structure, potentially mixing a section about 'API Authentication' with 'Billing Rates'. Document-aware chunking parses the file into an Abstract Syntax Tree (AST) first, identifying the exact boundaries of each section before any text splitting occurs.
If a specific section (like an H3 subsection) is smaller than the target chunk size, it is embedded as a single cohesive unit. If the section is too large, it can be recursively split, but the chunks retain the structural context (the heading hierarchy) as metadata. This ensures that when the LLM receives the chunk, it understands exactly where the information lived within the broader document architecture.
The snippet below simulates a document-aware process. It takes a parsed markdown document and iterates through its sections. It returns a list of chunk dictionaries, where each chunk preserves its structural metadata alongside the text.
```python
from dataclasses import dataclass
from typing import List, Dict, Any

@dataclass
class Section:
    content: str
    heading: str
    level: int

# Mock helper functions for demonstration
def split_by_headings(markdown: str) -> List[Section]:
    """Parse markdown into sections."""
    return []

def count_tokens(text: str) -> int:
    """Return an approximate token count."""
    return len(text.split())

def recursive_split(text: str, max_size: int) -> List[str]:
    """Split text recursively up to max_size."""
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

def document_aware_chunk(markdown: str) -> List[Dict[str, Any]]:
    """Chunk a document while preserving section boundaries."""
    chunks = []
    sections = split_by_headings(markdown)

    for section in sections:
        if count_tokens(section.content) <= 1024:
            # Small sections become single chunks
            chunks.append({
                "text": section.content,
                "heading": section.heading,
                "level": section.level,
            })
        else:
            # Recursively split large sections, but keep the heading context
            sub_chunks = recursive_split(section.content, max_size=512)
            for sub in sub_chunks:
                chunks.append({
                    "text": sub,
                    "heading": section.heading,
                    "level": section.level,
                })
    return chunks
```
Instead of arbitrary text boundaries, Proposition Chunking extracts atomic, standalone statements (propositions) from the text using an LLM [6]. When humans write, we use pronouns, coreferences, and compound sentences to make the text flow naturally. However, embedding models struggle with this density. If a chunk says "It was deployed in 2023 and reduced latency by 40%", the vector lacks the crucial noun.
Proposition chunking uses an LLM to parse the raw text and generate a list of atomic facts where all pronouns are resolved to their proper entities. Because each proposition contains exactly one fact, the embedding vector is incredibly sharp and focused. When a user asks a specific factual question, the cosine similarity between the query and the relevant proposition is much higher than it would be with a dense, multi-topic paragraph chunk. The primary drawback is the significant cost and latency introduced by requiring an LLM inference step for every paragraph in your knowledge base during ingestion.
To implement this, you pass your raw text to an LLM along with a strict system prompt. The prompt instructs the model to decompose the text and resolve pronouns. The output is a clean list of atomic facts ready for embedding.
python1PROMPT = """ 2Decompose the following text into clear, simple propositions. 3Each proposition must be a self-contained sentence that makes sense 4without the surrounding context. Resolve pronouns (it, he, they) to specific nouns. 5 6Text: "Paris, the capital of France, has 2.1M people and is known for the Eiffel Tower." 7Propositions: 81. Paris is the capital of France. 92. Paris has a population of 2.1 million people. 103. Paris is known for the Eiffel Tower. 11"""
| Input Text | Extracted Propositions (Chunks) |
|---|---|
| "The 2023 roadmap focuses on Q3 deliverables." | "The 2023 roadmap focuses on deliverables for the third quarter." |
| "It also mentions a hiring freeze." | "The 2023 roadmap mentions a hiring freeze." |
Choosing the right chunking strategy requires balancing implementation complexity with retrieval quality. The table below summarizes how these methods compare across different dimensions, helping you select the best approach for your specific workload.
| Strategy | Best For | Chunk Quality | Speed | Complexity |
|---|---|---|---|---|
| Fixed-size | Uniform content (logs, code) | Medium | Fast ✅ | Low ✅ |
| Recursive | General text | Good ✅ | Fast | Low |
| Semantic | Varied topics in long docs | Very Good | Slow | Medium |
| Document-aware | Structured docs (manuals, wikis) | Excellent ✅ | Medium | High |
| Proposition | Fact-heavy content | Excellent | Very Slow | Very High |
There is no universally perfect chunk size, as the ideal configuration depends heavily on the structure and density of your documents. The following guidelines provide starting points based on common content types.
| Content Type | Recommended Size | Overlap |
|---|---|---|
| Technical docs | 512-1024 tokens | 50-100 |
| Legal contracts | 256-512 tokens | 50 |
| FAQ/Knowledge base | 128-256 tokens | 0 |
| Code | Function/class level | 0 |
| Chat transcripts | Per-turn | 1-2 turns context |
One of the fundamental tensions in RAG is that embedding models prefer small, highly specific chunks to achieve high retrieval precision, while generative LLMs need broad, expansive context to synthesize high-quality answers. Parent-child retrieval (also known as auto-merging retrieval or small-to-big retrieval) resolves this tension by decoupling the chunk used for search from the chunk used for generation.
In this architecture, a large "parent" chunk (e.g., an entire section of 2000 tokens) is divided into multiple smaller "child" chunks (e.g., 200 tokens each). Only the child chunks are embedded and indexed in the vector database. However, each child chunk maintains a pointer to its parent.
When a user submits a query, the vector search finds the most relevant child chunks with high precision. Before passing the context to the LLM, the system follows the pointer back to the parent chunk and retrieves the surrounding context. If multiple child chunks from the same parent are retrieved, the system can deduplicate and simply pass the parent. This gives the LLM the full narrative context it needs to generate a complete answer.
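The lookup step described above can be sketched in a few lines. This is a minimal illustration, assuming you already have the top child-chunk IDs from vector search; the dictionaries and the function name are hypothetical stand-ins for your vector store and document store.

```python
from typing import Dict, List

def resolve_parents(
    child_hits: List[str],
    child_to_parent: Dict[str, str],
    parent_texts: Dict[str, str],
) -> List[str]:
    """Map retrieved child chunk IDs to deduplicated parent chunks."""
    seen: List[str] = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # deduplicate siblings from the same parent
            seen.append(parent_id)
    return [parent_texts[pid] for pid in seen]
```

Frameworks like LlamaIndex ship this pattern as "auto-merging retrieval"; the sketch shows only the core ID indirection that decouples search granularity from generation context.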
When a document is shattered into hundreds of pieces, each individual chunk loses its connection to the broader work. A chunk containing the sentence "Restart the server to apply changes" is useless if the system does not know which server or which application the document refers to. Injecting metadata directly into the vector database payload ensures this context survives the chunking process.
Robust metadata schemas should capture source tracking (where did this come from?), structural context (where was this within the document?), and temporal data (when was this written?). This allows the retrieval system to perform hybrid search: filtering vectors by exact metadata matches before calculating semantic similarity. This drastically reduces the search space and eliminates hallucinated answers from outdated documentation.
This function enriches a raw text chunk with contextual metadata. It takes the text, source document metadata, and index position, returning a comprehensive dictionary payload. This payload is what you ultimately store in your vector database.
```python
import re
from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class Document:
    id: str
    title: str
    url: str
    total_chunks: int
    created_at: str
    updated_at: str

    def get_page(self, index: int) -> int:
        """Return the page number for a given chunk index."""
        return 1

def enrich_chunk(chunk_text: str, source_doc: Document,
                 chunk_index: int) -> Dict[str, Any]:
    """Add contextual metadata to improve retrieval and attribution."""

    # Heuristic token count (roughly 1.3 tokens per whitespace-separated word)
    token_count = len(chunk_text.split()) * 1.3

    return {
        "text": chunk_text,

        # Source tracking
        "document_id": source_doc.id,
        "document_title": source_doc.title,
        "source_url": source_doc.url,
        "page_number": source_doc.get_page(chunk_index),

        # Structural context
        "chunk_index": chunk_index,
        "total_chunks": source_doc.total_chunks,

        # Temporal metadata
        "created_at": source_doc.created_at,
        "updated_at": source_doc.updated_at,

        # Content properties
        "token_count": int(token_count),
        "has_code": bool(re.search(r'`{3}', chunk_text)),
        "has_table": bool(re.search(r'\|.*\|', chunk_text)),
    }
```
While storing metadata in the vector database is essential for filtering, it does not help the embedding model understand the chunk itself unless that metadata is injected directly into the text payload before the vector is computed. Contextual header prepending is a lightweight strategy to achieve this. By concatenating the document title and the hierarchical section path (e.g., Title > H1 > H2) to the top of every chunk, you artificially ground the text.
This technique is remarkably powerful for short chunks that contain ambiguous terms. For example, a chunk that simply reads "The API rate limit is 50 requests per minute" might embed generically. But if it is prepended to read "Document: Stripe Billing API > Section: Throttling", the resulting vector will cluster tightly with queries specifically about Stripe's billing rate limits. This approach often yields significant improvements in retrieval recall with almost zero implementation complexity.
Here is a straightforward implementation of contextual prepending. It takes a chunk dictionary containing text and metadata, and returns a new string where the hierarchical context is concatenated directly above the main content.
```python
from typing import Dict, Any

def contextualize_chunk(chunk: Dict[str, Any]) -> str:
    """Prepend hierarchical context for better embedding quality."""
    context_parts = []

    if chunk.get("document_title"):
        context_parts.append(f"Document: {chunk['document_title']}")
    if chunk.get("section_heading"):
        context_parts.append(f"Section: {chunk['section_heading']}")

    context = " > ".join(context_parts)
    return f"{context}\n\n{chunk['text']}"

# Before: "The default port is 8080. Configure it in application.yaml."
# After:  "Document: Deployment Guide > Section: Configuration\n\n
#          The default port is 8080. Configure it in application.yaml."
```
In a standard RAG pipeline, the text is split into chunks before being passed to the embedding model. This "early chunking" means the embedding model operates with a severely restricted context window. The attention mechanism within the Transformer (the foundational neural network architecture for most modern LLMs) can only see the tokens inside that specific chunk, leaving it blind to the broader narrative of the document.
Late chunking (introduced by Jina AI, 2024) reverses this order. It passes the entire document (up to the model's context limit, e.g., 8k or 32k tokens) through the embedding model first [7]. Because the Transformer applies bidirectional self-attention across the whole sequence, every token's contextualized representation is influenced by the entire document.
Only after this full-context forward pass does the system split the sequence into boundaries and apply mean pooling to generate the final chunk vectors. The resulting embeddings capture the specific details of the chunk while remaining deeply grounded in the document's overall semantic landscape.
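The pooling step can be illustrated with plain NumPy. This sketch assumes you already have the per-token contextualized embeddings from a single long-context forward pass (in practice, the model's last hidden state) and the chunk boundaries as token offsets from your splitter; the function name is illustrative.

```python
import numpy as np
from typing import List, Tuple

def late_chunk_pool(
    token_embeddings: np.ndarray,         # shape: (num_tokens, dim)
    boundaries: List[Tuple[int, int]],    # (start, end) token offsets per chunk
) -> List[np.ndarray]:
    """Mean-pool full-document token embeddings over each chunk span."""
    return [token_embeddings[start:end].mean(axis=0) for start, end in boundaries]

# Toy example: 6 "tokens" with 4-dimensional embeddings, split into two chunks
tokens = np.arange(24, dtype=float).reshape(6, 4)
chunk_vectors = late_chunk_pool(tokens, [(0, 3), (3, 6)])
```

Unlike traditional chunking, every token vector pooled here has already attended to the entire document, so even short spans inherit global context.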
| Method | Context Window | Retrieval Quality | Speed |
|---|---|---|---|
| Traditional chunking + embedding | Chunk only | Baseline | Fast |
| Late chunking | Full document | Improved recall | Slower (long context inference) |
| Contextual prepending | Chunk + header | +5-15% recall | Same as baseline |
Without rigorous evaluation, tuning chunk sizes and overlap is entirely guesswork. A robust evaluation framework requires a dataset of synthetic or historical user queries mapped to the exact document IDs that contain the answers. By running these queries through your retrieval pipeline, you can measure the impact of different chunking strategies on metrics like Recall@K.
When evaluating, you will typically see a tradeoff curve. Smaller chunks might increase Recall@5 for highly specific fact-based queries but cause the LLM generation step to fail because the surrounding context was lost. Conversely, massive chunks might reduce your vector search recall but improve the LLM's synthesis quality when the correct chunk is found.
Production systems should use metrics that evaluate both the retrieval step and the final generation step (using frameworks like RAGAS [8]). By measuring end-to-end performance, you can empirically determine the chunk size and overlap that best serve your specific document corpus and query distribution.
The evaluation script below measures retrieval effectiveness. It takes your generated chunks, a set of test queries, and known relevant document IDs, then computes the Recall@5 score by checking if the top retrieved chunks belong to the correct documents.
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from typing import List, Dict, Any

def evaluate_chunking(
    chunks: List[Dict[str, Any]],
    eval_queries: List[str],
    eval_labels: List[List[str]],
    embedding_model: Any
) -> Dict[str, float]:
    """Measure how well chunking supports retrieval."""

    # 1. Embed all chunks
    chunk_texts = [c["text"] for c in chunks]
    chunk_embeddings = embedding_model.encode(chunk_texts)

    recall_hits: List[float] = []

    for query, relevant_doc_ids in zip(eval_queries, eval_labels):
        query_emb = embedding_model.encode([query])  # ensure 2D shape (1, dim)

        # Similarity of the query against every chunk: shape (num_chunks,)
        similarities = cosine_similarity(query_emb, chunk_embeddings)[0]

        # Indices of the top-5 most similar chunks
        top_5_indices = similarities.argsort()[-5:][::-1]

        # Check if any retrieved chunk comes from a relevant document
        retrieved_docs = {chunks[i]["document_id"] for i in top_5_indices}
        is_hit = bool(retrieved_docs & set(relevant_doc_ids))
        recall_hits.append(1.0 if is_hit else 0.0)

    avg_tokens = np.mean([len(c["text"].split()) for c in chunks])

    return {
        "recall@5": float(np.mean(recall_hits)),
        "avg_chunk_tokens": float(avg_tokens),
    }
```
🎯 Production tip: The most impactful chunking improvement isn't changing the algorithm. It is adding metadata. Teams that add document title, section heading, and contextual prepending to their chunks see 5-15% recall improvement with zero latency cost. Do this before investing in slower semantic chunking or proposition extraction.
[1] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[2] LangChain (2023). RecursiveCharacterTextSplitter.
[3] Kamradt, G. (2023). 5 Levels of Text Splitting.
[4] Liu, J. (2024). LlamaIndex: A Data Framework for LLM Applications.
[5] Unstructured.io (2024). Unstructured: The Ultimate ETL for LLMs.
[6] Chen, T., et al. (2023). Dense X Retrieval: What Retrieval Granularity Should We Use? arXiv preprint.
[7] Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint.
[8] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.