Deep dive into document chunking approaches for RAG: fixed-size, semantic, recursive, and their impact on retrieval quality.
Imagine searching for a specific recipe in a massive cookbook library. If you search for "chocolate cake," you don't want the entire book, just the page with the recipe. But you also don't want just the line "2 cups of flour." You need a "chunk" of text that's complete enough to be useful but specific enough to be relevant. This is the core challenge of chunking: breaking down large documents into smaller pieces for an AI to search through.
In Retrieval-Augmented Generation (RAG) systems, chunking is one of the most consequential design decisions you'll make. [1] Get it wrong, and even the smartest Large Language Model (LLM) can't find the right information. This is compounded by the "lost in the middle" problem, where LLMs tend to ignore information buried in the middle of long contexts. [2] Both the embedding model (which turns text into numerical representations) and the LLM depend on well-formed chunks to recover the right context and answer questions accurately.
When configuring any chunking strategy, engineers face a fundamental tradeoff between precision and context. If chunks are too small, they become highly precise but lack the surrounding narrative. The retrieval system might find an exact match for a keyword, but the resulting chunk won't contain enough information for the LLM to understand the broader implications or formulate a helpful response.
On the other hand, if chunks are too large, they capture plenty of context but introduce irrelevant noise. An embedding model averages the semantic meaning of all tokens in a chunk into a single vector. If a chunk contains 2,000 tokens covering five different topics, its embedding becomes diluted and generic, making it difficult to retrieve when a user asks a specific question about just one of those topics.
The ideal "sweet spot" for chunk size typically falls between 256 and 1024 tokens, but this depends heavily on the content's nature and the specific embedding model. Dense, highly technical documents often require larger chunks to preserve complex relationships, while conversational logs can be effectively chunked in much smaller, turn-based segments.
There are several approaches to breaking down documents, ranging from naive character splitting to advanced semantic analysis. Selecting the right method depends on the document's structure, the embedding model's context window, and the acceptable processing latency.
Fixed-size chunking is the simplest and most common strategy. It ignores the document's structure and simply slices the text into windows of tokens. An overlap window is crucial: it ensures that sentences or concepts cut off at a chunk boundary are preserved in the next chunk. Without overlap, relevant context at chunk boundaries is completely lost, which can significantly impair downstream generation.
The following function demonstrates this approach. It takes a raw string, a tokenizer, and size parameters, and returns a list of string chunks. Notice how the loop steps forward by chunk_size - overlap to ensure continuity between adjacent chunks.
```python
from typing import List, Any

def fixed_size_chunk(text: str, tokenizer: Any, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """
    Splits text into fixed-size chunks with overlap.
    Assumes tokenizer has .encode() and .decode() methods.
    """
    tokens = tokenizer.encode(text)
    chunks = []

    # Iterate with a step size that accounts for overlap
    step = chunk_size - overlap

    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))

    return chunks
```
| Pros | Cons |
|---|---|
| Simple, predictable | Breaks mid-sentence/paragraph |
| Uniform embedding quality | No semantic awareness |
| Easy to index | May split related content |
Instead of slicing text at arbitrary token boundaries, recursive text splitting attempts to respect the natural linguistic structure of the document. This technique, widely adopted by libraries like LangChain[3], uses a hierarchy of separators to find the most logical place to break a chunk. By prioritizing larger semantic units like paragraphs over smaller ones like sentences, this method preserves the author's original flow of thought.
When a document is passed into a recursive splitter, the algorithm first tries to divide the text using the highest-level separator (typically double newlines, which indicate paragraph breaks). If the resulting pieces are smaller than the target chunk size, they are kept as-is. However, if a paragraph is still too large, the algorithm recursively applies the next separator in the hierarchy (such as single newlines or period-space combinations) to that specific piece.
This approach falls back gradually. It only splits a sentence in half if the sentence itself exceeds the maximum chunk size, which is rare. As a result, the chunks fed into the embedding model are much more likely to be complete, coherent thoughts. This directly improves the quality of the resulting vector and the accuracy of downstream retrieval.
Here's an example using LangChain's built-in splitter. You configure the target size, overlap, and the ordered list of separators. It returns an instantiated splitter object that you can run over your documents to produce well-formed text chunks.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Splits first by double newlines (paragraphs), then single newlines, then sentences
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
)
```
Semantic chunking groups sentences by semantic similarity, a technique explored extensively by Kamradt[4]. Standard text splitters are blind to topic transitions. They might split a chunk exactly where the author transitions from describing a problem to proposing the solution, effectively separating the context from the answer. Semantic chunking solves this by treating the document as a continuous stream of semantic meaning.
The algorithm works by moving a sliding window over the document, embedding each sentence individually using a lightweight embedding model. It then calculates the cosine similarity between adjacent sentences. When the similarity drops below a predefined threshold, the algorithm detects a "semantic shift" (a change in topic) and places a chunk boundary there.
While this requires more upfront compute during ingestion, the resulting chunks are highly cohesive. This significantly improves the precision of similarity searches, as the embedding accurately reflects a single, unified concept.
This code illustrates a basic semantic chunking loop. It takes input text and an embedding model, generating embeddings for each sentence. By comparing adjacent sentence embeddings, it groups them into coherent chunks and returns the final list of semantic blocks.
```python
from typing import List

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunk(text: str, model: SentenceTransformer, threshold: float = 0.7) -> List[str]:
    """Split text into semantic chunks based on sentence embeddings."""
    # Simple split by punctuation (simplified for example)
    sentences = [s.strip() for s in text.split('. ') if s]
    if not sentences:
        return []

    # 1. Embed all sentences
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    # 2. Iterate through sentences and merge if similar
    for i in range(1, len(sentences)):
        # Similarity between the previous and current sentence embeddings
        similarity = cosine_similarity(
            [embeddings[i - 1]],
            [embeddings[i]]
        )[0][0]

        if similarity > threshold:
            current_chunk.append(sentences[i])
        else:
            # Semantic shift detected; start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append(" ".join(current_chunk))
    return chunks
```
| Pros | Cons |
|---|---|
| Semantically coherent chunks | Slower (requires embedding each sentence) |
| Variable size matches content | Hard to control chunk sizes |
| Better retrieval quality | Threshold tuning required |
Document-aware chunking uses the document's structure (headings, sections, tables) to create meaningful chunks. This approach is a core component of frameworks like LlamaIndex[5] and preprocessing tools like Unstructured.io[6].
Enterprise documents like PDFs or Word files often contain rich structural metadata: H1 titles, H2 subtitles, bulleted lists, and tables. A naive text splitter destroys this structure, potentially mixing sections about 'API Authentication' with 'Billing Rates'. Document-aware chunking first parses the file into an Abstract Syntax Tree (AST), a tree representation of the document's structure. This allows it to identify the exact boundaries of each section before any text splitting occurs.
If a specific section (like an H3 subsection) is smaller than the target chunk size, it's embedded as a single cohesive unit. If the section is too large, it can be recursively split, but the chunks retain the structural context (the heading hierarchy) as metadata. This ensures that when the LLM receives the chunk, it understands exactly where the information lived within the broader document architecture.
The snippet below simulates a document-aware process. It takes a parsed markdown document and iterates through its sections. It returns a list of chunk dictionaries, where each chunk preserves its structural metadata alongside the text.
```python
from dataclasses import dataclass
from typing import List, Dict, Any

@dataclass
class Section:
    content: str
    heading: str
    level: int

# Mock helper functions for demonstration
def split_by_headings(markdown: str) -> List[Section]:
    """Parse markdown into sections."""
    return []

def count_tokens(text: str) -> int:
    """Return an approximate token count."""
    return len(text.split())

def recursive_split(text: str, max_size: int) -> List[str]:
    """Split text recursively up to max_size."""
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

def document_aware_chunk(markdown: str) -> List[Dict[str, Any]]:
    """Chunk a document while preserving section boundaries."""
    chunks = []
    sections = split_by_headings(markdown)

    for section in sections:
        if count_tokens(section.content) <= 1024:
            # Small sections become single chunks
            chunks.append({
                "text": section.content,
                "heading": section.heading,
                "level": section.level,
            })
        else:
            # Split large sections into smaller chunks, preserving heading context
            sub_chunks = recursive_split(section.content, max_size=512)
            for sub in sub_chunks:
                chunks.append({
                    "text": sub,
                    "heading": section.heading,
                    "level": section.level,
                })
    return chunks
```
Instead of relying on arbitrary text boundaries, proposition chunking uses an LLM to extract atomic, standalone statements (propositions) from text. [7] When people write, they use pronouns, coreferences, and compound sentences to keep the text flowing. However, embedding models can struggle with this density. If a chunk says "It was deployed in 2023 and reduced latency by 40%", the resulting vector has no way to recover what "it" refers to.
Proposition chunking uses an LLM to parse the raw text and generate a list of atomic facts where all pronouns are resolved to their proper entities. Because each proposition contains exactly one fact, the embedding vector is highly precise and focused. When a user asks a specific factual question, the cosine similarity between the query and the relevant proposition is much higher than it would be with a dense, multi-topic paragraph chunk. The main drawback is the significant cost and latency introduced by requiring an LLM inference step for every paragraph in the knowledge base during ingestion.
To implement this, you pass your raw text to an LLM with a system prompt that instructs the model to decompose the text and resolve pronouns. The prompt below demonstrates the expected format. The LLM takes the input text and outputs a clean list of atomic facts ready for embedding.
```python
PROPOSITION_PROMPT = """
Decompose the following text into clear, simple propositions.
Each proposition must be a self-contained sentence that makes sense
without the surrounding context. Resolve pronouns (it, he, they) to specific nouns.

Text: {text}

Propositions:
"""

# Example:
# Input: "Paris, the capital of France, has 2.1M people and is known for the Eiffel Tower."
# Output:
# 1. Paris is the capital of France.
# 2. Paris has a population of 2.1 million people.
# 3. Paris is known for the Eiffel Tower.
```
| Original Text | Extracted Propositions (Chunks) |
|---|---|
| "The 2023 roadmap focuses on Q3 deliverables. It also mentions a hiring freeze." | 1. The 2023 roadmap focuses on deliverables for the third quarter. 2. The 2023 roadmap mentions a hiring freeze. |
Choosing the right chunking strategy requires balancing implementation complexity with retrieval quality. In an enterprise setting, you'll often employ multiple chunking strategies simultaneously, routing different document types to specialized processing pipelines. For example, raw application logs might be perfectly suited for fast, fixed-size chunking, while high-value policy documents justify the computational cost of proposition chunking.
The table below summarizes how these methods compare across different dimensions, helping engineers select the best approach for a specific workload. When making this decision, factor in the latency constraints of the ingestion pipeline and the computational budget available for embedding models, as more advanced techniques like semantic chunking require significantly more processing power.
| Strategy | Best For | Chunk Quality | Speed | Complexity |
|---|---|---|---|---|
| Fixed-size | Uniform content (logs, code) | Medium | Fast ✅ | Low ✅ |
| Recursive | General text | Good ✅ | Fast | Low |
| Semantic | Varied topics in long docs | Very Good | Slow | Medium |
| Document-aware | Structured docs (manuals, wikis) | Excellent ✅ | Medium | High |
| Proposition | Fact-heavy content | Excellent | Very Slow | Very High |
There's no universally perfect chunk size. The ideal configuration depends heavily on your document structure, content density, and the embedding model's optimal input length. Small chunks result in highly granular embeddings that precisely match specific queries, but they might lack the broader context necessary for the language model to synthesize a comprehensive answer. Large chunks capture extensive context but risk diluting the specific information a user is searching for, leading to lower retrieval scores for precise facts.
The guidelines below provide starting points based on common content types. For instance, technical documentation often contains code snippets, diagrams, and lengthy explanations that require larger chunks (512-1024 tokens) to preserve the complex relationships between concepts. Conversely, FAQ or knowledge base articles are typically self-contained and concise, meaning smaller chunks (128-256 tokens) are highly effective and reduce the noise passed to the LLM.
When configuring overlap, a good rule of thumb is to use 10-20% of your target chunk size. This ensures that concepts or sentences split at a chunk boundary remain connected. However, excessive overlap wastes storage space in the vector database and can lead to duplicated content dominating retrieval results, which may bias the LLM's final response.
| Content Type | Recommended Size | Overlap |
|---|---|---|
| Technical docs | 512-1024 tokens | 50-100 |
| Legal contracts | 256-512 tokens | 50 |
| FAQ/Knowledge base | 128-256 tokens | 0 |
| Code | Function/class level | 0 |
| Chat transcripts | Per-turn | 1-2 turns context |
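These guidelines can be encoded as a simple preset lookup at ingestion time. The names and numbers below are illustrative starting points taken from the ranges in the table (with overlap following the 10-20% rule of thumb), not fixed rules:

```python
from typing import Dict

# Illustrative presets mirroring the table above; tune against your own eval set.
CHUNKING_PRESETS: Dict[str, Dict[str, int]] = {
    "technical_docs": {"chunk_size": 768, "overlap": 100},
    "legal_contracts": {"chunk_size": 384, "overlap": 50},
    "faq": {"chunk_size": 192, "overlap": 0},
}

def get_chunking_preset(content_type: str) -> Dict[str, int]:
    """Return a starting-point config, falling back to a generic default."""
    return CHUNKING_PRESETS.get(content_type, {"chunk_size": 512, "overlap": 50})
```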
A fundamental tension in RAG is that embedding models prefer small, highly specific chunks for high retrieval precision, while generative LLMs need broad, expansive context to synthesize high-quality answers. Parent-child retrieval (also known as auto-merging retrieval or small-to-big retrieval) resolves this tension by decoupling the chunk used for search from the chunk used for generation.
In this architecture, a large "parent" chunk (e.g., an entire section of 2000 tokens) is divided into multiple smaller "child" chunks (e.g., 200 tokens each). Only the child chunks are embedded and indexed in the vector database. Each child chunk maintains a pointer to its parent. [5]
When a user submits a query, vector search finds the most relevant child chunks with high precision. Before passing context to the LLM, the system follows the pointer back to the parent chunk and retrieves the surrounding context. If multiple child chunks from the same parent are retrieved, the system can deduplicate and pass the parent. This provides the LLM with the full narrative context it needs to generate a complete answer.
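The bookkeeping behind parent-child retrieval can be sketched as an in-memory toy. The word-based splitting and function names here are illustrative; a real system would embed the child texts in a vector database, and `hits` would come back from a vector search:

```python
from typing import Dict, List

def split_into_children(parent_text: str, child_size: int) -> List[str]:
    """Naive word-based split of one parent chunk into small child chunks."""
    words = parent_text.split()
    return [" ".join(words[i:i + child_size]) for i in range(0, len(words), child_size)]

def build_parent_child_index(parents: List[str], child_size: int = 200) -> List[Dict]:
    """Index only the small child chunks; each keeps a pointer to its parent."""
    index = []
    for parent_id, parent_text in enumerate(parents):
        for child_text in split_into_children(parent_text, child_size):
            index.append({"text": child_text, "parent_id": parent_id})
    return index

def expand_to_parents(hits: List[Dict], parents: List[str]) -> List[str]:
    """Follow parent pointers and deduplicate before handing context to the LLM."""
    seen = set()
    context = []
    for hit in hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            context.append(parents[hit["parent_id"]])
    return context
```

Only the child texts are ever embedded; the parents live in ordinary document storage and are fetched by ID at query time.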
When a document is broken into many pieces, each individual chunk loses its connection to the broader work. A chunk containing the sentence "Restart the server to apply changes" is less useful if the system doesn't know which server or application the document refers to. Injecting metadata directly into the vector database payload ensures this context survives the chunking process.
Robust metadata schemas should capture source tracking (e.g., where did this come from?), structural context (e.g., where was this within the document?), and temporal data (e.g., when was this written?). This enables the retrieval system to perform hybrid search: filtering vectors by exact metadata matches before calculating semantic similarity. This significantly reduces the search space and helps prevent hallucinated answers from outdated documentation.
This function enriches a raw text chunk with contextual metadata. It takes the text, source document metadata, and index position, returning a comprehensive dictionary payload. This payload is what you'll ultimately store in your vector database.
```python
import re
from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class Document:
    id: str
    title: str
    url: str
    total_chunks: int
    created_at: str
    updated_at: str

    def get_page(self, index: int) -> int:
        """Return the page number for a given chunk index."""
        return 1

def enrich_chunk(chunk_text: str, source_doc: Document,
                 chunk_index: int) -> Dict[str, Any]:
    """Add contextual metadata to improve retrieval and attribution."""

    # Heuristic for token count (approx. 1.3 tokens per word)
    token_count = len(chunk_text.split()) * 1.3

    return {
        "text": chunk_text,

        # Source tracking
        "document_id": source_doc.id,
        "document_title": source_doc.title,
        "source_url": source_doc.url,
        "page_number": source_doc.get_page(chunk_index),

        # Structural context
        "chunk_index": chunk_index,
        "total_chunks": source_doc.total_chunks,

        # Temporal metadata
        "created_at": source_doc.created_at,
        "updated_at": source_doc.updated_at,

        # Content properties
        "token_count": int(token_count),
        "has_code": bool(re.search(r'```', chunk_text)),
        "has_table": bool(re.search(r'\|.*\|', chunk_text)),
    }
```
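Once payloads like this are stored, the hybrid-search pattern (filter on exact metadata matches, then score survivors by similarity) is easy to picture. This is an in-memory toy with an illustrative function name; production vector databases expose the same idea as filter parameters on the query:

```python
from typing import Any, Dict, List

import numpy as np

def filtered_search(query_vec: np.ndarray,
                    payloads: List[Dict[str, Any]],
                    vectors: np.ndarray,
                    where: Dict[str, Any],
                    top_k: int = 5) -> List[Dict[str, Any]]:
    """Apply exact metadata filters first, then rank survivors by cosine similarity."""
    # 1. Metadata filtering shrinks the candidate set before any vector math
    keep = [i for i, p in enumerate(payloads)
            if all(p.get(k) == v for k, v in where.items())]
    if not keep:
        return []

    # 2. Cosine similarity against the filtered subset only
    subset = vectors[keep]
    sims = subset @ query_vec / (
        np.linalg.norm(subset, axis=1) * np.linalg.norm(query_vec) + 1e-9)

    ranked = np.argsort(sims)[::-1][:top_k]
    return [payloads[keep[i]] for i in ranked]
```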
While storing metadata in the vector database is essential for filtering, it doesn't help the embedding model understand the chunk itself unless that metadata is injected directly into the text payload before vector computation. Contextual header prepending is a lightweight strategy to achieve this. By concatenating the document title and the hierarchical section path (e.g., Title > H1 > H2) to the top of every chunk, engineers can artificially ground the text.
This technique is very effective for short chunks that contain ambiguous terms. For example, a chunk that simply reads "The API rate limit is 50 requests per minute" might embed generically. But if it's prepended to read "Document: Stripe Billing API > Section: Throttling", the resulting vector will cluster tightly with queries specifically about Stripe's billing rate limits. This approach often yields significant improvements in retrieval recall with minimal implementation complexity.
Here's a straightforward implementation of contextual prepending. It takes a chunk dictionary containing text and metadata, and returns a new string where the hierarchical context is concatenated directly above the main content. This final string is what gets passed to the embedding model.
```python
from typing import Dict, Any

def contextualize_chunk(chunk: Dict[str, Any]) -> str:
    """Prepend hierarchical context for better embedding quality."""
    context_parts = []

    if chunk.get("document_title"):
        context_parts.append(f"Document: {chunk['document_title']}")
    if chunk.get("section_heading"):
        context_parts.append(f"Section: {chunk['section_heading']}")

    context = " > ".join(context_parts)
    return f"{context}\n\n{chunk['text']}"

# Before: "The default port is 8080. Configure it in application.yaml."
# After:  "Document: Deployment Guide > Section: Configuration\n\nThe default port is 8080. Configure it in application.yaml."
```
In a standard RAG pipeline, text is split into chunks before being passed to the embedding model. This "early chunking" means the embedding model operates with a restricted context window. The attention mechanism within the Transformer (the foundational neural network architecture for most modern LLMs) can only see the tokens inside that specific chunk, making it blind to the broader narrative of the document.
Late chunking (introduced by Günther et al., 2024) reverses this order. It passes the entire document through the embedding model first, leveraging the model's full context window. [8] Because the Transformer applies bidirectional self-attention across the whole sequence, every token's contextualized representation is influenced by the entire document.
Only after this full-context forward pass does the system split the sequence into boundaries and apply mean pooling to generate the final chunk vectors. The resulting embeddings capture the specific details of the chunk while remaining deeply grounded in the document's overall semantic structure.
The tradeoff is latency. Embedding a 10,000-token document costs significantly more than embedding five 512-token chunks, even though the total token count is identical. Late chunking only makes sense when your embedding model has a context window substantially larger than your target chunk size, and when document-level coherence matters more than raw ingestion speed.
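The pooling step at the heart of late chunking is simple once the contextualized token embeddings exist. This sketch assumes a `(num_tokens, dim)` array has already come out of a long-context embedding model's forward pass; the random array and the boundaries below are illustrative stand-ins:

```python
from typing import List, Tuple

import numpy as np

def late_chunk_pool(token_embeddings: np.ndarray,
                    boundaries: List[Tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualized token embeddings over each chunk's token span.

    token_embeddings: (num_tokens, dim) output of a full-document forward pass.
    boundaries: (start, end) token offsets marking each chunk.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in boundaries])

# Example: a toy 10-token "document" split into two chunk spans
tokens = np.random.rand(10, 4)  # stand-in for real contextualized embeddings
chunk_vectors = late_chunk_pool(tokens, [(0, 6), (6, 10)])
# chunk_vectors.shape == (2, 4)
```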
| Method | Context Used | Retrieval Quality | Speed |
|---|---|---|---|
| Early chunking | Chunk only | Baseline | Fast |
| Late chunking | Full document | Improved recall | Slower (long-context inference) |
| Contextual prepending | Chunk + header | Improved recall | Same as baseline |
Without rigorous evaluation, tuning chunk sizes and overlap becomes guesswork. A robust evaluation framework requires a dataset of synthetic or historical user queries mapped to the exact document IDs containing the answers. By running these queries through the retrieval pipeline, engineers can measure the impact of different chunking strategies on metrics like Recall@K (the percentage of queries where the relevant document appears in the top K results).
When evaluating, you'll typically see a tradeoff curve. Smaller chunks might increase Recall@5 for highly specific fact-based queries, but they can cause the LLM generation step to fail because the surrounding context was lost. Conversely, massive chunks might reduce vector search recall but improve the LLM's synthesis quality when the correct chunk is found.
Production systems should use metrics that evaluate both the retrieval step and the final generation step (using frameworks like RAGAS, an automated evaluation suite for RAG pipelines[9]). By measuring end-to-end performance, you can empirically determine the chunk size and overlap that best serves a specific document corpus and query distribution.
The evaluation script below measures retrieval effectiveness. It takes generated chunks, a set of test queries, and known relevant document IDs, then computes the Recall@5 score by checking if the top retrieved chunks belong to the correct documents. This helps empirically determine the optimal chunk size for a given workload.
```python
from typing import List, Dict, Any

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunking(
    chunks: List[Dict[str, Any]],
    eval_queries: List[str],
    eval_labels: List[List[str]],
    embedding_model: Any
) -> Dict[str, float]:
    """Measure how well chunking supports retrieval."""

    # 1. Embed all chunks
    chunk_texts = [c["text"] for c in chunks]
    chunk_embeddings = embedding_model.encode(chunk_texts)

    metrics: Dict[str, List[float]] = {
        "recall@5": [],
    }

    for query, relevant_doc_ids in zip(eval_queries, eval_labels):
        query_emb = embedding_model.encode([query])  # (1, dim)

        # Similarity against every chunk; flatten (1, num_chunks) to 1-D
        similarities = cosine_similarity(query_emb, chunk_embeddings)[0]

        # Get top-5 chunk indices by similarity
        top_5_indices = similarities.argsort()[-5:][::-1]

        # Check if any retrieved chunk comes from a relevant document
        retrieved_docs = {chunks[i]["document_id"] for i in top_5_indices}

        # Strict recall: is at least one relevant doc retrieved?
        is_hit = len(retrieved_docs.intersection(set(relevant_doc_ids))) > 0
        metrics["recall@5"].append(1.0 if is_hit else 0.0)

    avg_tokens = np.mean([len(c["text"].split()) for c in chunks])

    return {
        "recall@5": float(np.mean(metrics["recall@5"])),
        "avg_chunk_tokens": float(avg_tokens),
    }
```
🎯 Production tip: The most impactful chunking improvement isn't changing the algorithm. It's adding metadata. Teams that add document title, section heading, and contextual prepending to their chunks often see meaningful recall improvements with minimal latency cost. This should be done before investing in slower semantic chunking or proposition extraction.
Designing a robust chunking strategy is often an iterative process, and teams frequently encounter several systemic issues when deploying RAG applications.
One frequent issue is ignoring document structure: naive fixed-size splitting cuts tables, code blocks, and lists mid-way. When enterprise data lakes contain a mix of raw text, structured tables, codebase files, and PDF manuals, such a splitter destroys the semantic relationship between elements, rendering the resulting chunks far less useful for the downstream LLM.
Without overlap, relevant context at chunk boundaries is lost, separating dependent clauses from their subjects. A sentence or concept that happens to cross a chunk boundary is fragmented, and the LLM may receive a chopped-off sentence, failing to construct a logical answer.
Chunks under 100 tokens often lack sufficient context for meaningful embedding, which can cause poor retrieval precision. They capture isolated facts but lack the necessary surrounding context to make those facts meaningful, leading to generic or ambiguous embeddings.
Chunks without source information (document title, section heading, page number) are difficult to trace and rank, and can't be used for exact-match filtering. When a chunk is retrieved, the system can't trace it back to its origin, making it impossible to perform hybrid filtering or provide proper citations to the user.
Finally, many teams guess at chunk sizes instead of measuring retrieval recall. Relying on intuition rather than empirical evaluation leads to sub-optimal recall that could often be fixed through systematic testing on a golden dataset.
[1] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[2] Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL.
[3] LangChain (2023). RecursiveCharacterTextSplitter.
[4] Kamradt, G. (2023). 5 Levels of Text Splitting.
[5] Liu, J. (2024). LlamaIndex: A Data Framework for LLM Applications.
[6] Unstructured.io (2024). Unstructured: The Ultimate ETL for LLMs.
[7] Chen, T., et al. (2023). Dense X Retrieval: What Retrieval Granularity Should We Use? arXiv preprint.
[8] Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint.
[9] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.