Compare document chunking approaches for RAG: fixed-size, semantic, recursive, and their impact on retrieval quality.
When building systems that answer questions using large documents, one of the most important decisions is how to break those documents into smaller pieces. This process, called chunking, directly impacts how accurately your system can find relevant information.
Imagine trying to find a specific recipe in a massive cookbook library. If you just search for "flour", you'll get thousands of pages. If you search for "chocolate cake", you want the specific page with the recipe, not the entire book it's in, nor just the line saying "2 cups flour". You need a chunk of text that is complete enough to be useful but specific enough to be relevant.
In Retrieval-Augmented Generation (RAG) systems, which combine information retrieval with large language models [1], chunking is the process of breaking down large documents into smaller, manageable pieces for retrieval. It is the most impactful design decision in your pipeline. Get it wrong, and even the best embedding model (a system that converts text into numerical vectors that capture its meaning) and Large Language Model (LLM) cannot recover the lost context.
Fixed-size chunking is the most common and simplest strategy. It ignores the document's structure and simply slices the text into windows of tokens. An overlap window is crucial: sentences or concepts cut off at the boundary of one chunk are preserved at the start of the next. Without overlap, relevant context at chunk boundaries is lost entirely, which can critically impair downstream generation.
The following function demonstrates this approach. It takes a raw string, a tokenizer, and size parameters, and returns a list of string chunks. Notice how the loop steps forward by chunk_size - overlap to ensure continuity between adjacent chunks.
```python
from typing import List, Any

def fixed_size_chunk(text: str, tokenizer: Any, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """
    Splits text into fixed-size chunks with overlap.
    Assumes tokenizer has .encode() and .decode() methods.
    """
    tokens = tokenizer.encode(text)
    chunks = []

    # Iterate with a step size that accounts for overlap
    step = chunk_size - overlap

    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))

    return chunks
```
| Pros | Cons |
|---|---|
| Simple, predictable | Breaks mid-sentence/paragraph |
| Uniform embedding quality | No semantic awareness |
| Easy to index | May split related content |
Instead of slicing text at arbitrary token boundaries, recursive text splitting attempts to respect the natural linguistic structure of the document. This technique, popularized by libraries like LangChain [2], uses a hierarchy of separators to find the most logical place to break a chunk. By prioritizing larger semantic units like paragraphs over smaller ones like sentences, it preserves the author's original flow of thought.
When a document is passed into a recursive splitter, the algorithm first tries to divide the text using the highest-level separator (typically double newlines, which indicate paragraph breaks). If the resulting pieces are smaller than the target chunk size, they are kept as-is. However, if a paragraph is still too large, the algorithm recursively applies the next separator in the hierarchy (such as single newlines or period-space combinations) to that specific piece.
This approach gracefully degrades. It will only split a sentence in half if the sentence itself exceeds the maximum chunk size, which is rare. As a result, the chunks fed into your embedding model are much more likely to be complete, coherent thoughts, which directly improves the quality of the resulting vector and the accuracy of downstream retrieval.
Here is an example using LangChain's built-in splitter. You configure the target size, overlap, and the ordered list of separators. It returns an instantiated splitter object that you can run over your documents to produce well-formed text chunks.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Splits first by double newlines (paragraphs), then single newlines, then sentences
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
)
```
Semantic chunking groups sentences by semantic similarity, a method explored extensively by Kamradt [3]. Standard text splitters are blind to topic transitions: they might split a chunk exactly where the author transitions from describing a problem to proposing the solution, effectively separating the context from the answer. Semantic chunking solves this by treating the document as a continuous stream of semantic meaning.
The algorithm works by moving a sliding window over the document, embedding each sentence individually using a lightweight embedding model. It then calculates the cosine similarity between adjacent sentences. When the similarity drops below a predefined threshold, the algorithm detects a "semantic shift" (a change in topic) and places a chunk boundary there.
While this requires more upfront compute during ingestion, the resulting chunks are highly cohesive. This significantly improves the precision of similarity searches, as the embedding accurately reflects a single, unified concept.
This code illustrates a basic semantic chunking loop. It takes input text and an embedding model, generating embeddings for each sentence. By comparing adjacent sentence embeddings, it groups them into coherent chunks and returns the final list of semantic blocks.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List

def semantic_chunk(text: str, model: SentenceTransformer, threshold: float = 0.7) -> List[str]:
    """Split text into semantic chunks based on sentence embeddings."""
    # Simple split by punctuation (simplified for example)
    sentences = [s.strip() for s in text.split('. ') if s]
    if not sentences:
        return []

    # 1. Embed all sentences
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    # 2. Iterate through sentences and merge if similar
    for i in range(1, len(sentences)):
        # Similarity between adjacent sentence embeddings
        similarity = cosine_similarity(
            [embeddings[i - 1]],
            [embeddings[i]]
        )[0][0]

        if similarity > threshold:
            current_chunk.append(sentences[i])
        else:
            # Semantic shift detected; start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    chunks.append(" ".join(current_chunk))
    return chunks
```
| Pros | Cons |
|---|---|
| Semantically coherent chunks | Slower (requires embedding each sentence) |
| Variable size matches content | Hard to control chunk sizes |
| Better retrieval quality | Threshold tuning required |
Document-aware chunking uses document structure (headings, sections, tables) to create meaningful chunks. This approach is central to frameworks like LlamaIndex [4] and preprocessing tools like Unstructured.io [5].
When we process enterprise documents like PDFs or Word files, they contain rich structural metadata: H1 titles, H2 subtitles, bulleted lists, and tables. A naive text splitter destroys this structure, potentially mixing a section about 'API Authentication' with 'Billing Rates'. Document-aware chunking parses the file into an Abstract Syntax Tree (AST) first, identifying the exact boundaries of each section before any text splitting occurs.
If a specific section (like an H3 subsection) is smaller than the target chunk size, it is embedded as a single cohesive unit. If the section is too large, it can be recursively split, but the chunks retain the structural context (the heading hierarchy) as metadata. This ensures that when the LLM receives the chunk, it understands exactly where the information lived within the broader document architecture.
The snippet below simulates a document-aware process. It takes a parsed markdown document and iterates through its sections. It returns a list of chunk dictionaries, where each chunk preserves its structural metadata alongside the text.
```python
from dataclasses import dataclass
from typing import List, Dict, Any

@dataclass
class Section:
    content: str
    heading: str
    level: int

# Mock helper functions for demonstration
def split_by_headings(markdown: str) -> List[Section]:
    """Parse markdown into sections."""
    return []

def count_tokens(text: str) -> int:
    """Return an approximate token count."""
    return len(text.split())

def recursive_split(text: str, max_size: int) -> List[str]:
    """Split text recursively up to max_size."""
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

def document_aware_chunk(markdown: str) -> List[Dict[str, Any]]:
    """Chunk a document while preserving section boundaries."""
    chunks = []
    sections = split_by_headings(markdown)

    for section in sections:
        if count_tokens(section.content) <= 1024:
            # Small sections become single chunks
            chunks.append({
                "text": section.content,
                "heading": section.heading,
                "level": section.level,
            })
        else:
            # Recursively split large sections, but keep the heading context
            sub_chunks = recursive_split(section.content, max_size=512)
            for sub in sub_chunks:
                chunks.append({
                    "text": sub,
                    "heading": section.heading,
                    "level": section.level,
                })
    return chunks
```
Instead of arbitrary text boundaries, Proposition Chunking extracts atomic, standalone statements (propositions) from the text using an LLM [6]. When humans write, we use pronouns, coreferences, and compound sentences to make the text flow naturally. However, embedding models struggle with this density. If a chunk says "It was deployed in 2023 and reduced latency by 40%", the vector lacks the crucial noun.
Proposition chunking uses an LLM to parse the raw text and generate a list of atomic facts where all pronouns are resolved to their proper entities. Because each proposition contains exactly one fact, the embedding vector is incredibly sharp and focused. When a user asks a specific factual question, the cosine similarity between the query and the relevant proposition is much higher than it would be with a dense, multi-topic paragraph chunk. The primary drawback is the significant cost and latency introduced by requiring an LLM inference step for every paragraph in your knowledge base during ingestion.
To implement this, you pass your raw text to an LLM along with a strict system prompt. The prompt instructs the model to decompose the text and resolve pronouns. The output is a clean list of atomic facts ready for embedding.
python1PROMPT = """ 2Decompose the following text into clear, simple propositions. 3Each proposition must be a self-contained sentence that makes sense 4without the surrounding context. Resolve pronouns (it, he, they) to specific nouns. 5 6Text: "Paris, the capital of France, has 2.1M people and is known for the Eiffel Tower." 7Propositions: 81. Paris is the capital of France. 92. Paris has a population of 2.1 million people. 103. Paris is known for the Eiffel Tower. 11"""
| Input Text | Extracted Propositions (Chunks) |
|---|---|
| "The 2023 roadmap focuses on Q3 deliverables." | "The 2023 roadmap focuses on deliverables for the third quarter." |
| "It also mentions a hiring freeze." | "The 2023 roadmap mentions a hiring freeze." |
Choosing the right chunking strategy requires balancing implementation complexity with retrieval quality. The table below summarizes how these methods compare across different dimensions, helping you select the best approach for your specific workload.
| Strategy | Best For | Chunk Quality | Speed | Complexity |
|---|---|---|---|---|
| Fixed-size | Uniform content (logs, code) | Medium | Fast ✅ | Low ✅ |
| Recursive | General text | Good ✅ | Fast | Low |
| Semantic | Varied topics in long docs | Very Good | Slow | Medium |
| Document-aware | Structured docs (manuals, wikis) | Excellent ✅ | Medium | High |
| Proposition | Fact-heavy content | Excellent | Very Slow | Very High |
There is no universally perfect chunk size, as the ideal configuration depends heavily on the structure and density of your documents. The following guidelines provide starting points based on common content types.
| Content Type | Recommended Size | Overlap |
|---|---|---|
| Technical docs | 512-1024 tokens | 50-100 |
| Legal contracts | 256-512 tokens | 50 |
| FAQ/Knowledge base | 128-256 tokens | 0 |
| Code | Function/class level | 0 |
| Chat transcripts | Per-turn | 1-2 turns context |
One of the fundamental tensions in RAG is that embedding models prefer small, highly specific chunks to achieve high retrieval precision, while generative LLMs need broad, expansive context to synthesize high-quality answers. Parent-child retrieval (also known as auto-merging retrieval or small-to-big retrieval) resolves this tension by decoupling the chunk used for search from the chunk used for generation.
In this architecture, a large "parent" chunk (e.g., an entire section of 2000 tokens) is divided into multiple smaller "child" chunks (e.g., 200 tokens each). Only the child chunks are embedded and indexed in the vector database. However, each child chunk maintains a pointer to its parent.
When a user submits a query, the vector search finds the most relevant child chunks with high precision. Before passing the context to the LLM, the system follows the pointer back to the parent chunk and retrieves the surrounding context. If multiple child chunks from the same parent are retrieved, the system can deduplicate and simply pass the parent. This gives the LLM the full narrative context it needs to generate a complete answer.
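The lookup step described above can be sketched in a few lines. This is a minimal illustration, assuming you already have the top child-chunk IDs from vector search; the dictionaries and the function name are hypothetical stand-ins for your vector store and document store.

```python
from typing import Dict, List

def resolve_parents(
    child_hits: List[str],
    child_to_parent: Dict[str, str],
    parent_texts: Dict[str, str],
) -> List[str]:
    """Map retrieved child chunk IDs to deduplicated parent chunks."""
    seen: List[str] = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # deduplicate siblings from the same parent
            seen.append(parent_id)
    return [parent_texts[pid] for pid in seen]
```

Frameworks like LlamaIndex ship this pattern as "auto-merging retrieval"; the sketch shows only the core ID indirection that decouples search granularity from generation context.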
When a document is shattered into hundreds of pieces, each individual chunk loses its connection to the broader work. A chunk containing the sentence "Restart the server to apply changes" is useless if the system does not know which server or which application the document refers to. Injecting metadata directly into the vector database payload ensures this context survives the chunking process.
Robust metadata schemas should capture source tracking (where did this come from?), structural context (where was this within the document?), and temporal data (when was this written?). This allows the retrieval system to perform hybrid search: filtering vectors by exact metadata matches before calculating semantic similarity. This drastically reduces the search space and eliminates hallucinated answers from outdated documentation.
This function enriches a raw text chunk with contextual metadata. It takes the text, source document metadata, and index position, returning a comprehensive dictionary payload. This payload is what you ultimately store in your vector database.
```python
import re
from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class Document:
    id: str
    title: str
    url: str
    total_chunks: int
    created_at: str
    updated_at: str

    def get_page(self, index: int) -> int:
        """Return the page number for a given chunk index."""
        return 1

def enrich_chunk(chunk_text: str, source_doc: Document,
                 chunk_index: int) -> Dict[str, Any]:
    """Add contextual metadata to improve retrieval and attribution."""

    # Heuristic token count (roughly 1.3 tokens per whitespace-separated word)
    token_count = len(chunk_text.split()) * 1.3

    return {
        "text": chunk_text,

        # Source tracking
        "document_id": source_doc.id,
        "document_title": source_doc.title,
        "source_url": source_doc.url,
        "page_number": source_doc.get_page(chunk_index),

        # Structural context
        "chunk_index": chunk_index,
        "total_chunks": source_doc.total_chunks,

        # Temporal metadata
        "created_at": source_doc.created_at,
        "updated_at": source_doc.updated_at,

        # Content properties
        "token_count": int(token_count),
        "has_code": bool(re.search(r'`{3}', chunk_text)),
        "has_table": bool(re.search(r'\|.*\|', chunk_text)),
    }
```
While storing metadata in the vector database is essential for filtering, it does not help the embedding model understand the chunk itself unless that metadata is injected directly into the text payload before the vector is computed. Contextual header prepending is a lightweight strategy to achieve this. By concatenating the document title and the hierarchical section path (e.g., Title > H1 > H2) to the top of every chunk, you artificially ground the text.
This technique is remarkably powerful for short chunks that contain ambiguous terms. For example, a chunk that simply reads "The API rate limit is 50 requests per minute" might embed generically. But if it is prepended to read "Document: Stripe Billing API > Section: Throttling", the resulting vector will cluster tightly with queries specifically about Stripe's billing rate limits. This approach often yields significant improvements in retrieval recall with almost zero implementation complexity.
Here is a straightforward implementation of contextual prepending. It takes a chunk dictionary containing text and metadata, and returns a new string where the hierarchical context is concatenated directly above the main content.
```python
from typing import Dict, Any

def contextualize_chunk(chunk: Dict[str, Any]) -> str:
    """Prepend hierarchical context for better embedding quality."""
    context_parts = []

    if chunk.get("document_title"):
        context_parts.append(f"Document: {chunk['document_title']}")
    if chunk.get("section_heading"):
        context_parts.append(f"Section: {chunk['section_heading']}")

    context = " > ".join(context_parts)
    return f"{context}\n\n{chunk['text']}"

# Before: "The default port is 8080. Configure it in application.yaml."
# After:  "Document: Deployment Guide > Section: Configuration\n\n
#          The default port is 8080. Configure it in application.yaml."
```
In a standard RAG pipeline, the text is split into chunks before being passed to the embedding model. This "early chunking" means the embedding model operates with a severely restricted context window. The attention mechanism within the Transformer (the foundational neural network architecture for most modern LLMs) can only see the tokens inside that specific chunk, leaving it blind to the broader narrative of the document.
Late chunking (introduced by Jina AI, 2024) reverses this order. It passes the entire document (up to the model's context limit, e.g., 8k or 32k tokens) through the embedding model first [7]. Because the Transformer applies bidirectional self-attention across the whole sequence, every token's contextualized representation is influenced by the entire document.
Only after this full-context forward pass does the system split the sequence into boundaries and apply mean pooling to generate the final chunk vectors. The resulting embeddings capture the specific details of the chunk while remaining deeply grounded in the document's overall semantic landscape.
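The pooling step can be illustrated with plain NumPy. This sketch assumes you already have the per-token contextualized embeddings from a single long-context forward pass (in practice, the model's last hidden state) and the chunk boundaries as token offsets from your splitter; the function name is illustrative.

```python
import numpy as np
from typing import List, Tuple

def late_chunk_pool(
    token_embeddings: np.ndarray,         # shape: (num_tokens, dim)
    boundaries: List[Tuple[int, int]],    # (start, end) token offsets per chunk
) -> List[np.ndarray]:
    """Mean-pool full-document token embeddings over each chunk span."""
    return [token_embeddings[start:end].mean(axis=0) for start, end in boundaries]

# Toy example: 6 "tokens" with 4-dimensional embeddings, split into two chunks
tokens = np.arange(24, dtype=float).reshape(6, 4)
chunk_vectors = late_chunk_pool(tokens, [(0, 3), (3, 6)])
```

Unlike traditional chunking, every token vector pooled here has already attended to the entire document, so even short spans inherit global context.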
| Method | Context Window | Retrieval Quality | Speed |
|---|---|---|---|
| Traditional chunking + embedding | Chunk only | Baseline | Fast |
| Late chunking | Full document | Improved recall | Slower (long context inference) |
| Contextual prepending | Chunk + header | +5-15% recall | Same as baseline |
Without rigorous evaluation, tuning chunk sizes and overlap is entirely guesswork. A robust evaluation framework requires a dataset of synthetic or historical user queries mapped to the exact document IDs that contain the answers. By running these queries through your retrieval pipeline, you can measure the impact of different chunking strategies on metrics like Recall@K.
When evaluating, you will typically see a tradeoff curve. Smaller chunks might increase Recall@5 for highly specific fact-based queries but cause the LLM generation step to fail because the surrounding context was lost. Conversely, massive chunks might reduce your vector search recall but improve the LLM's synthesis quality when the correct chunk is found.
Production systems should use metrics that evaluate both the retrieval step and the final generation step (using frameworks like RAGAS [8]). By measuring end-to-end performance, you can empirically determine the chunk size and overlap that best serve your specific document corpus and query distribution.
The evaluation script below measures retrieval effectiveness. It takes your generated chunks, a set of test queries, and known relevant document IDs, then computes the Recall@5 score by checking if the top retrieved chunks belong to the correct documents.
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from typing import List, Dict, Any

def evaluate_chunking(
    chunks: List[Dict[str, Any]],
    eval_queries: List[str],
    eval_labels: List[List[str]],
    embedding_model: Any
) -> Dict[str, float]:
    """Measure how well chunking supports retrieval."""

    # 1. Embed all chunks
    chunk_texts = [c["text"] for c in chunks]
    chunk_embeddings = embedding_model.encode(chunk_texts)

    recall_hits: List[float] = []

    for query, relevant_doc_ids in zip(eval_queries, eval_labels):
        query_emb = embedding_model.encode([query])  # ensure 2D shape (1, dim)

        # Similarity of the query against every chunk: shape (num_chunks,)
        similarities = cosine_similarity(query_emb, chunk_embeddings)[0]

        # Indices of the top-5 most similar chunks
        top_5_indices = similarities.argsort()[-5:][::-1]

        # Check if any retrieved chunk comes from a relevant document
        retrieved_docs = {chunks[i]["document_id"] for i in top_5_indices}
        is_hit = bool(retrieved_docs & set(relevant_doc_ids))
        recall_hits.append(1.0 if is_hit else 0.0)

    avg_tokens = np.mean([len(c["text"].split()) for c in chunks])

    return {
        "recall@5": float(np.mean(recall_hits)),
        "avg_chunk_tokens": float(avg_tokens),
    }
```
🎯 Production tip: The most impactful chunking improvement isn't changing the algorithm. It is adding metadata. Teams that add document title, section heading, and contextual prepending to their chunks see 5-15% recall improvement with zero latency cost. Do this before investing in slower semantic chunking or proposition extraction.
[1] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[2] LangChain (2023). RecursiveCharacterTextSplitter.
[3] Kamradt, G. (2023). 5 Levels of Text Splitting.
[4] Liu, J. (2024). LlamaIndex: A Data Framework for LLM Applications.
[5] Unstructured.io (2024). Unstructured: The Ultimate ETL for LLMs.
[6] Chen, T., et al. (2023). Dense X Retrieval: What Retrieval Granularity Should We Use? arXiv preprint.
[7] Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv preprint.
[8] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.