Turn clean documents into retrieval units that preserve answers, citations, and measurable search quality.
The previous lesson produced a clean record from the ShopFlow returns policy:
returns-policy-v3.pdf, page 7: Damaged electronics: report within 48 hours with photos.
That sentence is trustworthy evidence only while its pieces stay together. Split it into one chunk containing "Damaged electronics" and another containing "48 hours with photos," and a search result can retrieve the condition without the deadline or the deadline without its condition.
Chunking turns normalized records into searchable evidence units. In a retrieval-augmented generation (RAG) system, a retriever selects passages from an external index and the generator answers from those passages.[1] Your chunk boundaries decide what a retrieved passage can prove.
Suppose a customer asks, "My headphones arrived damaged. When must I report it, and what do I send?"
| Candidate retrieval unit | Searchable? | Answerable? | Problem |
|---|---|---|---|
| Damaged electronics: report within | yes | no | deadline and proof are missing |
| 48 hours with photos. Unopened electronics: return within 14 days. | yes | no | condition is missing and a competing rule is present |
| Damaged electronics: report within 48 hours with photos. | yes | yes | preserves condition, deadline, and proof |
The first engineering requirement isn't "make chunks small." It's "make each retrieved chunk a defensible piece of evidence."
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source_id: str
    locator: str

def is_answerable(chunk: Chunk) -> bool:
    # A chunk is answerable only if condition, deadline, and proof all appear in it.
    required = ["damaged electronics", "48 hours", "photos"]
    lowered = chunk.text.lower()
    return all(phrase in lowered for phrase in required)

chunks = [
    Chunk("broken-condition", "Damaged electronics: report within", "returns-policy-v3.pdf", "page=7"),
    Chunk("broken-window", "48 hours with photos. Unopened electronics: return within 14 days.", "returns-policy-v3.pdf", "page=7"),
    Chunk("complete", "Damaged electronics: report within 48 hours with photos.", "returns-policy-v3.pdf", "page=7"),
]

for chunk in chunks:
    print(f"{chunk.chunk_id}: answerable={is_answerable(chunk)}")
```

```text
broken-condition: answerable=False
broken-window: answerable=False
complete: answerable=True
```

The simplest splitter takes windows of tokens. If a window has size $s$ and overlap $o$, the next window starts after $s - o$ tokens. Overlap repeats boundary text, which can help continuity, but it can't guarantee that a complete policy rule survives.
The lab uses whitespace-separated words as visible token stand-ins. A production pipeline should measure with the tokenizer used by its embedding model.
```python
def fixed_windows(text: str, size: int, overlap: int) -> list[str]:
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("require size > overlap >= 0")
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[start : start + size])
        for start in range(0, len(words), step)
        if words[start : start + size]
    ]

policy = (
    "Damaged electronics report within 48 hours with photos. "
    "Unopened electronics return within 14 days."
)

for index, chunk in enumerate(fixed_windows(policy, size=6, overlap=2)):
    print(f"{index}: {chunk}")
```

```text
0: Damaged electronics report within 48 hours
1: 48 hours with photos. Unopened electronics
2: Unopened electronics return within 14 days.
3: 14 days.
```

Now measure what overlap bought you. More repeated words increase indexed text, but the rule becomes useful only when a window carries all three required pieces.
```python
def fixed_windows(text: str, size: int, overlap: int) -> list[str]:
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[start : start + size])
        for start in range(0, len(words), step)
        if words[start : start + size]
    ]

def carries_damage_rule(text: str) -> bool:
    lowered = text.lower()
    return all(term in lowered for term in ["damaged electronics", "48 hours", "photos"])

policy = "Damaged electronics report within 48 hours with photos. Unopened electronics return within 14 days."
for overlap in [0, 2, 4]:
    chunks = fixed_windows(policy, size=7, overlap=overlap)
    complete = sum(carries_damage_rule(chunk) for chunk in chunks)
    indexed_words = sum(len(chunk.split()) for chunk in chunks)
    print(f"overlap={overlap}: chunks={len(chunks)} indexed_words={indexed_words} complete={complete}")
```

```text
overlap=0: chunks=2 indexed_words=14 complete=0
overlap=2: chunks=3 indexed_words=18 complete=0
overlap=4: chunks=5 indexed_words=28 complete=0
```

This is the right posture for overlap: a configuration to evaluate, not a ritual to apply to every document.
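The chunk counts in that sweep follow directly from the step size. As a sanity check, here is a minimal sketch that predicts how many windows a splitter stepping by `size - overlap` emits and how many words it indexes. The helper name `predicted_windows` is hypothetical and not part of the lab; the inputs are the same 14-word policy and window size of 7.

```python
import math

def predicted_windows(n_words: int, size: int, overlap: int) -> tuple[int, int]:
    """Return (window_count, indexed_words) for a splitter that steps by size - overlap."""
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("require size > overlap >= 0")
    step = size - overlap
    # One window per start offset 0, step, 2 * step, ... below n_words.
    count = math.ceil(n_words / step)
    # Each window holds up to `size` words; windows near the end are truncated.
    indexed = sum(min(size, n_words - k * step) for k in range(count))
    return count, indexed

for overlap in (0, 2, 4):
    count, indexed = predicted_windows(14, size=7, overlap=overlap)
    print(f"overlap={overlap}: chunks={count} indexed_words={indexed}")
```

Going from no overlap to an overlap of 4 doubles the indexed words (14 to 28) while the number of complete rules stays at zero, which is the cost the lab output makes visible.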
File ingestion preserved headings and source locations. Use them. A heading boundary is more meaningful than an arbitrary word count when the source already says which rule belongs together.
LangChain's `RecursiveCharacterTextSplitter` is a common generic-text baseline. Its documentation describes trying separators in order, with the default sequence `["\n\n", "\n", " ", ""]`, so paragraphs are preferred before smaller cuts.[2] Start with explicit heading boundaries when the source provides them. Add the smaller-cut fallback only for sections that still exceed your size limit.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyChunk:
    heading: str
    body: str
    indexed_text: str
    locator: str

def headed_chunks(markdown: str, source_locator: str) -> list[PolicyChunk]:
    chunks: list[PolicyChunk] = []
    heading = "Document"
    body: list[str] = []

    def flush() -> None:
        if body:
            body_text = " ".join(body)
            chunks.append(
                PolicyChunk(
                    heading,
                    body_text,
                    f"{heading}\n{body_text}",
                    f"{source_locator}#{heading.lower().replace(' ', '-')}",
                )
            )
            body.clear()

    for line in markdown.strip().splitlines():
        if line.startswith("## "):
            flush()
            heading = line[3:]
        elif line.strip():
            body.append(line.strip())
    flush()
    return chunks

policy = """## Damaged electronics
Report within 48 hours with photos.
## Unopened electronics
Return within 14 days in original packaging."""

for chunk in headed_chunks(policy, "page=7"):
    print(chunk.indexed_text.replace("\n", " | "), "|", chunk.locator)
```

```text
Damaged electronics | Report within 48 hours with photos. | page=7#damaged-electronics
Unopened electronics | Return within 14 days in original packaging. | page=7#unopened-electronics
```

Each chunk's `indexed_text` includes its heading, so the searchable text keeps the condition attached to the deadline. The locator separately preserves the path back to original evidence.
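Because the locator travels with the chunk, a citation can be rebuilt without re-reading the source. The sketch below assumes the `page=N#slug` locator shape used in this lesson; the `format_citation` helper and the wording of its output are hypothetical choices for illustration.

```python
def format_citation(source: str, locator: str) -> str:
    """Turn a 'page=7#damaged-electronics' locator into a readable citation."""
    page_part, _, slug = locator.partition("#")
    key, _, page = page_part.partition("=")
    if key != "page" or not page.isdigit():
        # Fail loudly instead of citing a location we cannot interpret.
        raise ValueError(f"unexpected locator shape: {locator!r}")
    citation = f"{source}, page {page}"
    if slug:
        citation += f", section '{slug.replace('-', ' ')}'"
    return citation

print(format_citation("returns-policy-v3.pdf", "page=7#damaged-electronics"))
print(format_citation("returns-policy-v3.pdf", "page=7"))
```

Rejecting unknown locator shapes keeps a malformed chunk from producing a citation that looks valid but points nowhere.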
A policy table needs an equally deliberate rule. Splitting a row away from its column names turns exact information into ambiguous fragments.
```python
table = [
    "| Condition | Window | Evidence |",
    "| --- | --- | --- |",
    "| Damaged electronics | 48 hours | Photos |",
    "| Unopened electronics | 14 days | Original packaging |",
]

def table_chunk(lines: list[str], heading: str) -> dict[str, str]:
    return {
        "heading": heading,
        "text": "\n".join(lines),
        "quality_check": "header_present" if lines[0].startswith("| Condition |") else "review",
    }

chunk = table_chunk(table, "Return windows")
print(chunk["heading"], chunk["quality_check"], f"rows={len(table) - 2}")
print("48 hours" in chunk["text"] and "Photos" in chunk["text"])
```

```text
Return windows header_present rows=2
True
```

For a long section, split inside that section while carrying its heading and original locator into every child chunk.
```python
def section_windows(text: str, heading: str, source: str, size: int) -> list[dict[str, str]]:
    words = text.split()
    return [
        {
            "text": " ".join(words[start : start + size]),
            "heading": heading,
            "source": source,
            "chunk_id": f"{source}#{heading.lower().replace(' ', '-')}-{start // size}",
        }
        for start in range(0, len(words), size)
    ]

children = section_windows(
    "Report within 48 hours with photos. Include order number and a clear image of damage.",
    heading="Damaged electronics",
    source="returns-policy-v3.pdf:page=7",
    size=7,
)

for child in children:
    print(child["chunk_id"], "|", child["heading"], "|", child["text"])
```

```text
returns-policy-v3.pdf:page=7#damaged-electronics-0 | Damaged electronics | Report within 48 hours with photos. Include
returns-policy-v3.pdf:page=7#damaged-electronics-1 | Damaged electronics | order number and a clear image of
returns-policy-v3.pdf:page=7#damaged-electronics-2 | Damaged electronics | damage.
```
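The word-window children above cut mid-sentence: the first child ends on a dangling "Include". A separator-ordered fallback tries coarser cuts first and only then smaller ones. The following is a minimal standard-library sketch of that idea, not LangChain's implementation. The `recursive_split` helper and its character-based `max_chars` limit are assumptions for illustration, and the sample text adds line breaks that the lab string does not have.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator whose pieces fit within max_chars."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: hard character cuts.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    head, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(head) if p.strip()):
        candidate = f"{current}{head}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece is still too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks

section = (
    "Report within 48 hours with photos.\n"
    "Include order number and a clear image of damage.\n\n"
    "Agents must attach the order number."
)
for piece in recursive_split(section, max_chars=60):
    print(repr(piece))
```

With a 60-character limit, the paragraph break is tried first, then the line break, so each emitted piece is a whole sentence. A production version would measure the limit with the embedding model's tokenizer instead of characters.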
A chunking strategy isn't good because it sounds sophisticated. It works when labeled questions retrieve complete supporting evidence.
Start with a tiny, transparent score before involving a vector database. For the query terms {damaged, electronics, photos, hours}, the complete damage-policy chunk matches all four. A fragment that contains only {damaged, electronics} may rank, but it can't answer the question.
| Chunk | Matching query terms | Contains answer? |
|---|---|---|
| complete damage rule | 4 / 4 | yes |
| damaged-condition fragment | 2 / 4 | no |
| unopened-item rule | 1 / 4 | no |
The following lexical scorer is deliberately simple. It isolates the effect of boundaries; replace its score with real embeddings after the fixture and expected evidence are stable.
```python
import re

def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[dict[str, str]]) -> dict[str, str]:
    query_terms = terms(query)
    return max(chunks, key=lambda chunk: len(query_terms & terms(chunk["text"])))

query = "damaged electronics photos hours"
chunks = [
    {"id": "damage-rule", "text": "Damaged electronics: report within 48 hours with photos."},
    {"id": "general-rule", "text": "Unopened electronics: return within 14 days."},
]

hit = retrieve(query, chunks)
print(hit["id"], hit["text"])
```

```text
damage-rule Damaged electronics: report within 48 hours with photos.
```

Now compare a broken fixed-window configuration against a section-aware configuration using the same question and the same expected evidence phrase.
```python
import re

def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_chunk(query: str, chunks: list[str]) -> str:
    query_terms = terms(query)
    return max(chunks, key=lambda text: len(query_terms & terms(text)))

query = "damaged electronics photos"
expected = "Damaged electronics: report within 48 hours with photos."
configs = {
    "broken-fixed": [
        "Damaged electronics: report within",
        "48 hours with photos. Unopened electronics: return within 14 days.",
    ],
    "section-aware": [
        "Damaged electronics: report within 48 hours with photos.",
        "Unopened electronics: return within 14 days.",
    ],
}

for name, chunks in configs.items():
    hit = top_chunk(query, chunks)
    print(f"{name}: complete={expected in hit}")
```

```text
broken-fixed: complete=False
section-aware: complete=True
```

One question proves little. Ship a small labeled set containing policy exceptions, tables, and boundary failures, then measure each candidate splitter on exactly that set.
```python
import re

def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[dict[str, str]]) -> dict[str, str]:
    query_terms = terms(query)
    return max(chunks, key=lambda chunk: len(query_terms & terms(chunk["text"])))

chunks = [
    {"id": "damage", "text": "Damaged electronics: report within 48 hours with photos."},
    {"id": "unopened", "text": "Unopened electronics: return within 14 days."},
    {"id": "refund", "text": "Approved refunds return to the original payment method."},
]
cases = [
    ("damaged electronics photos", "damage", "48 hours"),
    ("unopened electronics return", "unopened", "14 days"),
    ("refund payment method", "refund", "original payment"),
]

passed = 0
for query, expected_id, evidence in cases:
    hit = retrieve(query, chunks)
    ok = hit["id"] == expected_id and evidence in hit["text"]
    passed += int(ok)
    print(query, "PASS" if ok else "FAIL")
print(f"summary={passed}/{len(cases)}")
```

```text
damaged electronics photos PASS
unopened electronics return PASS
refund payment method PASS
summary=3/3
```

When you replace this lexical baseline with embeddings, the assertions stay useful: retrieve the correct source and retain text sufficient to answer.
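One way to keep those assertions stable across retrievers is to pass the scoring function in as a parameter. The sketch below is an assumption about how such a harness might look: `evaluate`, `lexical_score`, and the `Scorer` alias are hypothetical names, and an embedding-based scorer would be any function with the same `(query, text) -> float` shape.

```python
import re
from typing import Callable

# Hypothetical alias: any function mapping (query, chunk_text) to a relevance score.
Scorer = Callable[[str, str], float]

def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_score(query: str, text: str) -> float:
    """Baseline scorer: number of shared lowercase terms."""
    return float(len(terms(query) & terms(text)))

def evaluate(
    score: Scorer,
    chunks: list[dict[str, str]],
    cases: list[tuple[str, str, str]],
) -> float:
    """Fraction of cases whose top-scored chunk has the expected id and evidence."""
    passed = 0
    for query, expected_id, evidence in cases:
        hit = max(chunks, key=lambda chunk: score(query, chunk["text"]))
        passed += int(hit["id"] == expected_id and evidence in hit["text"])
    return passed / len(cases)

chunks = [
    {"id": "damage", "text": "Damaged electronics: report within 48 hours with photos."},
    {"id": "unopened", "text": "Unopened electronics: return within 14 days."},
    {"id": "refund", "text": "Approved refunds return to the original payment method."},
]
cases = [
    ("damaged electronics photos", "damage", "48 hours"),
    ("unopened electronics return", "unopened", "14 days"),
    ("refund payment method", "refund", "original payment"),
]

print(f"lexical pass rate = {evaluate(lexical_score, chunks, cases):.2f}")
```

Swapping retrievers then means writing one new scorer and rerunning `evaluate` on the unchanged fixture, so any drop in pass rate is attributable to the scorer and not to the test data.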
Some questions match a narrow sentence, while a faithful answer needs its surrounding section. A parent-child design indexes small children for search and stores a pointer to the larger source section returned for generation.
```python
parents = {
    "damage": "Damaged electronics: report within 48 hours with photos. Agents must attach the order number.",
    "unopened": "Unopened electronics: return within 14 days in original packaging.",
}
children = [
    {"text": "48 hours with photos", "parent_id": "damage"},
    {"text": "14 days original packaging", "parent_id": "unopened"},
]

match = next(child for child in children if "photos" in child["text"])
print(match["text"])
print(parents[match["parent_id"]])
```

```text
48 hours with photos
Damaged electronics: report within 48 hours with photos. Agents must attach the order number.
```

Child windows inside a longer section also benefit from their section label. Embedding a child with a contextual header is a cheap hypothesis to test against your labeled set, not a promise of improvement.
```python
def indexed_text(heading: str, text: str, source: str) -> str:
    return f"Source: {source}\nSection: {heading}\n{text}"

child = indexed_text(
    heading="Damaged electronics",
    text="Report within 48 hours with photos.",
    source="ShopFlow Returns Policy",
)
print(child)
```

```text
Source: ShopFlow Returns Policy
Section: Damaged electronics
Report within 48 hours with photos.
```

Structure-aware chunks handle many handbooks and policy pages. Some corpora force different choices:
| Failure after measuring baseline | Candidate experiment | What must still be checked |
|---|---|---|
| one paragraph shifts between multiple topics | semantic boundary detection | hard size cap and labeled-query score |
| tiny match lacks surrounding explanation | parent-child expansion | deduplication and generation budget |
| short child is ambiguous without its section | contextual header | retrieval comparison against no-header baseline |
| meaning depends on far-away document context | late chunking | model support, latency, and measured retrieval gain |
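For the parent-child row, the two checks (deduplication and a generation budget) can be prototyped in a few lines. This is a sketch under stated assumptions: `expand_parents` is a hypothetical helper, child hits are assumed to arrive ranked best-first, and the budget is counted in whitespace-separated words as a stand-in for model tokens.

```python
def expand_parents(
    child_hits: list[dict[str, str]],
    parents: dict[str, str],
    budget_words: int,
) -> list[str]:
    """Map ranked child hits to unique parent ids without exceeding a word budget."""
    seen: set[str] = set()
    selected: list[str] = []
    used = 0
    for hit in child_hits:
        parent_id = hit["parent_id"]
        if parent_id in seen:
            continue  # a second child of the same section must not repeat it
        seen.add(parent_id)
        cost = len(parents[parent_id].split())
        if used + cost > budget_words:
            continue  # skip sections that would overflow the generation budget
        selected.append(parent_id)
        used += cost
    return selected

parents = {
    "damage": "Damaged electronics: report within 48 hours with photos. Agents must attach the order number.",
    "unopened": "Unopened electronics: return within 14 days in original packaging.",
}
ranked_children = [
    {"text": "48 hours with photos", "parent_id": "damage"},
    {"text": "attach the order number", "parent_id": "damage"},
    {"text": "14 days original packaging", "parent_id": "unopened"},
]

print(expand_parents(ranked_children, parents, budget_words=20))
print(expand_parents(ranked_children, parents, budget_words=25))
```

The damage section costs 14 words and the unopened section 9, so a 20-word budget admits only the first parent while 25 admits both. Whether to skip or truncate an over-budget parent is a design choice to settle against the labeled set.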
Semantic chunking places candidate boundaries where neighboring sentence representations change sharply. The next small lab uses transparent topic vectors, so you can see the boundary without trusting an external embedding service.
```python
import math

def vector(sentence: str) -> list[float]:
    lowered = sentence.lower()
    return [
        float(sum(word in lowered for word in ["damaged", "return", "photos", "hours"])),
        float(sum(word in lowered for word in ["refund", "payment", "approved"])),
    ]

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = math.sqrt(sum(a * a for a in left))
    right_norm = math.sqrt(sum(b * b for b in right))
    return dot / (left_norm * right_norm) if left_norm and right_norm else 0.0

sentences = [
    "Damaged electronics require photos.",
    "Report damaged items within 48 hours.",
    "Approved refunds return to the original payment method.",
]

for left, right in zip(sentences, sentences[1:]):
    similarity = cosine(vector(left), vector(right))
    print(f"{similarity:.2f}", "boundary" if similarity < 0.50 else "keep together")
```

```text
1.00 keep together
0.32 boundary
```

Late chunking is a different escalation. Günther et al. encode the longer text first and apply chunk pooling after the transformer's contextual token representations have been computed.[3] That means a child representation can carry information from surrounding document text, but it requires a compatible long-context embedding stack and must earn its additional cost in evaluation.
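The pooling step is the easy part to picture. The sketch below is a toy, not the authors' implementation: the token vectors are hand-written stand-ins for what a long-context embedding model would produce after encoding the whole text, mean pooling is assumed as the pooling operation, and `pool_spans` is a hypothetical helper.

```python
def pool_spans(
    token_vectors: list[list[float]],
    spans: list[tuple[int, int]],
) -> list[list[float]]:
    """Mean-pool already-contextualized token vectors over each [start, end) span."""
    pooled: list[list[float]] = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dims = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dims)])
    return pooled

# Stand-in values: pretend one forward pass over the full document produced these.
token_vectors = [
    [1.0, 0.0], [0.8, 0.2], [0.6, 0.4],  # tokens that fall in chunk 0
    [0.4, 0.6], [0.2, 0.8], [0.0, 1.0],  # tokens that fall in chunk 1
]
spans = [(0, 3), (3, 6)]  # chunk boundaries expressed as token offsets

for span, vector in zip(spans, pool_spans(token_vectors, spans)):
    print(span, [round(value, 2) for value in vector])
```

The contrast with conventional chunking is the order of operations: there, each chunk is encoded on its own, so its token representations never see the rest of the document. Here the boundaries are applied only after encoding.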
For the ShopFlow policy corpus, a credible first release is straightforward:
| Design choice | Initial decision | Evidence to collect |
|---|---|---|
| Default boundary | heading-aware sections, recursive fallback | answerable-chunk rate and retrieval regression suite |
| Tables | keep header plus rows together | exact-value questions preserve correct row meaning |
| Overlap | off for complete policy sections; test for fallback prose | index size, duplicate hits, and labeled-query results |
| Metadata | source, locator, heading, checksum | cited answer can return to original record |
| Escalation | parent expansion before semantic or late chunking | failure examples that justify extra complexity |
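The table names an answerable-chunk rate without defining it, so the sketch below assumes one plausible definition: the fraction of labeled rules for which at least one chunk contains every required phrase. The `answerable_rate` function and the two labeled rules are illustrative, not part of the lesson's lab.

```python
def answerable_rate(chunks: list[str], rules: dict[str, list[str]]) -> float:
    """Share of rules fully contained in at least one chunk (assumed definition)."""
    def covered(required: list[str]) -> bool:
        return any(
            all(phrase in chunk.lower() for phrase in required)
            for chunk in chunks
        )
    return sum(covered(required) for required in rules.values()) / len(rules)

rules = {
    "damage": ["damaged electronics", "48 hours", "photos"],
    "unopened": ["unopened electronics", "14 days"],
}
fixed = [
    "Damaged electronics: report within",
    "48 hours with photos. Unopened electronics: return within 14 days.",
]
sections = [
    "Damaged electronics: report within 48 hours with photos.",
    "Unopened electronics: return within 14 days.",
]

print(f"fixed={answerable_rate(fixed, rules):.2f} sections={answerable_rate(sections, rules):.2f}")
```

This metric only checks that the required phrases are present. It does not penalize a chunk for also carrying a competing rule, so it complements the retrieval regression suite and does not replace it.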
Your release gate can be encoded as an executable manifest check.
```python
from hashlib import sha256

chunks = [
    {
        "id": "damage",
        "text": "Damaged electronics: report within 48 hours with photos.",
        "source": "returns-policy-v3.pdf",
        "locator": "page=7#damaged-electronics",
        "heading": "Damaged electronics",
        "quality": "ready",
    },
    {
        "id": "broken",
        "text": "48 hours with photos.",
        "source": "returns-policy-v3.pdf",
        "locator": "page=7#fragment",
        "heading": "Fragment",
        "quality": "review",
    },
]

for chunk in chunks:
    chunk["checksum"] = sha256(chunk["text"].encode()).hexdigest()

required_metadata = ("source", "locator", "heading", "checksum")
indexable = [
    chunk for chunk in chunks
    if chunk["quality"] == "ready"
    and all(chunk[field] for field in required_metadata)
]
print(f"indexable={[chunk['id'] for chunk in indexable]}")
print(f"blocked={len(chunks) - len(indexable)}")
```

```text
indexable=['damage']
blocked=1
```

You now know how to take the clean evidence records from ingestion and turn them into retrieval units that can be evaluated.
Common failure: a retrieved chunk carries "48 hours" without "Damaged electronics." Fix: split at section boundaries and assert answerability.