LeetLLM
LearnFeaturesPricingBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Pricing
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 108 articles completed

🛠️Computing Foundations0/3
Python for AI EngineeringNumPy and Tensor ShapesData Structures for AI
📊Math & Statistics0/4
Probability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/11
Vectors, Matrices & TensorsNeural Networks from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧪Core LLM Foundations0/9
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFunction Calling & Tool UseFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧮ML Algorithms & Evaluation0/8
Linear Regression from ScratchValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines
🧰Applied LLM Engineering0/17
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingMCP & Tool Protocol StandardsPrompt Injection DefenseAI Agent Evaluation and BenchmarkingProduction RAG PipelinesHybrid Search: Dense + SparseLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringPre-training Data at ScaleMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsDesign an Automated Support Agent
🎓Portfolio Capstones0/4
Capstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/6
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/10
Scaling Laws & Compute-Optimal TrainingDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/13
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingAI Coding Workflow with AgentsAgent Memory & PersistenceHuman-in-the-Loop AgentsAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/14
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Quantization: GPTQ, AWQ & GGUFSpeculative DecodingLong Context Window ManagementMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
Track Your Progress

Create a free account to save your reading progress across devices and unlock the full learning experience.

LeetLLM Premium
  • All question breakdowns
  • Architecture diagrams
  • Model answers & rubrics
  • Follow-up Q&A analysis
  • New content weekly
Back to Topics
LearnCore LLM FoundationsFile Ingestion for AI
🔍EasyRAG & Retrieval

File Ingestion for AI

Learn how real AI systems turn PDFs, HTML pages, OCR images, and Markdown files into clean text records ready for chunking and retrieval.

30 min readOpenAI, Anthropic, Google +25 key concepts
Learning path
Step 24 of 108 in the full curriculum
Function Calling & Tool UseChunking Strategies

Most RAG failures start before retrieval.

The user uploads a PDF. The app extracts text in the wrong order. The footer appears on every page. A scanned page becomes an empty string. A Markdown code block loses indentation. The retriever then indexes all of that noise, and the model gets blamed when answers are weak.

File ingestion is the process of turning messy source files into clean, traceable text records that later systems can chunk, embed, retrieve, cite, and debug.

This chapter builds on Function Calling & Tool Use. Tools let an LLM reach outside its context window. Ingestion decides what trustworthy context those tools and retrievers can find.

Document ingestion pipeline showing PDF, HTML, OCR image, and Markdown inputs normalized into provenance-rich text records before chunking. Document ingestion pipeline showing PDF, HTML, OCR image, and Markdown inputs normalized into provenance-rich text records before chunking.
Visual anchor: compare the four source formats on the left, then trace how each one becomes the same chunk-ready record shape on the right.

Why ingestion comes before chunking

Chunking splits text into pieces.

But chunking assumes you already have good text.

If the input is dirty, chunking spreads the dirt. A repeated footer becomes hundreds of near-duplicate chunks. A broken PDF table becomes nonsense text. A scanned page becomes a missing answer. A Markdown heading disappears, and the retriever loses a useful section label.

The right order is:

Diagram Diagram

The important output is not "a string."

The output is a record.

FieldExampleWhy it matters
source_idemployee_handbook.pdfTies chunks back to a file.
source_typepdfExplains parser path.
page7Supports citations.
headingRefund policyHelps retrieval and display.
textEmployees can...The content to chunk.
parserpypdfDebugs extraction changes.
checksumsha256:...Detects changed files.
qualityok, ocr, reviewKeeps bad text visible.

RAG, short for Retrieval-Augmented Generation, depends on retrieval quality as much as generation quality.[1]

PDFs: native text first

A PDF is a page layout format, not a clean article format.

Some PDFs contain selectable text. Some are scanned images. Some have columns, tables, footnotes, headers, and rotated text. A parser may extract text, but not always in the order a human reads it.

Start with native extraction because it's cheap and preserves text when available. Libraries such as pypdf can read pages and extract text from many PDFs.[2]

python
1from dataclasses import dataclass 2 3@dataclass 4class PageText: 5 page: int 6 text: str 7 quality: str 8 9def quality_label(text: str) -> str: 10 cleaned = text.strip() 11 if len(cleaned) < 40: 12 return "needs_ocr_or_review" 13 if "\ufffd" in cleaned: 14 return "bad_encoding" 15 return "ok" 16 17page = PageText( 18 page=7, 19 text="Refund policy\nEmployees can submit refund requests within 14 days.", 20 quality=quality_label("Refund policy\nEmployees can submit refund requests within 14 days."), 21) 22 23print(page) 24assert page.quality == "ok"

Expected output:

text
1PageText(page=7, text='Refund policy\nEmployees can submit refund requests within 14 days.', quality='ok')

PDF failure symptoms

SymptomLikely causeFix
Empty textScanned page or protected file.Route page to OCR or manual review.
Words out of orderMulti-column layout.Use layout-aware extraction or page image review.
Footer appears everywhereHeader/footer not removed.Detect repeated page text and drop it.
Table becomes word saladLayout table not preserved.Extract table separately or keep page image evidence.
Random replacement charactersEncoding issue.Flag page for alternate parser.

Don't index a bad extraction because "some text is better than no text."

Bad text can be worse than missing text because it creates confident retrieval noise.

OCR: when text is an image

OCR means Optical Character Recognition. It turns pixels into text.

Use OCR when native extraction fails or when the source is an image. Tesseract is a widely used open-source OCR engine, and its docs emphasize the importance of image quality and configuration.[3]

OCR should be observable because it can be wrong.

OCR input issueResult
Low resolutionMissing or confused characters.
Skewed scanLines merge or split.
Complex tableReading order breaks.
HandwritingAccuracy may drop sharply.
Multiple languagesNeeds language configuration.

Store whether text came from OCR. Later, when an answer cites a page, you may want to show lower confidence or let a user open the original image.

OCR routing rule

This small rule catches common cases before chunking.

python
1def needs_ocr(native_text: str, page_image_available: bool) -> bool: 2 if not page_image_available: 3 return False 4 cleaned = native_text.strip() 5 if not cleaned: 6 return True 7 if len(cleaned) < 40: 8 return True 9 if cleaned.count("\n") == 0 and len(cleaned) > 500: 10 return True 11 return False 12 13print(needs_ocr("", page_image_available=True)) 14print(needs_ocr("Refund policy with enough readable text on the page.", page_image_available=True)) 15assert needs_ocr("", True) is True

Expected output:

text
1True 2False

The exact thresholds will change by product. The habit matters more: route low-quality pages before indexing.

HTML: remove boilerplate

HTML pages include real content and page chrome.

Menus, sidebars, cookie banners, footer links, "related articles," and script text can all enter your parser. If you index them, retrieval will surface generic navigation text instead of the answer.

Beautiful Soup parses HTML into a tree that you can inspect and clean.[4]

python
1from bs4 import BeautifulSoup 2 3html = """ 4<html> 5 <body> 6 <nav>Home Pricing Docs</nav> 7 <main> 8 <h1>Refund policy</h1> 9 <p>Annual plans can be refunded within 14 days.</p> 10 </main> 11 <footer>Copyright Example Inc.</footer> 12 </body> 13</html> 14""" 15 16soup = BeautifulSoup(html, "html.parser") 17for tag in soup(["nav", "footer", "script", "style"]): 18 tag.decompose() 19 20text = soup.get_text("\n", strip=True) 21print(text) 22assert "Annual plans" in text 23assert "Pricing" not in text

Expected output:

text
1Refund policy 2Annual plans can be refunded within 14 days.

HTML cleaning checklist

CheckWhy
Remove nav/footer/script/style.Cuts repeated boilerplate.
Keep headings.Headings improve retrieval and citations.
Preserve links when useful.URLs can be evidence.
Drop hidden text.Invisible text may be tracking or injection.
Normalize whitespace.Cleaner chunks and previews.

HTML also creates a security problem: untrusted pages can contain prompt injection text. Don't treat web content as instructions. Treat it as quoted source data.

Markdown: preserve structure

Markdown looks simple, but structure still matters.

Headings, lists, tables, and code fences help readers and retrievers. If your parser flattens everything into one paragraph, you lose boundaries that chunking could use.

Good Markdown ingestion keeps:

  • heading hierarchy
  • list boundaries
  • table rows
  • code fences and language tags
  • source line ranges when useful

For technical docs, code fences are often the answer.

If a user asks "What command starts the service?", a chunk that preserved uvicorn app:app --reload is more useful than a paragraph that swallowed it.

Markdown record shape

python
1markdown = """ 2# Deploy 3 4Run the service: 5 6~~~bash 7uvicorn app:app --reload 8~~~ 9""" 10 11record = { 12 "source_id": "README.md", 13 "heading": "Deploy", 14 "text": markdown.strip(), 15 "contains_code": "~~~" in markdown, 16} 17 18print(record["heading"], record["contains_code"]) 19assert record["contains_code"] is True

Expected output:

text
1Deploy True

When a record contains code, retrieval and answer formatting should preserve code blocks. Otherwise the model may paraphrase commands in ways that don't run.

One normalized record

Format-specific parsers are different.

The downstream system should still receive one normalized record shape.

python
1from dataclasses import dataclass 2from hashlib import sha256 3 4@dataclass(frozen=True) 5class IngestedRecord: 6 source_id: str 7 source_type: str 8 text: str 9 page: int | None 10 heading: str | None 11 parser: str 12 quality: str 13 checksum: str 14 15def make_record( 16 source_id: str, 17 source_type: str, 18 text: str, 19 parser: str, 20 page: int | None = None, 21 heading: str | None = None, 22) -> IngestedRecord: 23 checksum = sha256(text.encode("utf-8")).hexdigest()[:12] 24 quality = "ok" if len(text.strip()) >= 40 else "review" 25 return IngestedRecord( 26 source_id=source_id, 27 source_type=source_type, 28 text=text.strip(), 29 page=page, 30 heading=heading, 31 parser=parser, 32 quality=quality, 33 checksum=checksum, 34 ) 35 36record = make_record( 37 source_id="handbook.pdf", 38 source_type="pdf", 39 page=7, 40 heading="Refund policy", 41 parser="pypdf", 42 text="Refund policy. Annual plans can be refunded within 14 days.", 43) 44 45print(record.quality, record.checksum) 46assert record.source_id == "handbook.pdf"

Expected output will look like this:

text
1ok a5ef3f7fbcda

The checksum lets you detect duplicates and changes. If the same page text appears again, you can skip reprocessing. If it changes, you can re-chunk and re-embed only affected records.

Quality gates before indexing

Add a gate before chunking.

GateFail whenAction
Minimum text lengthPage has almost no text.OCR or review.
Repeated boilerplateSame paragraph appears across many pages.Drop repeated text.
Encoding qualityReplacement chars appear often.Try another parser.
Language matchWrong language detected.Route to language-specific parser.
Source provenanceMissing source ID or page.Reject record.
Security scanText contains obvious prompt injection.Store as source, never instruction.

Unstructured document libraries can help route many file types through a common partitioning interface, but you still need product-specific quality checks.[5]

Generic parsing is a starting point. Your users' documents define the real bar.

Active learning exercise

Given this extracted page:

text
1Home | Pricing | Docs 2Refund policy 3Annual plans can be refunded within 14 days. 4Copyright Example Inc.

Create the chunk-ready record.

Strong answer:

json
1{ 2 "source_id": "pricing.html", 3 "source_type": "html", 4 "heading": "Refund policy", 5 "page": null, 6 "text": "Refund policy\nAnnual plans can be refunded within 14 days.", 7 "parser": "beautifulsoup", 8 "quality": "ok", 9 "removed_boilerplate": ["Home | Pricing | Docs", "Copyright Example Inc."] 10}

The key move is not the JSON formatting.

The key move is separating source text from boilerplate and keeping provenance.

Key takeaways

  • Ingestion turns messy files into clean, traceable records before chunking.
  • PDFs need native extraction first, then OCR or review when extraction quality is low.
  • HTML needs boilerplate removal or repeated navigation text will pollute retrieval.
  • Markdown ingestion should preserve headings, code fences, and tables.
  • Every record should keep source ID, type, page or heading, parser, quality, and checksum.
  • Bad extraction should be visible before it reaches embeddings.

Next, continue to Chunking Strategies. There, you'll take clean ingestion records and split them into retrieval units that keep enough context without drowning the model.

Evaluation Rubric
  • 1
    Chooses the right parser path for PDF, HTML, image, or Markdown input
  • 2
    Preserves provenance such as source URL, page number, and parser version
  • 3
    Detects low-quality extraction before indexing bad text
  • 4
    Separates parse, clean, normalize, and chunk-ready stages
  • 5
    Explains when OCR is necessary and why it should be observable
Common Pitfalls
  • Treating every file as plain text and losing tables, page numbers, and headings.
  • Indexing OCR output without a confidence or quality check.
  • Chunking before cleaning, which spreads boilerplate across many retrieval records.
  • Throwing away the original file checksum, making reprocessing and deduplication hard.
  • Assuming PDF text order always matches visual reading order.
Follow-up Questions to Expect

Key Concepts Tested
Document parsing versus text cleaningPDF text extraction and OCR fallbackHTML boilerplate removalMarkdown structure preservationIngestion records with provenance and parse status
References

pypdf Documentation.

pypdf Contributors. · 2026 · Official documentation

Beautiful Soup Documentation.

Leonard Richardson. · 2026 · Official documentation

Tesseract OCR Documentation.

Tesseract OCR Project. · 2026 · Official documentation

Unstructured: The Ultimate ETL for LLMs

Unstructured.io · 2024

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

Lewis, P., et al. · 2020 · NeurIPS 2020

PreviousFunction Calling & Tool UseNextChunking Strategies
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail

Your account is free and you can post anonymously if you choose.