Most teams tuning a retrieval-augmented generation (RAG) system reach for the same knobs first: a better embedding model, a larger context window, a reranker. Those help. But on real enterprise documents, the setting that quietly decides whether retrieval works is one most pipelines never revisit after day one — how the document was split into chunks before anything was embedded.
This guide is for engineers and architects who have a RAG system in production and a sense that the retrieval misses are not random — they cluster around the documents with the most structure.
1. What chunking actually decides
A chunk is the smallest unit your retriever can return. When a user asks a question, the system embeds the query, finds the nearest chunks, and hands them to the model as context. The model can only reason over what the retriever surfaced — so the chunk boundary is not a preprocessing detail. It is the resolution limit of the entire system.
Two failure shapes follow. Chunks that are too large carry several unrelated topics, so the embedding becomes an average that matches everything weakly and nothing precisely. Chunks that are too small sever a fact from the context that gives it meaning: a number without its column header, a clause without the section it modifies. Both lead to the same outcome: the model receives context that looks relevant but isn't, and it answers anyway.
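The dilution effect of oversized chunks can be illustrated with a toy example. The sketch below uses bag-of-words count vectors and cosine similarity as a crude stand-in for a learned embedding model; the query and chunk texts are invented for illustration, and real embeddings will behave differently in detail.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: token -> count. A stand-in for a real embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Invented texts: one focused chunk, one chunk that mixes three topics.
query = "refund policy"
focused = "refund policy allows returns within thirty days"
mixed = focused + " shipping rates vary by region office hours are nine to five"

print("focused:", round(cosine(bow(query), bow(focused)), 3))
print("mixed:  ", round(cosine(bow(query), bow(mixed)), 3))
```

The mixed chunk contains the same relevant words as the focused one, yet scores lower against the query because the unrelated topics inflate its vector. In a large index, that weaker match can be outranked by chunks that are only superficially similar.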
The instinct is to fix this by adjusting chunk size. But size is the wrong dial when the real problem is that the splitter ignored the document's structure. A 500-token window does not know it just cut a table in half. It only counts tokens.
2. Fixed-size chunking and where it quietly breaks
The default in most RAG tutorials is fixed-size chunking: slide a window of N tokens across the text, optionally overlapping consecutive windows. It is trivial, fast, and entirely blind to what it cuts.
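As a minimal sketch of the technique just described, the function below slides a fixed window across a token list with optional overlap. Whitespace splitting stands in for a real tokenizer, and the function name, parameters, and sample sentence are illustrative assumptions rather than any particular library's API.

```python
def fixed_size_chunks(tokens, size, overlap=0):
    """Slide a window of `size` tokens across `tokens`.

    Consecutive windows share `overlap` tokens, so the window
    advances by `size - overlap` tokens each step.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Invented sample: a heading ("Termination") followed by its clause.
text = "Termination Either party may terminate this agreement with thirty days notice"
tokens = text.split()  # whitespace tokens stand in for a real tokenizer
chunks = fixed_size_chunks(tokens, size=5, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Nothing in the function inspects what the tokens mean. The boundaries fall wherever the count says they fall, which is exactly why the approach is fast and why it cannot tell a heading from a table cell.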
On clean, single-column prose, this is fine. The trouble starts the moment a document has layout: a section heading that belongs to the paragraphs beneath it, a table whose cells only mean something next to their row and column labels, a multi-column page where reading order is not top-to-bottom.
Here the token window does real damage. It detaches a heading from its body, so one chunk has a topic with no content and the next has content with no topic. It splits a table across two chunks, so neither half is interpretable. None of this shows up on a tidy test PDF — it shows up in production, on the messy documents that made you buy a RAG system in the first place, as confident wrong answers. Those are the hardest kind to catch.
3. Why tables are the breaking point
If one structure exposes a chunking strategy, it is the table. A table encodes meaning in two dimensions: a cell's value is defined by its row and its column at once. Flatten it into a token stream and that meaning collapses.
Picture a financial statement where a row reads "Net revenue 52,340 48,110 41,920" under three fiscal-year columns. A fixed-size boundary that lands a few rows above puts those numbers in a chunk with no header row. A query for last year's net revenue retrieves three numbers and a label, with nothing to say which number belongs to which year. The retriever did its job; the chunk was built to fail.
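The scenario above can be reproduced in a few lines. In this sketch the "Net revenue" row is taken from the example, while the column labels (FY-A, FY-B, FY-C), the other rows, and the window size of 10 are placeholders chosen only to make the boundary land inside the table.

```python
def windows(tokens, size):
    """Non-overlapping fixed-size windows over a token list."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

table = [
    "Item FY-A FY-B FY-C",               # header row; labels are placeholders
    "Gross revenue x x x",               # placeholder row, values elided
    "Returns x x x",                     # placeholder row, values elided
    "Net revenue 52,340 48,110 41,920",  # the row from the example above
]

# Flatten the table into a token stream, as a text-only pipeline would.
tokens = " ".join(table).split()
chunks = windows(tokens, size=10)

# Find the chunk a query about net revenue would most plausibly retrieve.
hit = next(chunk for chunk in chunks if "52,340" in chunk)
print(" ".join(hit))
print("has header:", "FY-A" in hit)
```

The retrieved chunk holds the label and all three values, but the header row sits in the previous chunk, so the check prints `has header: False`. Nothing in the chunk says which value belongs to which year.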
This is why purely text-based RAG struggles on the documents enterprises care about most: financial reports, specifications, claims packets. They are dense with tables, so the fix is to chunk by the document's visual structure, not just its character stream. A vision-capable model that recognizes a block as a table and keeps it whole preserves the relationships a token counter destroys.
4. Designing a layout-aware chunking strategy
Layout-aware chunking starts from a different premise: the document already tells you where the boundaries are, if the pipeline can see them. Three moves make it work.
Respect structure first, size second. Chunk along the document's own divisions — sections, sub-sections, table units, lists — and fall back to size limits only when a single unit is too large. Size becomes a constraint applied within structure, not a ruler laid blindly over the text.
Keep tables and their headers together. A table, with its header row, should travel as one chunk whenever it fits, so every retrieved value still carries the labels that define it. When a table is too large, split it by rows and repeat the header in each piece, so no fragment is ever headerless.
Carry the structural path as metadata. Store where each chunk sits — its section heading, its place in the hierarchy — alongside the chunk. A query that implicitly refers to a location ("the penalty terms in the indemnification section") can then be disambiguated by that path, not just by raw text similarity.
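The three moves can be combined in one short sketch. It assumes an upstream parser has already produced typed blocks; the `Block` type, its field names, and the `kind` labels are hypothetical, and table size is limited by row count here purely to keep the example small (a production version would more likely budget by tokens). The sample column labels and filler rows are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    kind: str                                 # "heading", "text", or "table" (hypothetical labels)
    text: str = ""                            # heading title or paragraph text
    level: int = 0                            # heading depth, 1 = top level
    header: str = ""                          # table header row
    rows: list = field(default_factory=list)  # table body rows

def split_tokens(text, max_tokens):
    """Size fallback: split one oversized text unit into token windows."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def split_table(header, rows, max_rows):
    """Split a table by rows, repeating the header in every piece."""
    if not rows:
        return [header]
    return ["\n".join([header] + rows[i:i + max_rows]) for i in range(0, len(rows), max_rows)]

def layout_aware_chunks(blocks, max_tokens=200, max_rows=20):
    stack = []   # open headings as (level, title), outermost first
    chunks = []
    for block in blocks:
        if block.kind == "heading":
            # A new heading closes any heading at the same or a deeper level.
            while stack and stack[-1][0] >= block.level:
                stack.pop()
            stack.append((block.level, block.text))
            continue
        if block.kind == "table":
            pieces = split_table(block.header, block.rows, max_rows)
        else:
            pieces = split_tokens(block.text, max_tokens)
        path = tuple(title for _, title in stack)  # structural path metadata
        for piece in pieces:
            chunks.append({"path": path, "kind": block.kind, "text": piece})
    return chunks

doc = [
    Block("heading", "Financial summary", level=1),
    Block("heading", "Income statement", level=2),
    Block("text", "Figures are in thousands."),
    Block("table", header="Item | FY-A | FY-B | FY-C",
          rows=["Net revenue | 52,340 | 48,110 | 41,920",
                "Row 2 | x | x | x",
                "Row 3 | x | x | x"]),
]
chunks = layout_aware_chunks(doc, max_rows=2)
for chunk in chunks:
    print(chunk["path"], chunk["kind"], repr(chunk["text"]))
```

With `max_rows=2` the three-row table is split into two pieces, and both begin with the header row. Every chunk also carries the path `("Financial summary", "Income statement")`, which a retriever could filter or boost on alongside text similarity.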
All three depend on one upstream capability: the system has to recognize structure in the first place. If the parser preserves layout — tables as tables, headings bound to sections — layout-aware chunking has something to work with. If it flattens everything to a character stream, no chunking strategy downstream can recover what was lost. This is why document parsing sits beneath every retrieval decision, a point we develop in our overview of enterprise RAG architecture.
This is the layer where Korea Deep Learning's DEEP Agent is designed to contribute. Because it reads documents with a vision-language model rather than flattening them through legacy OCR, it preserves the structural cues — table boundaries, header associations, reading order — that layout-aware chunking relies on, and emits structured units that carry their place in the document.
5. Testing whether chunking is your problem
Before rebuilding a pipeline, confirm chunking is the bottleneck. Two checks isolate it.
Read the retrieved chunks, not just the answers. When the system answers wrong, look at what the retriever actually returned. If the chunks are fragments — a headerless block of numbers, a heading with no body — the failure is upstream of the model, and no prompt change will fix a broken chunk.
Test on your hardest documents. Chunking failures hide on clean single-column PDFs and surface on table-heavy, multi-column files. Evaluate retrieval specifically on the document types that carry the most structure, because that is where the gap between fixed-size and layout-aware chunking is widest.
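Both checks can be partly automated as a first pass before manual review. The sketch below flags retrieved chunks that look like fragments and reports the flagged share per document type. The heuristics, thresholds, flag names, and sample chunks are all assumptions made for illustration; in particular, a complete table with many rows will also be flagged as mostly numbers, so treat the output as a list of candidates to read, not a verdict.

```python
import re
from collections import defaultdict

NUMERIC = re.compile(r"[\d.,%$()\-]+")

def is_numeric(token):
    """True for tokens such as 52,340 or (12.5%) that contain a digit."""
    return bool(NUMERIC.fullmatch(token)) and any(ch.isdigit() for ch in token)

def fragment_flags(chunk, numeric_ratio=0.5, min_body_tokens=8):
    """Heuristic flags for chunks that look severed from their context."""
    tokens = chunk.split()
    if not tokens:
        return ["empty"]
    flags = []
    numeric = sum(is_numeric(t) for t in tokens)
    if numeric / len(tokens) >= numeric_ratio:
        flags.append("mostly_numbers")   # possible headerless table rows
    ends_like_sentence = chunk.rstrip().endswith((".", "?", "!"))
    if numeric == 0 and len(tokens) < min_body_tokens and not ends_like_sentence:
        flags.append("bare_heading")     # possible heading with no body
    return flags

def fragment_rate_by_type(retrieved):
    """retrieved: (doc_type, chunk_text) pairs collected from failed queries."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for doc_type, chunk in retrieved:
        totals[doc_type] += 1
        flagged[doc_type] += int(bool(fragment_flags(chunk)))
    return {doc_type: flagged[doc_type] / totals[doc_type] for doc_type in totals}

# Invented sample of chunks returned on failed queries.
retrieved = [
    ("prose", "Either party may terminate this agreement with thirty days written notice."),
    ("table_heavy", "Net revenue 52,340 48,110 41,920"),
    ("table_heavy", "Indemnification and Liability"),
]
rates = fragment_rate_by_type(retrieved)
print(rates)
```

A fragment rate that is markedly higher on table-heavy documents than on clean prose is the pattern the two checks are looking for.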
When these checks point at chunking, switching to a layout-aware strategy often yields a larger retrieval gain than another round of embedding-model tuning, because it repairs the input every other component depends on.
Conclusion
In a RAG system, the chunk is the unit of retrieval, and its boundary is the resolution limit of everything downstream. Fixed-size chunking sets that boundary by counting tokens, which is fine on clean prose and quietly broken on the structured documents enterprises actually rely on. Layout-aware chunking sets it by reading the document's own structure, keeping the units that carry meaning intact. And because it depends on extraction that preserves structure, the chunking decision and the parsing decision are really one decision.
Want to see the difference on your own documents? Bring a table-heavy PDF your current pipeline struggles with, and watch how preserving structure at extraction time changes what the retriever can return. Request a demo at koreadeep.com.
Frequently asked questions
What is layout-aware chunking in RAG? A chunking approach that splits documents along their own structural boundaries — sections, headings, tables, lists — instead of by a fixed token count. The goal is for every chunk to stay meaningful on its own, so the retriever returns units the model can reason over rather than fragments severed from their context.
Why does fixed-size chunking fail on tables? A table encodes meaning in rows and columns at once. A fixed-size boundary inside a table separates values from the header row that labels them, so a retrieved fragment becomes numbers without context. Keeping a table — with its header — as a single chunk preserves the relationships that make the values interpretable.
Is chunking more important than the embedding model? They solve different problems, but chunking comes first, so its errors propagate. A strong embedding model cannot rescue a chunk built wrong — if text was severed from its context before embedding, the vector encodes a fragment. Fixing chunking often yields a larger retrieval gain than further embedding-model tuning.
How do I know if chunking is hurting my RAG system? Inspect the chunks the retriever returns on failed queries. If they are fragments — headerless numbers, topic-less headings — the problem is upstream of the model. Testing on table-heavy documents rather than clean samples is the fastest way to expose it.