Why How You Split Your Documents Matters More Than You Think

Picture this: you’ve built a RAG pipeline. You’ve got your PDF loaded, your embeddings generated, your vector store humming along. You ask the system a straightforward question - “How many weeks of parental leave do I get as a Team Lead with 3 years of service?” - and the answer comes back vague, incomplete, or outright wrong. The information is right there in the document. So what went wrong?

The answer, more often than not, isn’t your embedding model. It isn’t your LLM. It’s how you split the document in the first place.

In this post, I’ll walk you through building a RAG system that compares two chunking strategies side by side - naive fixed-size chunking versus layout-aware chunking - and show you why the difference matters far more than most tutorials let on.

The Chunking Problem Nobody Talks About

Most RAG tutorials gloss over chunking. They’ll show you a quick text.slice() or a LangChain RecursiveCharacterTextSplitter and move on to the “interesting” parts: embeddings, vector databases, prompt engineering. But chunking is the interesting part - or at least, the part where things quietly go wrong.

When you chunk a document naively - splitting it into fixed-size character blocks - you’re making an implicit assumption: that semantic boundaries align with character counts. They don’t. A 250-character window doesn’t care that it just sliced a table row in half, separated a policy title from its description, or split “12 weeks of paid leave” across two chunks where one chunk says “12 weeks of” and the other says “paid leave for employees who…”.
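A quick way to see this is to slice a short sentence with a deliberately tiny window. The sentence and the window size below are made up for illustration; they are not taken from the benefits PDF used later in this post.

```javascript
// Illustrative only: a made-up sentence and a tiny window to make the cut visible.
function fixedSizeSlices(text, size) {
  const slices = [];
  for (let i = 0; i < text.length; i += size) {
    slices.push(text.slice(i, i + size));
  }
  return slices;
}

const sentence = 'Employees receive 12 weeks of paid leave after one year.';
const slices = fixedSizeSlices(sentence, 29);

console.log(slices);
// → [ 'Employees receive 12 weeks of', ' paid leave after one year.' ]
```

Neither slice on its own says how many weeks of paid leave an employee gets, which is exactly the failure mode described above.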

The retrieval step then dutifully finds the most similar chunk to your query, but that chunk is missing half the answer. The LLM does its best with what it gets, but it’s working with a broken context. Garbage in, garbage out - except here, it’s fragmented in, hallucinated out.

Two Strategies, One Pipeline

To make this concrete, I built a system that runs both chunking strategies through an identical RAG pipeline and compares the results. The stack is intentionally lean: Node.js, Google’s Gemini API for embeddings and generation, and Vectra for local vector storage. No frameworks, no orchestration layers.

The pipeline follows a straightforward flow:

  1. Parse a PDF document
  2. Chunk it using both strategies
  3. Generate embeddings for all chunks
  4. Index everything in a vector store
  5. Run the same query against both chunk sets
  6. Compare the answers
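
The six steps above can be sketched as a single function. This is a rough outline rather than the project’s actual entry point: every function on `deps` (parsePdf, layoutAwareChunk, buildVectorStore, createPipeline) is a hypothetical name chosen for this sketch, passed in so the flow can be exercised with stubs.

```javascript
// Sketch of the six-step flow. Every function on `deps` is a hypothetical
// stand-in for a module in the project; the names are assumptions.
async function runComparison(pdfPath, question, deps) {
  // 1. Parse the PDF into { filename, text } documents.
  const documents = await deps.parsePdf(pdfPath);

  // 2. Chunk the same documents with both strategies.
  const naiveChunks = deps.naiveChunk(documents);
  const layoutChunks = deps.layoutAwareChunk(documents);

  // 3. Embed each chunk set.
  const embedded = [
    ...(await deps.generateEmbeddings(naiveChunks)),
    ...(await deps.generateEmbeddings(layoutChunks)),
  ];

  // 4. Index everything in one vector store.
  const store = await deps.buildVectorStore(embedded);
  const pipeline = deps.createPipeline(store);

  // 5 and 6. Run the same question against both chunk sets and return both
  // answers so they can be compared side by side.
  return {
    naive: await pipeline.query(question, 'naive'),
    layout: await pipeline.query(question, 'layout-aware'),
  };
}
```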

Let’s look at each chunking strategy.

The Naive Approach: Fixed-Size Windows

The naive chunker does exactly what the name suggests. It slides a fixed-size window across the entire document text, producing uniform chunks with a small overlap to avoid hard boundary cuts:

const CHUNK_SIZE = 250; // window size in characters
const OVERLAP = 25; // characters shared between consecutive chunks

export function naiveChunk(documents) {
  const chunks = [];
  let globalIndex = 0; // unique across all documents

  for (const doc of documents) {
    const text = doc.text;
    const source = doc.filename;
    let chunkIndex = 0; // position within this document

    // Advance by CHUNK_SIZE - OVERLAP so each window re-reads
    // the last OVERLAP characters of the previous one.
    for (let i = 0; i < text.length; i += CHUNK_SIZE - OVERLAP) {
      const end = Math.min(i + CHUNK_SIZE, text.length);
      const content = text.slice(i, end);

      chunks.push({
        id: `naive-${globalIndex}`,
        content: content.trim(),
        metadata: { source, chunkIndex, type: 'naive' }
      });

      globalIndex++;
      chunkIndex++;
    }
  }

  return chunks;
}

This is simple, predictable, and fast. It’s also completely blind to the document’s structure. A table cell, a heading, a list item - they’re all just characters in a string.
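
To see where the cuts fall, it can help to print just the offsets. The helper below is a hypothetical debugging aid, not part of the project; it mirrors the loop bounds of naiveChunk and nothing else. It also surfaces a side effect of this loop: when the text ends shortly after a window starts, the final chunk can lie entirely inside the previous chunk’s overlap.

```javascript
// Mirrors the loop bounds of naiveChunk, returning [start, end) offsets only.
function chunkSpans(length, size = 250, overlap = 25) {
  const spans = [];
  for (let i = 0; i < length; i += size - overlap) {
    spans.push([i, Math.min(i + size, length)]);
  }
  return spans;
}

console.log(chunkSpans(600));
// → [ [ 0, 250 ], [ 225, 475 ], [ 450, 600 ] ]

console.log(chunkSpans(250));
// → [ [ 0, 250 ], [ 225, 250 ] ]  (the second chunk repeats the last 25 characters)
```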

The Layout-Aware Approach: Structure Preserving Chunks

The layout-aware chunker takes a fundamentally different approach. Instead of treating the document as a flat string, it parses the text into semantic sections by detecting structural elements: headers, tables, lists, and paragraph boundaries.

The core of the strategy is a parseIntoSections function that iterates through lines and classifies each one with a handful of pattern-based helpers:

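function isHeader(line) {
  // Markdown-style level-2 heading, e.g. "## Title"
  if (/^##\s+[^#]/.test(line)) return true;
  // Numbered heading, e.g. "1. Overview"
  if (/^\d+\.\s+[A-Z]/.test(line)) return true;
  // Capitalized title ending in a four-digit number such as a year
  if (/^[A-Z][a-zA-Z\s]+\d{4}$/.test(line)) return true;
  // Letters-and-spaces line of title-like length, excluding known table labels
  if (
    /^[A-Z][a-zA-Z\s]+$/.test(line) &&
    line.length > 20 &&
    line.length < 60 &&
    !line.match(/Package|Notes|Type|Structure|Bonus/)
  ) {
    return true;
  }
  return false;
}

function isTableRow(line) {
  // Runs of two or more whitespace characters inside the line suggest column gaps
  if (/\s{2,}/.test(line.trim()) && line.trim().length > 10) return true;
  // Pipe- or tab-delimited rows
  if (line.includes('|')) return true;
  if (line.includes('\t')) return true;
  return false;
}

function isListItem(line) {
  // Bulleted ("•", "-", "*") or numbered ("1." / "1)") items, optionally indented
  return /^[\s]*[•\-\*]\s/.test(line) || /^[\s]*\d+[.)]\s/.test(line);
}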

Each header triggers a new section. Table rows get grouped together. List items stay with their parent content. The result is a set of chunks where each one represents a coherent semantic unit - a complete section, a full table, an entire list - rather than an arbitrary slice of characters.
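
The grouping pass itself is not shown in this post, so here is a minimal sketch of what it could look like. It is an assumption, not the project’s parseIntoSections: the classifiers are simplified stand-ins for the ones above, and the sketch keeps everything between two headers in one section while recording whether that section contains a table or a list.

```javascript
// Simplified stand-ins for the classifiers shown above (assumptions for this sketch).
const isHeader = (line) => /^##\s+[^#]/.test(line);
const isTableRow = (line) => line.includes('|');
const isListItem = (line) => /^\s*[•\-*]\s/.test(line) || /^\s*\d+[.)]\s/.test(line);

// Hypothetical grouping pass: each header opens a new section, and every
// following line stays in that section until the next header.
function parseIntoSections(text) {
  const sections = [];
  let current = { header: null, lines: [], hasTable: false, hasList: false };

  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '') continue; // blank lines carry no content

    if (isHeader(line)) {
      // A header closes the previous section and opens a new one.
      if (current.lines.length > 0) sections.push(current);
      current = {
        header: line.replace(/^##\s+/, ''),
        lines: [line],
        hasTable: false,
        hasList: false,
      };
      continue;
    }

    // Table rows and list items stay inside the section that introduced them.
    if (isTableRow(line)) current.hasTable = true;
    else if (isListItem(line)) current.hasList = true;
    current.lines.push(line);
  }

  if (current.lines.length > 0) sections.push(current);
  return sections.map((s) => ({ ...s, content: s.lines.join('\n') }));
}

// Made-up sample text, not from the benefits PDF.
const sample = [
  'Intro line before any header.',
  '## Parental Leave',
  'All employees start with a base of 8 weeks.',
  '| Role | Multiplier |',
  '| Team Lead | 1.25x |',
  '## Sick Leave',
  '- Up to 10 days per year',
].join('\n');

const sections = parseIntoSections(sample);
console.log(sections.map((s) => [s.header, s.lines.length, s.hasTable, s.hasList]));
```

The point of the sketch is the unit of output: the table rows end up in the same section as the sentence and header that explain them.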

When the parsed sections are assembled into chunks, they carry rich metadata about what they contain:

chunks.push({
  id: `layout-${globalIndex}`,
  content: sectionContent.trim(),
  metadata: {
    source,
    chunkIndex,
    type: 'layout-aware',
    section: currentHeader,
    layoutInfo: {
      isHeader: section.type === 'header',
      isTable: section.type === 'table',
      isList: section.type === 'list',
      section: currentHeader,
    },
  },
});

This metadata isn’t just for decoration. It enables downstream filtering, debugging, and more sophisticated retrieval strategies.
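
As one example of downstream filtering, the sketch below keeps only the chunks flagged as tables, optionally within one section. The helper name and the sample chunks are hypothetical; only the metadata shape comes from the snippet above.

```javascript
// Hypothetical helper: keep only chunks whose layoutInfo marks them as tables,
// optionally restricted to a single section name.
function tableChunks(chunks, sectionName) {
  return chunks.filter((chunk) => {
    const info = chunk.metadata.layoutInfo;
    if (!info || !info.isTable) return false;
    return sectionName === undefined || info.section === sectionName;
  });
}

// Made-up chunks that follow the metadata shape shown above.
const exampleChunks = [
  { id: 'layout-0', metadata: { type: 'layout-aware', layoutInfo: { isTable: false, section: 'Overview' } } },
  { id: 'layout-1', metadata: { type: 'layout-aware', layoutInfo: { isTable: true, section: 'Parental Leave' } } },
  { id: 'naive-0', metadata: { type: 'naive' } }, // naive chunks carry no layoutInfo
];

console.log(tableChunks(exampleChunks).map((c) => c.id));
// → [ 'layout-1' ]
```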

The Shared Pipeline

Both sets of chunks flow through the same embedding and retrieval pipeline. Embeddings are generated with Google’s gemini-embedding-001 model in batches of 10, with a short pause between batches, to stay within rate limits:

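import { GoogleGenAI } from '@google/genai';

// Assumption: the API key is read from the GEMINI_API_KEY environment variable.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function generateEmbeddings(chunks) {
  const batchSize = 10;
  const vectorChunks = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, Math.min(i + batchSize, chunks.length));

    // Embed every chunk in the batch concurrently.
    const embeddingPromises = batch.map(async (chunk) => {
      const result = await ai.models.embedContent({
        model: 'gemini-embedding-001',
        contents: chunk.content,
      });

      // Keep the original chunk fields and attach the vector.
      return {
        ...chunk,
        embedding: result.embeddings[0].values,
      };
    });

    const batchResults = await Promise.all(embeddingPromises);
    vectorChunks.push(...batchResults);

    // Pause briefly before the next batch, but not after the last one.
    if (i + batchSize < chunks.length) {
      await new Promise((resolve) => setTimeout(resolve, 100));
    }
  }

  return vectorChunks;
}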
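
The query side needs the same treatment: the question has to be embedded with the same model as the chunks, otherwise the vectors are not comparable. The RAGPipeline class further down calls an embedQuery helper that is not shown in this post, so the sketch below is an assumption about what it might look like. The factory name createEmbedQuery is hypothetical, and the client is passed in so the helper can be exercised with a stub.

```javascript
// Sketch only. `ai` is expected to expose models.embedContent with the same
// request and response shape used in generateEmbeddings above.
function createEmbedQuery(ai, model = 'gemini-embedding-001') {
  return async function embedQuery(question) {
    const result = await ai.models.embedContent({ model, contents: question });
    return result.embeddings[0].values;
  };
}
```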

Both chunk types get indexed into the same Vectra vector store. At query time, the RAGPipeline class embeds the user’s question, searches for similar chunks, filters by chunking type, and passes the retrieved context to Gemini for generation:

export class RAGPipeline {
  constructor(vectorStore) {
    this.vectorStore = vectorStore;
  }

  async query(question, chunkingType, topK = 3) {
    const queryEmbedding = await embedQuery(question);

    const allResults = await this.vectorStore.search(queryEmbedding, topK * 2);

    const retrievedChunks = allResults
      .filter((result) => result.chunk.metadata.type === chunkingType)
      .slice(0, topK);

    const context = retrievedChunks
      .map((result, idx) => `[${idx + 1}] ${result.chunk.content}`)
      .join('\n\n');

    const prompt = `You are a helpful assistant. Answer the question based only on the provided context.

Context:
${context}

Question: ${question}`;

    const response = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents: prompt,
    });

    return {
      query: question,
      retrievedChunks,
      generatedAnswer: response.text,
      chunkingType,
    };
  }
}

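The key detail here is topK * 2 followed by the filter. Since both chunk types coexist in the same index, we over-fetch and then narrow down to the requested strategy. Both strategies are searched in the same vector space with the same query embedding, which keeps the comparison fair. One caveat: doubling topK does not guarantee topK survivors after the filter. If the other strategy dominates the top results, which is more likely when it contributes more chunks to the index, the requested strategy can end up with fewer than topK chunks.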
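
One way to make the over-fetch more robust is to widen the search until enough chunks of the requested type come back. The function below is a hypothetical variant, not the project’s code. It assumes only what the pipeline above already assumes: vectorStore.search(embedding, n) resolves to results sorted by similarity, each carrying result.chunk.metadata.type. The maxFetch cap is an arbitrary value chosen for this sketch.

```javascript
// Hypothetical retrieval step that widens the fetch until it has topK chunks
// of the requested type, the index is exhausted, or the cap is reached.
async function searchByType(vectorStore, queryEmbedding, chunkingType, topK, maxFetch = 64) {
  let fetchSize = topK * 2;

  while (true) {
    const results = await vectorStore.search(queryEmbedding, fetchSize);
    const matching = results.filter((r) => r.chunk.metadata.type === chunkingType);

    // Fewer results than requested means the index has nothing more to give.
    const exhausted = results.length < fetchSize;
    if (matching.length >= topK || exhausted || fetchSize >= maxFetch) {
      return matching.slice(0, topK);
    }

    fetchSize = Math.min(fetchSize * 2, maxFetch);
  }
}
```

If the vector store supports metadata filters natively, filtering inside the search call is usually simpler than widening the fetch; the sketch avoids relying on that so it stays store-agnostic.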

Why the Difference Matters

When you run a query like “How many weeks of parental leave do I get as a Team Lead with 3 years of service?” against an employee benefits PDF, the naive chunker might retrieve a chunk that starts mid-sentence: “…for employees in Tier 2 positions. Parental leave entitlement is based on” - and cuts off right before the actual answer. The embedding matched well because the words “parental leave” and “Tier 2” are present. But the chunk is incomplete.

The layout-aware chunker, by contrast, retrieves the full parental leave section - header, table, and all - because it kept that semantic unit intact during chunking. The LLM gets a complete context and produces a complete answer.

This isn’t a theoretical distinction. It’s the difference between a RAG system that works in demos and one that works in production.

Output from a sample run:

📄 Parsing PDF...

✂️ Creating chunks...
🧮 Generating embeddings...
Generating embeddings for 18 chunks...
Generating embeddings for 8 chunks...
✅ Done

🗄️ Setting up vector store...
✅ Store ready

🔍 Query: "How many weeks of parental leave do I get as a Team Lead with 3 years of service?"

🟡 NAIVE CHUNKING
Query embedding length: 3072
🟡 NAIVE RETRIEVED CHUNK:
: +8 weeks
10+ years: +12 weeks
Factor 3: Role Multiplier
Your final total is then adjusted by role:
Individual Contributor: 1.0x (no change)
Team Lead: 1.25x
Manager: 1.5x
Director+: 2.0x
Example Calculation:
Sarah has worked here for 6 years as a Ma

Answer: The context doesn't contain enough information to answer the question, as it does not provide a base number of weeks for parental leave before adjustments. It only lists additions (+8 weeks, +12 weeks for 10+ years) and role multipliers.

🔵 LAYOUT-AWARE CHUNKING
Query embedding length: 3072
🔵 LAYOUT RETRIEVED CHUNK:
Parental Leave Calculation
To calculate your total parental leave entitlement, you need to consider three factors:
Factor 1: Base Entitlement
All employees start with a base of 8 weeks paid leave.
Factor 2: Service Bonus
Additional weeks are added based on years of service:
0-2 years: +0 weeks
2-5 years: +4 weeks
5-10 years: +8 weeks
10+ years: +12 weeks
Factor 3: Role Multiplier
Your final total is then adjusted by role:
Individual Contributor: 1.0x (no change)
Team Lead: 1.25x
Manager: 1.5x
Director+: 2.0x
Example Calculation:
Sarah has worked here for 6 years as a Manager. Her calculation:
-- 1 of 4 --
Page 2 of 4
Base (8 weeks) + Service bonus (8 weeks) = 16 weeks
16 weeks × Manager multiplier (1.5x) = 24 weeks total
Calculate your leave by adding base + service bonus, then multiply by your role multiplier.

Answer: As a Team Lead with 3 years of service, you get 15 weeks of parental leave.

Calculation:
* Base Entitlement: 8 weeks
* Service Bonus (3 years falls into 2-5 years): +4 weeks
* Subtotal: 8 + 4 = 12 weeks
* Role Multiplier (Team Lead): 1.25x
* Total Leave: 12 weeks * 1.25 = 15 weeks
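
The arithmetic in that answer is easy to check by hand or in a few lines of code. The values below are copied from the retrieved chunk. The service brackets in the source overlap at their edges ("0-2", "2-5", and so on), so treating each lower bound as inclusive is an assumption of this sketch, and the function names are hypothetical.

```javascript
// Policy values copied from the retrieved chunk above.
const BASE_WEEKS = 8;
const ROLE_MULTIPLIER = {
  'Individual Contributor': 1.0,
  'Team Lead': 1.25,
  'Manager': 1.5,
  'Director+': 2.0,
};

// Assumption: a bracket's lower bound is inclusive (exactly 2 years earns +4).
function serviceBonusWeeks(years) {
  if (years >= 10) return 12;
  if (years >= 5) return 8;
  if (years >= 2) return 4;
  return 0;
}

// (base + service bonus) × role multiplier, as the policy text describes.
function parentalLeaveWeeks(role, years) {
  const multiplier = ROLE_MULTIPLIER[role];
  if (multiplier === undefined) throw new Error(`Unknown role: ${role}`);
  return (BASE_WEEKS + serviceBonusWeeks(years)) * multiplier;
}

console.log(parentalLeaveWeeks('Team Lead', 3)); // → 15
console.log(parentalLeaveWeeks('Manager', 6)); // → 24, matching Sarah's example
```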

A Few Takeaways

Chunking is retrieval design. The way you split your documents determines what your retrieval step can find. No amount of prompt engineering will compensate for a chunk that’s missing half the answer.

Structure-aware doesn’t mean complex. The layout-aware chunker in this project is pattern-based - regular expressions detecting headers, tables, and lists. It’s not using a vision model or a document AI service. Simple heuristics go a long way when the alternative is completely ignoring structure.

Metadata is your friend. Tagging chunks with their structural type (table, list, header, section name) opens the door to filtered retrieval, better debugging, and hybrid strategies where you might weight table chunks differently from prose.
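
A minimal sketch of that weighting idea follows. It assumes each search result exposes a numeric score where higher is better, which the pipeline code in this post does not show, and it reuses the layoutInfo flags from the chunk metadata. The weights are made-up values, not tuned ones.

```javascript
// Hypothetical re-ranking step: scale each result's score by a factor that
// depends on its structural type, then sort by the adjusted score.
function reweightByLayout(results, weights = { table: 1.2, list: 1.1 }) {
  return results
    .map((result) => {
      const info = result.chunk.metadata.layoutInfo || {};
      let factor = 1;
      if (info.isTable) factor = weights.table;
      else if (info.isList) factor = weights.list;
      return { ...result, score: result.score * factor };
    })
    .sort((a, b) => b.score - a.score);
}

// Made-up results: the table chunk starts slightly behind the prose chunk.
const ranked = reweightByLayout([
  { score: 0.8, chunk: { id: 'prose', metadata: { layoutInfo: { isTable: false, isList: false } } } },
  { score: 0.7, chunk: { id: 'table', metadata: { layoutInfo: { isTable: true, isList: false } } } },
]);

console.log(ranked.map((r) => r.chunk.id));
// → [ 'table', 'prose' ]
```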

Test with real queries. The side-by-side comparison approach - running the same question through both strategies - is the fastest way to see where your chunking is failing. Build this into your evaluation pipeline early.
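
A small harness is enough to get started. The sketch below assumes pipeline.query(question, chunkingType) resolves to an object with a generatedAnswer string, as in the RAGPipeline class above. The compareStrategies name and the shape of the test cases are hypothetical, and substring matching is a crude stand-in for a real answer-quality metric.

```javascript
// Runs every question through every strategy and records whether the
// expected snippet appears in the generated answer.
async function compareStrategies(pipeline, cases, strategies = ['naive', 'layout-aware']) {
  const report = [];

  for (const { question, expected } of cases) {
    const row = { question };
    for (const strategy of strategies) {
      const { generatedAnswer } = await pipeline.query(question, strategy);
      row[strategy] = generatedAnswer.toLowerCase().includes(expected.toLowerCase());
    }
    report.push(row);
  }

  return report;
}
```

Each row of the report shows at a glance which strategy produced an answer containing the expected value, which makes regressions visible when you change chunk sizes or heuristics.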

Wrapping Up

RAG has become table stakes for building AI applications that reason over documents. But the quality of your RAG system lives and dies in the details that most people skip over. Chunking is one of those details.

Before you reach for a more powerful embedding model or a larger context window, look at what you’re actually feeding into the pipeline. Chances are, the problem isn’t that your model can’t understand the answer - it’s that the answer was split across two chunks and neither one made the cut.

Sometimes the highest-leverage improvement isn’t a better model. It’s a better split.