The Day My RAG “Worked” But Found Nothing

I had a RAG pipeline running. The model was good. The vector database was fast. The embeddings were correct. And yet, one in three queries returned irrelevant or incomplete answers.

I spent days investigating the model, the prompt, the similarity threshold. Nothing. The problem was something I’d treated as a “trivial decision” in the project’s first hour: how I was splitting the documents.

A crucial sentence, the exact answer to the user’s query, was being split in half between two chunks. Neither half made sense alone, so vector search ranked neither as relevant. And the model, without adequate context, improvised.

That experience taught me something that FloTorch’s February 2026 benchmark later confirmed with data: 80% of RAG failures trace to the ingestion and chunking layer, not the LLM. Most developers spend weeks optimizing prompts and swapping models while retrieval silently returns wrong context every third query.

Chunking is your pipeline’s silent hero — or its devastating villain.

The 2026 Paradox: Simple Beats Complex

Before diving into strategies, I need to share the result that most surprised me this year.

The FloTorch benchmark (February 2026) tested 7 chunking strategies across 50 academic papers with thousands of queries. The result shocked the community:

Recursive 512-token chunking placed first at 69% accuracy. Semantic chunking — which sounds more sophisticated — landed at 54%, producing fragments averaging just 43 tokens (too small to carry sufficient context).

Digital Applied’s analysis confirmed: chunking choice can swing recall by up to 9% on the same corpus. And the universal overlap rule “is no longer safe to assume” — a January 2026 systematic analysis using SPLADE and Mistral-8B found overlap provided no measurable benefit, only increased indexing cost.

The lesson: sophistication isn’t always better. The “boring” 512-token recursive strategy outperformed approaches that run roughly 14x slower at indexing time (the Chonkie throughput figures appear in the semantic chunking section).

The Classic Approaches

1. Fixed-Size Chunking

The simplest approach: set a strict token limit, for example cutting every 500 tokens. It’s “blind”: it cuts at token 500 even if that splits a vital sentence in half.

When to use: quick prototypes, initial tests, validating that RAG works for your use case. You can implement it in 5 minutes.

When to avoid: anything in production where accuracy matters.
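To make the “blind” cut concrete, here is a minimal sketch. It assumes whitespace-separated words as a stand-in for real model tokens (a production pipeline would count tokens with the embedding model’s tokenizer), and the function name is my own.

```python
def fixed_size_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Cut text every max_tokens tokens, ignoring sentence boundaries."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Assumption: whitespace-separated words approximate model tokens.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


sample = "The refund window is 30 days. After that, store credit applies."
for chunk in fixed_size_chunks(sample, max_tokens=7):
    print(repr(chunk))
```

With a limit of 7, the second sentence is cut right after “After”, which is exactly the failure mode described above.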

2. Overlap Chunking

Adds an overlap margin: each block “inherits” tokens from the end of the previous one (typically 10-20% overlap). For a 500-token chunk, that means 50-100 shared tokens.

The benefit: prevents abrupt context loss at boundaries. The 2026 caveat: the January 2026 analysis mentioned earlier found that in certain scenarios overlap doesn’t help and only increases indexing and storage costs. Test on your corpus before assuming you need it.
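A sliding window is one simple way to implement this. The sketch below again assumes whitespace words as tokens, and the function and parameter names are illustrative rather than taken from any library.

```python
def overlapping_chunks(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    # Assumption: whitespace-separated words approximate model tokens.
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


print(overlapping_chunks("a b c d e f g h i j", max_tokens=4, overlap=1))
```

The cost side is easy to read off the step size: with 500-token chunks and 50 tokens of overlap, the window advances 450 tokens at a time, so you store about 500/450 ≈ 1.11x the tokens. At 100 tokens of overlap the factor is 1.25x.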

3. Recursive Chunking

Instead of cutting blindly, it uses natural language separators. It first tries to split by paragraphs; if a block is still too large, it splits by sentences, then by clauses. It respects text structure.

The gold standard for starting a project. The FloTorch benchmark ranks recursive 512-token chunking #1 in accuracy among all 7 strategies tested. It balances speed, cost, and quality. My personal recommendation: start here.
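The paragraph, sentence, clause fallback can be sketched in a few lines. This is a simplified illustration and not the implementation used in the benchmark: the separator patterns, the `LEVELS` table, and the whitespace token count are all my assumptions.

```python
import re

# Split levels, coarsest first: (regex to split on, string used to re-join pieces).
LEVELS = [
    (r"\n\s*\n", "\n\n"),      # paragraphs
    (r"(?<=[.!?])\s+", " "),   # sentences (punctuation stays with the sentence)
    (r"(?<=[,;:])\s+", " "),   # clauses
    (r"\s+", " "),             # single words, last resort
]


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def recursive_chunks(text: str, max_tokens: int = 512, level: int = 0) -> list[str]:
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens or level >= len(LEVELS):
        return [text]
    pattern, joiner = LEVELS[level]
    chunks, buffer = [], ""
    for piece in re.split(pattern, text):
        candidate = buffer + joiner + piece if buffer else piece
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if count_tokens(piece) <= max_tokens:
            buffer = piece
        else:
            # The piece alone is too big: retry it with a finer separator.
            chunks.extend(recursive_chunks(piece, max_tokens, level + 1))
    if buffer:
        chunks.append(buffer)
    return chunks


doc = (
    "Refunds take 30 days. Store credit applies after that.\n\n"
    "Shipping is free over 50 dollars, and returns are free too."
)
for chunk in recursive_chunks(doc, max_tokens=6):
    print(repr(chunk))
```

With a tiny limit of 6, the first paragraph falls back to sentences and the second, a single long sentence, falls back to clauses. No chunk ends mid-clause, which is the whole point of the strategy.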

The Next Level

4. Semantic Chunking

Here we leave word counts behind and focus on meaning. The system generates embeddings for each sentence, measures similarity between them, and groups sentences belonging to the same concept. When the topic shifts dramatically, it cuts.
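Here is a minimal sketch of that loop: embed each sentence, compare neighbors, cut where similarity drops. To keep it runnable without a model, `embed` is a toy bag-of-words vector that I am substituting for a real sentence embedding, and the 0.2 threshold is arbitrary.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; start a new chunk when similarity to the previous sentence drops."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks


text = (
    "The refund policy lasts 30 days. "
    "The refund policy covers damaged items. "
    "Shipping takes five business days."
)
for chunk in semantic_chunks(text):
    print(repr(chunk))
```

Notice there is no minimum chunk size here: every detected topic shift produces a cut, however short the resulting fragment. A sketch like this makes it easy to see how a semantic splitter can end up with very small chunks unless you add a size floor.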

The theory is elegant. In practice, the 2026 data is mixed at best. FloTorch placed semantic chunking at 54%, well below recursive at 69%. The Chonkie benchmark measured it at roughly 14x slower (0.33 MB/s vs 4.82 MB/s for token-based splitting). A 10 GB corpus that indexes in minutes with token splitting takes hours with semantic.
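A quick back-of-the-envelope check of those throughput figures. It assumes the two MB/s numbers hold steady across the whole corpus on a single worker and that 1 GB = 1024 MB; both are simplifications of mine, and the helper name is illustrative.

```python
def indexing_hours(corpus_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to chunk a corpus at a constant throughput."""
    # Assumption: 1 GB = 1024 MB, single worker, chunking time only.
    return corpus_gb * 1024 / throughput_mb_s / 3600


token_based = indexing_hours(10, 4.82)
semantic = indexing_hours(10, 0.33)

print(f"token-based: {token_based * 60:.0f} min")
print(f"semantic:    {semantic:.1f} h")
print(f"slowdown:    {semantic / token_based:.1f}x")
```

Under those assumptions the 10 GB corpus takes roughly 35 minutes with token splitting and about 8.6 hours with semantic chunking, which is where the 14x figure comes from.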

A January 2026 analysis found something surprising: sentence chunking matched semantic up to ~5,000 tokens — at a fraction of the cost.

When it’s worth it: complex texts where concepts shift abruptly (multidisciplinary academic papers, legal documents with distinct clauses). When it’s not: large corpora where processing cost matters.

5. Contextual Retrieval

The elite technique of 2026. Before saving any chunk to the vector database, an LLM analyzes the block and enriches it with context from the entire document.

Instead of saving “Profit grew 10%,” the LLM transforms it into: “In Company X’s cloud division, Q3 2025, profit grew 10%.”

Anthropic’s research on Contextual Retrieval showed significant accuracy gains. But cost scales linearly with the number of chunks: each one requires an LLM call for enrichment. For millions of documents, the budget breaks.
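The enrichment step itself is a small wrapper around an LLM call. In the sketch below the LLM is passed in as a `describe` callable, and `fake_describe` is a hypothetical stand-in so the example runs offline; neither name comes from a real SDK. It prepends a context line to the chunk instead of rewriting the sentence, which is one reasonable variant of the idea.

```python
from typing import Callable


def contextualize_chunks(
    document: str,
    chunks: list[str],
    describe: Callable[[str, str], str],
) -> list[str]:
    """Prefix each chunk with a short, document-aware context line before indexing."""
    enriched = []
    for chunk in chunks:
        # One call per chunk: this is where the cost comes from.
        context = describe(document, chunk).strip()
        enriched.append(f"{context}\n\n{chunk}" if context else chunk)
    return enriched


def fake_describe(document: str, chunk: str) -> str:
    # Hypothetical stand-in for an LLM prompt such as
    # "Situate this chunk within the document in one sentence."
    lines = document.strip().splitlines()
    title = lines[0] if lines else "untitled document"
    return f"From '{title}':"


report = "Company X Q3 2025 cloud division report\nRevenue rose. Profit grew 10%."
for item in contextualize_chunks(report, ["Profit grew 10%."], fake_describe):
    print(item)
```

Because `describe` runs once per chunk, the number of LLM calls equals the number of chunks, so halving chunk size roughly doubles the enrichment bill.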

The CDTA paper (Cross-Document Topic-Aligned) from UIUC (January 2026) took this to the extreme: chunking that reconstructs knowledge at corpus level. On HotpotQA (multi-hop reasoning), it achieved 0.93 faithfulness vs 0.83 for contextual retrieval and 0.78 for semantic. That is a relative gain of about 12% over contextual retrieval, the previous best practice (p < 0.05).

The “Context Cliff”: Why Chunks Aren’t Just About Size

A January 2026 finding that changed how I think about chunking: there’s a “context cliff” around 2,500 tokens where response quality drops abruptly. Above that point, bigger chunks aren’t better — they’re worse.

This connects to the “lost in the middle” phenomenon and context rot I discussed in the context window post. Model attention concentrates at the beginning and end of input. Middle information gets processed with less confidence. Chunks of 2,500+ tokens push critical information into the attention “dead zone.”

Practical recommendation: chunks of 256-512 tokens are the sweet spot for most use cases. Large enough to carry context. Small enough to stay far below the context cliff.

My Updated Playbook

After taking plenty of hits and researching 2026 benchmarks, here’s what I use:

Step 1: Start with recursive 512 tokens. It’s FloTorch’s #1. Simple implementation. Low cost. Surprisingly high accuracy. Add 10-15% overlap, but test whether it actually helps on your specific corpus.

Step 2: Implement hybrid retrieval. Combine vector search (dense) with BM25 (keyword). Many failures that look like chunking problems are actually retrieval failures — pure vector search misses keyword matches.

Step 3: Add re-ranking. After initial search, use a cross-encoder to reorder by actual relevance. This compensates for chunking imprecisions.

Step 4: Selective Contextual Retrieval. Apply contextual enrichment only on the most critical documents (contracts, regulations, safety manuals). Not on the entire corpus. The cost doesn’t justify it for generic documentation.

Step 5: Monitor retrieval accuracy. Don’t assume it works. Implement metrics (RAGAS, faithfulness, answer relevancy) and monitor in production. If 80% of RAG failures trace to ingestion and chunking, monitoring is the only way you will find them.
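Steps 2, 3, and 5 fit together as a small pipeline, sketched below under several assumptions of mine. The post doesn’t prescribe a fusion method, so I use reciprocal rank fusion (RRF) with the commonly used constant k=60. `word_overlap` is a toy stand-in for a real cross-encoder, and `hit_rate` is a bare-bones retrieval metric, not a replacement for RAGAS.

```python
import re
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Step 2: merge ranked chunk-ID lists (e.g. dense and BM25) into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def rerank(query: str, candidates: list[str], chunks: dict[str, str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Step 3: reorder candidates by a (query, chunk text) relevance score."""
    ordered = sorted(candidates, key=lambda cid: score(query, chunks[cid]), reverse=True)
    return ordered[:top_n]


def hit_rate(results: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Step 5: fraction of labeled queries whose expected chunk was retrieved."""
    if not gold:
        return 0.0
    hits = sum(1 for q, expected in gold.items() if expected in results.get(q, []))
    return hits / len(gold)


def word_overlap(query: str, text: str) -> float:
    # Toy stand-in for a cross-encoder: number of shared lowercase words.
    q_words = set(re.findall(r"\w+", query.lower()))
    t_words = set(re.findall(r"\w+", text.lower()))
    return float(len(q_words & t_words))


chunks = {
    "c1": "Refunds are issued within 30 days of purchase.",
    "c2": "Our refund team works weekdays.",
    "c3": "Shipping is free over 50 dollars.",
    "c4": "Purchase history is under account settings.",
}
dense = ["c2", "c1", "c3"]  # hypothetical vector-search ranking
bm25 = ["c1", "c4", "c2"]   # hypothetical keyword ranking

fused = reciprocal_rank_fusion([dense, bm25])
top = rerank("refund within 30 days", fused, chunks, word_overlap, top_n=2)
print(fused, top)
print(hit_rate({"q1": top}, {"q1": "c1", "q2": "c3"}))
```

In production you would swap `word_overlap` for a real cross-encoder and feed `hit_rate` from a labeled query set that you re-run whenever the chunking configuration changes.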

Conclusion: The Silent Hero (Or the Villain)

Chunking isn’t sexy. It doesn’t appear in demos. Nobody posts Twitter threads about “my new chunking strategy.” But it’s the decision that most impacts your RAG’s quality — more than the model, more than the prompt, more than the vector database.

And the most counterintuitive result of 2026 is that the “simple” strategy frequently beats the “sophisticated” one. Recursive 512 tokens beat semantic in a rigorous benchmark. Overlap doesn’t always help. And smaller chunks (256-512) outperform larger ones in most scenarios.

Sophistication is tempting. But results are what matter. And 2026 data is saying something clear: start simple, measure, and add complexity only when data justifies it.

Share if this saved your pipeline:

80% of RAG failures come from chunking. And the simplest strategy is often the best. 2026 data proved it — and my wallet confirms.

