RAG in Production: Beyond the Basic Retrieval Pipeline
Basic RAG is easy to demo and hard to get right in production. Chunking strategies, reranking, hybrid search, and evaluation pipelines are what separate toy prototypes from reliable AI search systems.
# RAG in Production: Beyond the Basic Retrieval Pipeline

Retrieval-Augmented Generation (RAG) is one of the most widely deployed patterns for grounding LLMs in private data. But most teams hit the same wall: demos work, production doesn't. Here's what separates production RAG from toy RAG.

## The Basic Pipeline (and Its Limits)

```
User query → Embed → Vector search → Top-K chunks → LLM prompt → Answer
```

This works in demos. In production, it fails for four recurring reasons: poor chunking, missing metadata, semantic drift in embeddings, and no quality evaluation.

## 1. Chunking Strategy

Chunking is where most RAG systems break down.

| Strategy | Best For |
|---|---|
| Fixed-size (512 tokens) | Quick prototyping |
| Sentence-window | Preserving context around relevant sentences |
| Recursive character | Mixed document types |
| Semantic chunking | Splitting at topic boundaries |

Prefer **sentence-window chunking** where you can: store small chunks for precise retrieval, then expand each hit to its surrounding sentences before sending it to the LLM.

## 2. Hybrid Search

Pure vector search misses exact keyword matches such as product codes, error strings, and proper names. Combine it with BM25:

```python
# Hybrid search with RRF (Reciprocal Rank Fusion).
# vector_store, bm25_index, and reciprocal_rank_fusion are placeholders
# for your own retrieval components.
vector_results = vector_store.search(query_embedding, top_k=20)  # dense / semantic matches
bm25_results = bm25_index.search(query_text, top_k=20)           # exact keyword matches
fused = reciprocal_rank_fusion([vector_results, bm25_results])   # merge by rank, not raw score
final = fused[:5]
```

## 3. Reranking

After retrieval, pass the candidates through a cross-encoder reranker (Cohere Rerank, BGE-Reranker) to reorder them by relevance to the query.

```python
import cohere

co = cohere.ClientV2()  # assumes the API key is set in the CO_API_KEY environment variable

reranked = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[c.text for c in candidates],
    top_n=5,
)
```

Cross-encoders score each query-document pair jointly, so they are slower than embedding similarity but usually much more accurate. That is why they run only on the retrieved candidates, not on the whole corpus.

## 4. Metadata Filtering

Store metadata with every chunk and filter on it at query time, so irrelevant documents never reach the candidate set:

```python
# Restrict the search to matching documents before ranking by similarity.
results = vector_store.search(
    query_embedding,
    filter={"doc_type": "policy", "department": "engineering"},
    top_k=10,
)
```

## 5. Evaluation Pipeline

You can't improve what you don't measure.
Build an eval suite:

- **Retrieval recall**: Are the right chunks being retrieved?
- **Faithfulness**: Does the answer only use retrieved content?
- **Answer relevancy**: Does the answer address the question?

Tools: Ragas, TruLens, LangSmith.

Production RAG is a data engineering problem as much as an AI problem.
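
The hybrid search snippet in section 2 calls `reciprocal_rank_fusion` without defining it. Here is a minimal sketch of how such a helper could look. It assumes each retriever returns a list of document IDs ordered best-first (the article's `vector_results` and `bm25_results` may be richer objects, in which case you would extract their IDs first), and it uses `k=60`, the smoothing constant commonly used with RRF. The function signature and the sample IDs are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[Hashable]],
    k: int = 60,  # assumed default; a commonly used RRF constant
) -> list[Hashable]:
    """Fuse several best-first rankings of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Raw similarity and BM25 scores are ignored,
    so the two retrievers do not need comparable score scales.
    """
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical IDs returned by the two retrievers, best match first.
vector_ids = ["d3", "d1", "d7"]
bm25_ids = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([vector_ids, bm25_ids])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents that both retrievers rank highly (`d1`, `d3`) rise to the top, while documents found by only one retriever still survive further down the list. That is the behavior you want before handing candidates to a reranker.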
