What is RAG?
Retrieval-Augmented Generation (RAG) is an architecture pattern that gives AI models access to your organisation's knowledge. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from your data and feeds them to the model as context before generating a response.
The result: AI that can answer questions about your specific policies, products, procedures, and data — with source citations.
Architecture Overview
A production RAG pipeline has four main stages: Ingest (process documents), Store (create and index embeddings), Retrieve (find relevant chunks), and Generate (produce an answer). Each stage has multiple design decisions that impact quality, cost, and latency.
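The four stages can be sketched as function signatures to show how data flows between them. This is a toy outline with illustrative names (not any particular library's API), and the retrieval step uses naive word overlap purely as a stand-in for real embedding search:

```python
def ingest(raw_docs: list[str]) -> list[str]:
    """Parse and clean raw documents into plain-text chunks."""
    return [d.strip() for d in raw_docs if d.strip()]

def store(chunks: list[str]) -> dict[int, str]:
    """Index chunks (a real system embeds them and writes to a vector DB)."""
    return dict(enumerate(chunks))

def retrieve(index: dict[int, str], query: str, k: int = 2) -> list[str]:
    """Find the k chunks sharing the most words with the query (toy scoring)."""
    words = set(query.lower().split())
    ranked = sorted(index.values(),
                    key=lambda c: -len(words & set(c.lower().split())))
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    """A real system calls an LLM here; this just assembles the prompt."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
```

Each function is a seam where the design decisions below plug in: parsers in `ingest`, chunking and embedding in `store`, ranking tricks in `retrieve`, prompting in `generate`.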
Document Ingestion
Start with a robust document processing pipeline. You need to handle PDFs, Word documents, spreadsheets, emails, and web pages. Use tools like unstructured, LlamaParse, or custom parsers for domain-specific formats.
Clean extracted text by removing headers, footers, page numbers, and formatting artifacts. Preserve document structure — headings, lists, and tables carry important semantic information.
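A minimal cleaning pass might look like the sketch below. The regex patterns are illustrative starting points to tune against your own documents, not a complete solution:

```python
import re

def clean_page_text(text: str) -> str:
    """Strip common extraction artifacts while keeping the body text."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop bare page numbers ("12") and "Page 12 of 40" style footers.
        if re.fullmatch(r"(page\s+)?\d+(\s+of\s+\d+)?", stripped, re.IGNORECASE):
            continue
        lines.append(stripped)
    # Collapse blank-line runs left behind by removed headers/footers.
    cleaned = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()
```

Run this per page rather than per document, so repeated headers and footers are caught before pages are concatenated.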
Chunking Strategies
Chunking determines how documents are split for embedding. Common strategies include fixed-size chunks (500-1000 tokens), semantic chunking (split at topic boundaries), and recursive character splitting. The right choice depends on your document types.
For technical documentation, semantic chunking preserves context better. For legal contracts, paragraph-level chunking maintains clause integrity. Always include chunk overlap (10-20%) to avoid losing context at boundaries.
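Fixed-size chunking with overlap is simple enough to sketch directly. This version uses words as a rough proxy for tokens; a production pipeline would count real model tokens with a tokenizer:

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks with overlapping boundaries.

    chunk_size and overlap are in words here, as a stand-in for tokens.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The `overlap` parameter implements the 10-20% boundary overlap described above: each chunk repeats the tail of the previous one, so a sentence straddling a boundary appears whole in at least one chunk.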
Embedding Models
Embedding models convert text chunks into vector representations. Leading options include OpenAI's text-embedding-3-large, Cohere's embed-english-v3.0, and open-source alternatives like bge-large or e5-large.
For enterprise use, consider embedding dimensions (higher = more precise but more storage), multilingual support, and whether you need the model running locally for data privacy.
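Whichever model you pick, retrieval compares its vectors the same way, almost always with cosine similarity. A minimal reference implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the standard comparison for text embeddings.

    Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In practice your vector database computes this for you; having it explicit is mainly useful for debugging and for understanding what "similarity score" means in retrieval logs.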
Vector Databases
Your vector database stores embeddings and performs similarity search. Top options include Qdrant (high performance, Rust-based), Pinecone (fully managed, easy to start), pgvector (stays in Postgres), and Weaviate (hybrid search).
For most enterprise deployments, we recommend Qdrant for its performance characteristics and flexible deployment options (cloud or self-hosted). If you're already invested in Postgres, pgvector reduces infrastructure complexity.
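Conceptually, every one of these databases answers the same query: "which stored vectors are closest to this one?" The brute-force version below shows exactly that operation; what the real systems add is approximate indexing (HNSW and similar) so it stays fast at millions of vectors:

```python
import math

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Brute-force nearest-neighbour search over an id -> vector mapping."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    ranked = sorted(index.items(), key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

Brute force is also a legitimate baseline: for corpora under a few hundred thousand vectors, exact search like this can be fast enough and sidesteps the recall trade-offs of approximate indexes.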
Retrieval Optimisation
Basic similarity search is rarely enough for production quality. Layer these techniques for better results:
- Hybrid search: combine vector similarity with keyword (BM25) search. This catches exact matches that embeddings might miss.
- Re-ranking: use a cross-encoder model to re-score the top 20-50 results for relevance.
- Query expansion: rephrase the user's question into multiple search queries to improve recall.
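Hybrid search needs a way to merge the vector and keyword result lists. Reciprocal rank fusion (RRF) is a common choice because it works on ranks alone, so the two systems' incompatible score scales never need reconciling. A compact sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. vector search + BM25) with RRF.

    Each document earns 1 / (k + rank) per list it appears in; k=60 is the
    constant commonly used in the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists float to the top; documents found by only one retriever still survive into the fused list, which is what makes hybrid search catch the exact matches embeddings miss.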
Generation & Prompting
Structure your generation prompt carefully. Include: system instructions, retrieved context with source metadata, the user's question, and output formatting requirements. Instruct the model to cite sources and to say “I don't know” when the context doesn't contain the answer.
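A prompt-assembly helper makes those requirements concrete. The chunk schema (`source` and `text` keys) and the wording are illustrative choices, not a fixed standard:

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a generation prompt with numbered, citable context blocks."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources as [1], [2], ...\n"
        'If the context does not contain the answer, say "I don\'t know".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Numbering the context blocks gives the model a stable handle for citations, and keeping the source metadata next to each block lets you map `[2]` in the answer back to a document and page.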
Evaluation & Testing
Build an evaluation pipeline from day one. Create a test set of 50-100 question-answer pairs covering your most important queries. Measure retrieval precision, answer faithfulness, and answer relevance. Tools like RAGAS automate much of this.
Run evaluations automatically on every pipeline change. Track metrics over time to catch regressions. A/B test different chunking strategies, retrieval methods, and prompts.
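Retrieval precision, at least, is simple enough to compute without any framework. A sketch of precision@k averaged over a test set, where `retrieve_fn` stands in for whatever your pipeline exposes:

```python
def precision_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in results[:k] if doc_id in relevant) / k

def evaluate_retrieval(test_set, retrieve_fn, k: int = 5) -> float:
    """Average precision@k over (question, relevant_ids) pairs."""
    scores = [
        precision_at_k(retrieve_fn(question), relevant, k)
        for question, relevant in test_set
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Logging this single number on every pipeline change is often enough to catch a bad chunking or embedding swap before it reaches users; faithfulness and relevance metrics need an LLM judge on top.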
Production Considerations
In production, add: caching (avoid re-embedding identical queries), rate limiting, usage tracking, error handling with fallbacks, and monitoring dashboards. Set up alerts for retrieval failures, low-confidence answers, and latency spikes.
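The query-embedding cache is a good example of a cheap production win. One possible shape, keyed on a hash of the normalised query so trivially re-spaced duplicates also hit the cache (the `embed_fn` hook is a placeholder for your real embedding call):

```python
import hashlib

class EmbeddingCache:
    """Cache query embeddings by a hash of the normalised query text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # your real embedding call plugs in here
        self._cache: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        normalised = " ".join(text.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]
```

The `hits`/`misses` counters feed directly into the monitoring dashboards mentioned above; a low hit rate tells you the cache key is too strict or query traffic is genuinely diverse.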
Plan for document updates — you need a pipeline that re-indexes changed documents without re-processing the entire corpus. Implement versioning so you can roll back if a re-index introduces regressions.
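Incremental re-indexing usually hinges on recording a content hash per document at index time, then diffing against the live corpus. A sketch of that change-detection step:

```python
import hashlib

def detect_changes(current_docs: dict[str, str],
                   indexed_hashes: dict[str, str]):
    """Diff live documents against hashes recorded at index time.

    Returns (to_index, unchanged, removed): new or modified doc ids,
    untouched doc ids, and ids whose source documents have disappeared.
    """
    to_index, unchanged = [], []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if indexed_hashes.get(doc_id) == digest:
            unchanged.append(doc_id)
        else:
            to_index.append(doc_id)
    removed = [d for d in indexed_hashes if d not in current_docs]
    return to_index, unchanged, removed
```

Only the `to_index` set is re-chunked and re-embedded, and `removed` drives deletion of stale vectors; storing the hash map per index version is also what makes roll-back tractable.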