GroveAI
Technical

RAG in Production: 10 Lessons from Real Deployments

Hard-won lessons from deploying retrieval-augmented generation systems in production. Covering chunking strategies, retrieval tuning, evaluation, monitoring, and the operational realities most tutorials skip.

1 March 2026 · 14 min read

Retrieval-augmented generation has become the default architecture for building knowledge-grounded AI applications. The concept is deceptively simple: retrieve relevant documents, stuff them into a prompt, and let a language model synthesise an answer. In practice, getting RAG to work reliably in production is significantly harder than any tutorial suggests.

After deploying RAG systems across legal, financial, and healthcare domains, here are ten lessons we wish we'd known from the start.

1. Chunking Strategy Is Everything

Most RAG tutorials default to fixed-size chunks of 500-1000 tokens with some overlap. This works for demos. It fails spectacularly on real documents. A 500-token chunk that splits a table in half, cuts a legal clause mid-sentence, or separates a heading from its content will produce garbage retrieval results no matter how good your embedding model is.

What actually works: Use document-aware chunking. Parse the structure of your documents first - headings, sections, tables, lists - and chunk along semantic boundaries. For PDFs, invest in a proper layout parser. For HTML, chunk by section. For contracts, chunk by clause. The extra engineering effort here pays for itself ten times over in retrieval quality.
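As a minimal sketch of the idea, here is a toy splitter that chunks a Markdown-style document along heading boundaries rather than at a fixed token count, so a heading always stays attached to the content beneath it. (Real document-aware chunking - layout parsers for PDFs, clause parsers for contracts - is considerably more involved; this only illustrates the principle.)

```python
import re

def chunk_by_heading(text: str) -> list[str]:
    """Split a Markdown-style document on headings so each chunk
    keeps a heading together with the content beneath it."""
    chunks, current = [], []
    for line in text.splitlines():
        # Start a new chunk whenever a heading line appears.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Terms\nPayment due in 30 days.\n## Penalties\nLate fees apply."
print(chunk_by_heading(doc))
```

The same boundary-first approach generalises: swap the heading regex for whatever structural marker your corpus has (clause numbers, section tags, table boundaries).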

2. Your Embedding Model Matters More Than Your LLM

Teams spend weeks evaluating GPT-4 versus Claude versus Gemini for generation, then use whatever embedding model was in the first tutorial they followed. This is backwards. The quality ceiling of your RAG system is set by retrieval, not generation. If the right documents never reach the LLM, even the best model cannot produce a good answer.

Benchmark your embedding model against your actual queries and documents. Run retrieval evaluations before you even connect a generative model. We've seen cases where switching from a general-purpose embedding model to a domain-fine-tuned one improved answer quality by 40% without changing anything else in the pipeline.

3. Hybrid Search Beats Pure Vector Search

Pure vector similarity search struggles with exact matches - product codes, legal references, specific dates, acronyms. If a user asks about "clause 14.3(b)", a vector search might return clauses that are semantically similar but not the exact one requested.

Combine vector search with keyword search (BM25 or similar) using reciprocal rank fusion. This gives you the semantic understanding of embeddings with the precision of keyword matching. Most vector databases now support hybrid search natively - use it. In our deployments, hybrid search consistently outperforms pure vector search by 15-25% on retrieval accuracy.
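Reciprocal rank fusion itself is only a few lines. As an illustrative sketch (the document IDs below are made up), each document scores 1/(k + rank) in every result list it appears in, and the fused order is by total score:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one, scoring each
    document by the sum of 1 / (k + rank) across the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
bm25_hits   = ["doc_c", "doc_a", "doc_d"]   # keyword ranking
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Note that a document ranked well in both lists (doc_a) beats one ranked top in only one list, which is exactly the behaviour you want from hybrid search. The constant k=60 is the conventional default; it damps the advantage of a single first-place finish.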

4. Retrieval Evaluation Is Non-Negotiable

You cannot improve what you do not measure. Yet most teams deploy RAG systems without any systematic evaluation of retrieval quality. They eyeball a few queries, decide it "looks good", and ship it.

Build a retrieval evaluation set: 50-100 representative queries paired with the documents that should be retrieved. Measure recall@k and precision@k. Track these metrics over time. When you change your chunking strategy, embedding model, or index configuration, re-run the evaluation. This is the single most impactful practice for maintaining RAG quality in production.
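Both metrics are simple enough to compute inline. A minimal sketch, assuming an evaluation set of queries paired with the IDs of the documents that should be retrieved (the IDs below are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are actually relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

eval_set = [
    {"query": "payment terms", "relevant": {"doc_1", "doc_7"},
     "retrieved": ["doc_1", "doc_3", "doc_7", "doc_9"]},
]
for case in eval_set:
    r = recall_at_k(case["retrieved"], case["relevant"], k=3)
    p = precision_at_k(case["retrieved"], case["relevant"], k=3)
    print(case["query"], f"recall@3={r:.2f}", f"precision@3={p:.2f}")
```

Run this against every pipeline change and log the results; a drop in recall@k after a chunking tweak is a regression you want to catch before users do.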

5. Context Window Size Is Not a Substitute for Good Retrieval

"Just increase k and stuff more documents into the context" is tempting, especially with models that support 100K+ token windows. It's also a trap. Stuffing irrelevant documents into the context increases noise, raises costs, adds latency, and - counter- intuitively - often reduces answer quality. Models get distracted by irrelevant information, a phenomenon well-documented in the "lost in the middle" research.

The rule of thumb: Retrieve the minimum number of chunks needed to answer the query. Start with k=3-5 and only increase if your evaluation shows it helps. Use a reranker to ensure the most relevant chunks appear first. Quality of retrieved context always beats quantity.
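The rerank-then-trim step is structurally simple. A sketch of the pattern, with a toy word-overlap scorer standing in for a real cross-encoder reranker (which you would call the same way):

```python
def rerank_and_trim(query: str, chunks: list[str], scorer, top_n: int = 3) -> list[str]:
    """Order candidate chunks by a reranker's relevance score and
    keep only the top_n, so the prompt stays small and on-topic."""
    ranked = sorted(chunks, key=lambda c: scorer(query, c), reverse=True)
    return ranked[:top_n]

# Toy scorer standing in for a real cross-encoder: counts shared words.
def overlap_scorer(query: str, chunk: str) -> int:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["refund policy details", "payment terms are net 30",
          "office holiday schedule", "late payment penalties"]
print(rerank_and_trim("payment terms", chunks, overlap_scorer, top_n=2))
```

Retrieve a wider candidate set (say, k=20), rerank, then pass only the top few chunks to the model - the reranker absorbs the recall pressure so the prompt doesn't have to.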

6. Metadata Filters Are Your Secret Weapon

Raw semantic search across your entire corpus is rarely what you want. In practice, users are usually asking about a specific document, time period, department, or category. If you can narrow the search space before running vector similarity, you dramatically improve both speed and accuracy.

Invest in rich metadata extraction during ingestion: document type, date, author, department, version, and any domain-specific attributes. Then expose these as filters in your retrieval pipeline. A query like "what are the payment terms in the Acme contract?" should filter to Acme-related documents before searching, not search across every document and hope the right one floats to the top.
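To make the order of operations concrete, here is a toy pre-filtered search over an in-memory index: metadata filters narrow the candidate set first, and only the survivors are scored by vector similarity. (The index structure and field names are invented for illustration; a production vector database does this natively.)

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filtered_search(query_vec, index, filters: dict, top_k: int = 5):
    """Apply metadata filters first, then rank survivors by similarity."""
    candidates = [item for item in index
                  if all(item["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return candidates[:top_k]

index = [
    {"id": "acme_msa",   "vec": [1.0, 0.0], "meta": {"counterparty": "Acme"}},
    {"id": "globex_msa", "vec": [0.9, 0.1], "meta": {"counterparty": "Globex"}},
]
hits = filtered_search([1.0, 0.0], index, {"counterparty": "Acme"})
print([h["id"] for h in hits])
```

The Globex document never enters the similarity ranking at all, which is the point: a smaller, correctly scoped candidate set is both faster and harder to get wrong.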

7. Handle "I Don't Know" Gracefully

One of the most dangerous failure modes in RAG is confident hallucination. The system retrieves vaguely related documents and the LLM generates a plausible-sounding but incorrect answer. In regulated industries, this can have serious consequences.

Implement confidence thresholds. If the top retrieved documents have low similarity scores, or if the reranker scores them poorly, return "I don't have enough information to answer this confidently" rather than guessing. Users trust a system that admits its limitations far more than one that occasionally makes things up. We also include source citations with every answer so users can verify claims against the original documents.
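A minimal sketch of the abstention gate (the threshold value and score field are illustrative; calibrate the threshold against your own evaluation set):

```python
def answer_or_abstain(hits: list[dict], min_score: float = 0.75):
    """Return None (abstain) when even the best hit scores below the
    threshold; otherwise return the high-confidence chunks as context."""
    if not hits or max(h["score"] for h in hits) < min_score:
        return None
    return [h["text"] for h in hits if h["score"] >= min_score]

weak = [{"text": "vaguely related memo", "score": 0.41}]
strong = [{"text": "Clause 14.3(b): payment due net 30", "score": 0.91},
          {"text": "unrelated memo", "score": 0.32}]

print(answer_or_abstain(weak))    # None: surface "I don't know" instead
print(answer_or_abstain(strong))  # only the confident chunk reaches the LLM
```

When the gate returns None, respond with the honest "I don't have enough information" message before any generation happens; the model never sees weak context it could confabulate from.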

8. Ingestion Pipelines Need Production Engineering

The data ingestion pipeline - parsing documents, chunking, embedding, and indexing - is often treated as a one-time script. In reality, it's a critical production system. Documents change, new ones arrive, old ones are retracted. Your pipeline needs to handle incremental updates, deduplication, versioning, and failure recovery.

Build your ingestion pipeline with the same rigour you'd apply to any data pipeline: idempotent operations, dead-letter queues for failed documents, monitoring for ingestion lag, and the ability to re-index from scratch when needed. We've had deployments where a single malformed PDF crashed the ingestion pipeline and silently stopped indexing new documents for days before anyone noticed. Monitoring and alerting are essential.
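The dead-letter pattern that prevents that failure mode fits in a few lines. A sketch with stub parse/index functions (real pipelines would persist the dead-letter queue and emit an alert on growth):

```python
def ingest(documents, parse, embed_and_index, dead_letter: list):
    """Process each document independently; one bad file goes to the
    dead-letter queue instead of halting the whole pipeline."""
    for doc in documents:
        try:
            chunks = parse(doc)
            embed_and_index(doc["id"], chunks)
        except Exception as exc:
            # Quarantine the failure for later inspection and move on.
            dead_letter.append({"id": doc["id"], "error": str(exc)})

dlq = []
docs = [{"id": "a.pdf", "body": "fine"}, {"id": "b.pdf", "body": None}]
ingest(docs,
       parse=lambda d: d["body"].split(),          # fails on the None body
       embed_and_index=lambda doc_id, chunks: None,
       dead_letter=dlq)
print(dlq)  # the malformed document is quarantined, not fatal
```

Pair this with an alert on dead-letter queue depth and on ingestion lag, and the "silently stopped indexing for days" scenario becomes a page within minutes.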

9. Monitor Everything in Production

RAG systems degrade silently. Your embedding model drifts as query patterns change. New document types break your chunking logic. Index performance degrades as the corpus grows. Without monitoring, you only discover these issues when users start complaining.

At a minimum, track: retrieval latency (p50, p95, p99), retrieval relevance scores, generation latency, token usage, error rates, and user feedback signals (thumbs up/down, query reformulations). Set up alerts for anomalies. We run weekly retrieval evaluations against our benchmark set to catch quality degradation early. The cost of this monitoring is trivial compared to the cost of serving bad answers.
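Even the percentile tracking needs no special tooling to start with. A nearest-rank sketch over a window of latency samples (the sample values are made up):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

latencies_ms = [42, 38, 55, 120, 47, 41, 300, 44, 46, 43]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Note how one 300 ms outlier leaves the p50 untouched but dominates the tail percentiles - which is exactly why you alert on p95/p99 rather than the mean.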

10. Start Simple, Iterate Based on Data

It's tempting to build the most sophisticated RAG architecture from the start: multi-stage retrieval, query decomposition, HyDE, agentic RAG with tool use. Resist this temptation. Start with the simplest architecture that could work - basic chunking, a single embedding model, straightforward retrieval - and measure its performance against your evaluation set.

Then improve iteratively based on data. Analyse your failure cases. Is the problem retrieval? Improve chunking or switch embedding models. Is it generation? Improve your system prompt or add few-shot examples. Is it a specific query type? Add a specialised handler. Every improvement should be backed by a measurable gain on your evaluation set. This disciplined approach delivers better results than architectural complexity for its own sake.


Building a RAG system that actually works in production takes more than a weekend tutorial. If you're deploying RAG for your business and want to get it right the first time, book a strategy call and we'll walk through your specific requirements.
