The open-source AI model landscape has matured dramatically. In 2024, running a capable language model locally required serious GPU investment and deep ML expertise. In 2026, it's a solved problem - if you make the right choices.
This guide is opinionated. We've deployed local models for healthcare, legal, and financial services clients. Here's what we've learned.
When to Go Local
Cloud APIs (Claude, GPT-4o, Gemini) are better for most use cases. They're faster to deploy, continuously improving, and require zero infrastructure management. Go local only when you have a genuine reason:
- Data sovereignty: Patient records, legal privilege, classified data, financial PII
- Regulatory compliance: GDPR Article 44 restrictions, HIPAA, SOX
- Air-gapped environments: Military, government, high-security facilities
- Cost at scale: High-volume inference where API costs exceed infrastructure costs
- Latency requirements: Sub-100ms inference for real-time applications
If none of these apply, use cloud APIs. Seriously. The operational overhead of running your own models is not trivial.
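To sanity-check the cost-at-scale argument, a back-of-envelope break-even calculation helps. This is a minimal sketch; the prices and the $8,000/month node cost below are illustrative assumptions, not quotes.

```python
# Back-of-envelope break-even: monthly API spend vs. self-hosted GPU cost.
# All prices are illustrative assumptions - plug in your own numbers.

def monthly_api_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Blended input+output API cost for a month of traffic."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def breakeven_tokens(gpu_monthly_usd: float, usd_per_million_tokens: float) -> int:
    """Monthly token volume at which self-hosting matches API spend."""
    return int(gpu_monthly_usd / usd_per_million_tokens * 1_000_000)

# Example: assume $8,000/month for a 2-GPU node (amortisation + power + ops)
# against a blended API price of $5 per million tokens.
print(monthly_api_cost(2_000_000_000, 5.0))  # 2B tokens/month -> 10000.0 USD
print(breakeven_tokens(8000, 5.0))           # -> 1_600_000_000 tokens/month
```

Below roughly a billion tokens a month at these assumed prices, the API wins on cost alone, before you count the operational overhead.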
Model Selection: Our Recommendations
As of early 2026, here's our honest assessment of the top open models:
Tier 1: General Purpose
- Llama 3.3 70B: The default choice. Excellent reasoning, strong instruction following, huge ecosystem. If you're unsure, start here.
- Qwen 2.5 72B: Surprisingly good. Outperforms Llama on multilingual tasks and structured output. Our pick for non-English deployments.
- Mistral Large: Strong on European languages and code. Good balance of quality and efficiency.
Tier 2: Efficient / Edge
- Llama 3.1 8B: Best small model. Runs on a single GPU. Good enough for classification, extraction, and simple generation.
- Phi-4: Microsoft's efficient model. Punches above its weight on reasoning. Good for constrained hardware.
- Gemma 2 9B: Google's efficient model. Excellent for summarisation and factual tasks.
Tier 3: Specialised
- DeepSeek Coder V2: Best for code-heavy use cases. We use it for automated code review and generation pipelines.
- BioMistral / Med-PaLM alternatives: For healthcare-specific terminology and clinical reasoning. Always validate with domain experts.
Infrastructure: What You Actually Need
Forget the marketing specs. Here's what we deploy in production:
For 70B-class models:
- 2x NVIDIA A100 80GB or 2x H100 (for full precision)
- 4x NVIDIA A6000 48GB (for quantised models - our preferred budget option)
- 128GB system RAM minimum
- NVMe storage for model weights (models are 40-140GB)
For 8B-class models:
- 1x NVIDIA RTX 4090 24GB (consumer hardware, seriously)
- 1x NVIDIA A10 24GB (datacentre option)
- 32GB system RAM
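The specs above follow from simple arithmetic: weights take (parameters × bits per weight / 8) bytes, plus headroom for KV cache and activations. Here's a rough planning heuristic we find useful; the 20% overhead figure is an assumption that varies with batch size and context length.

```python
def model_vram_gb(params_billions: float, bits_per_weight: int,
                  overhead: float = 1.2) -> float:
    """Rough VRAM to hold model weights plus ~20% for KV cache and
    activations. A planning heuristic, not a guarantee."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return round(weight_gb * overhead, 1)

print(model_vram_gb(70, 16))  # fp16 70B: 168.0 GB -> 2x A100/H100 80GB
print(model_vram_gb(70, 4))   # 4-bit 70B: 42.0 GB -> fits quantised on 48GB cards
print(model_vram_gb(8, 16))   # fp16 8B: 19.2 GB -> fits a 24GB RTX 4090
```

This is why quantisation changes the hardware conversation entirely: a 4-bit 70B model fits where a full-precision one needs multiple datacentre GPUs.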
Inference Servers
The inference server is the most critical infrastructure choice. Our ranking:
- vLLM: Our default. PagedAttention gives the best throughput. Production-ready, well-maintained, excellent batching.
- TGI (Text Generation Inference): Hugging Face's server. Good alternative with nice built-in features. Slightly lower throughput than vLLM.
- Ollama: Perfect for development and small-scale deployment. Not our first choice for high-throughput production, but getting better fast.
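One practical upside of all three servers: each exposes an OpenAI-compatible chat completions endpoint, so your client code stays the same when you swap servers. A minimal stdlib-only client sketch, assuming a server on localhost; the model name and port are placeholders:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature for extraction-style tasks
        "max_tokens": 512,
    }

def chat(base_url: str, model: str, prompt: str) -> dict:
    """POST to /v1/chat/completions (served by vLLM, TGI, or Ollama)
    and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with a server already running on the assumed port):
# reply = chat("http://localhost:8000", "llama-3.3-70b", "Summarise this ticket...")
```

Keeping the client this thin means switching from Ollama in development to vLLM in production is a one-line URL change.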
RAG: The Missing Piece
A local model without access to your data is just a generic chatbot. RAG (Retrieval-Augmented Generation) is what makes local AI useful for enterprise:
- Document ingestion: Parse PDFs, Word docs, emails, databases into chunks
- Embedding: Convert chunks to vectors using a local embedding model (we use bge-large or e5-mistral)
- Vector storage: Qdrant or ChromaDB for similarity search. PostgreSQL with pgvector for simpler setups.
- Retrieval: Hybrid search (semantic + keyword) with re-ranking for best results
- Generation: Feed relevant context to the LLM with your query
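The pipeline above can be sketched in a few dozen lines. This is a deliberately minimal illustration: the keyword-overlap scorer stands in for real embedding similarity (in production we'd use bge-large vectors in Qdrant), and the chunk sizes are placeholder values you'd tune per corpus.

```python
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size character chunks with overlap, so facts that span a
    boundary appear intact in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def keyword_score(query: str, doc: str) -> float:
    """Stand-in for semantic similarity: fraction of query terms present
    in the chunk. Real pipelines combine this with embedding cosine
    similarity (hybrid search) and a re-ranker."""
    terms = set(query.lower().split())
    hits = sum(1 for t in terms if t in doc.lower())
    return hits / max(len(terms), 1)

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Score every chunk and return the top-k as context for the LLM."""
    ranked = sorted(chunks, key=lambda c: keyword_score(query, c), reverse=True)
    return ranked[:k]

# Usage: chunk a document, retrieve context, then prepend it to the prompt.
docs = chunk("The invoice approval limit is 5000 EUR. " * 50)
context = retrieve("invoice approval limit", docs, k=2)
print(len(context))  # -> 2
```

Swapping `keyword_score` for a proper embedding-plus-reranker pipeline is where the real quality gains live, but the control flow stays exactly this shape.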
The key insight: RAG quality depends more on your chunking strategy and retrieval pipeline than on the model. We've seen 8B models with great RAG outperform 70B models with poor retrieval.
Production Checklist
Before going live with a local LLM deployment, ensure you have:
- Load testing with realistic concurrent user counts
- Model versioning and rollback procedures
- Monitoring dashboards (latency, throughput, error rates, GPU utilisation)
- Cost tracking (compute costs vs equivalent API costs)
- Security hardening (API authentication, rate limiting, input sanitisation)
- Backup and disaster recovery for model weights and vector stores
- Update procedures for model upgrades (new versions release frequently)
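For the monitoring item specifically, the numbers worth alerting on are tail latencies, not averages. A minimal sketch of a percentile summary using only the standard library; the percentile choices are our convention, not a requirement:

```python
import statistics

def latency_report(samples_ms: list[float]) -> dict:
    """Summarise request latencies as p50/p95/p99 plus worst case -
    the shape of metric a dashboard alert should key on."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }

# Usage: feed per-request latencies collected over a window.
report = latency_report([float(x) for x in range(1, 101)])
print(report["max"])  # -> 100.0
```

An average hides the slow tail that users actually feel; p95 and p99 are what tell you a quantised model or an overloaded batch scheduler is degrading.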
Need help deploying a local AI stack? We've done it for healthcare, legal, and financial services clients. Book a strategy call and we'll assess your infrastructure requirements.