Running AI models locally has gone from "hobbyist dream" to "serious business option" in under two years. Open-source models like Llama, Mistral, and Qwen now rival cloud APIs for many tasks — and when you run them on your own hardware, your data never leaves the building.
But the first question everyone asks is the same: what hardware do I actually need?
This guide gives you a straight answer. We cover the four build tiers that matter, the VRAM maths behind them, and a free interactive calculator so you can spec a build for your exact use case.
Why Run AI Locally?
Before we talk hardware, let's be clear about why you'd bother. Cloud APIs like Claude and GPT-4o are excellent — and for many businesses, they're the right choice. Local deployment makes sense when:
- Data privacy is non-negotiable. Client data, patient records, legal privilege — some data simply cannot leave your network.
- You want predictable costs. No per-token billing. Once you own the hardware, inference is essentially free.
- You need offline capability. Air-gapped environments, unreliable connectivity, or latency-sensitive applications.
- You're processing high volumes. At scale, self-hosted inference is dramatically cheaper than API calls.
If none of these apply, cloud APIs are probably your best bet. If one or more does, read on.
The Only Number That Matters: VRAM
Everything in local AI comes back to VRAM — the memory on your GPU. It determines which models you can run, at what quality, and how fast.
Here's the rough formula:
VRAM needed = Model parameters × Bytes per parameter
              + KV cache (context × layers × users)

Quantisation is what makes local AI practical. Instead of storing each parameter as a 16-bit float (2 bytes), you can compress to 8-bit, 5-bit, 4-bit, or even 3-bit — trading a small amount of quality for a massive reduction in VRAM.
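The arithmetic above is easy to sketch in code. Here is a rough estimator; the bytes-per-parameter values are approximate effective sizes for common quantisation levels, and the flat KV-cache term is a simplifying assumption, not an exact figure for any particular runtime:

```python
# Rough VRAM estimator for LLM inference.
# Bytes-per-parameter values are approximate effective sizes for
# common quantisation levels; real model files vary slightly.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8": 1.0,
    "q5": 0.625,
    "q4": 0.5,
}

def estimate_vram_gb(params_billions: float, quant: str = "q4",
                     kv_cache_gb: float = 1.0) -> float:
    """Return an approximate VRAM requirement in GB.

    kv_cache_gb is a placeholder for the context-dependent KV cache;
    in practice it grows with context length, layer count, and the
    number of concurrent users.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return round(weights_gb + kv_cache_gb, 1)

# Weights-only figures, matching the quick-reference table below:
print(estimate_vram_gb(7, "q4", kv_cache_gb=0))    # 3.5
print(estimate_vram_gb(70, "q5", kv_cache_gb=0))   # 43.8
```

Treat the output as a floor: a long context window or several simultaneous users can add gigabytes on top of the weights.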
VRAM Quick Reference
| Model | FP16 | Q8 | Q5 | Q4 |
|---|---|---|---|---|
| 7B (Mistral 7B, Qwen 2.5 7B) | 14 GB | 7 GB | 4.4 GB | 3.5 GB |
| 13B (CodeLlama 13B) | 26 GB | 13 GB | 8.1 GB | 6.5 GB |
| 30B (Qwen 2.5 32B) | 60 GB | 30 GB | 18.8 GB | 15 GB |
| 70B (Llama 3.3 70B) | 140 GB | 70 GB | 43.8 GB | 35 GB |
| 120B+ (Falcon 180B, etc.) | 240+ GB | 120+ GB | 75+ GB | 60+ GB |
These are base estimates. Longer context windows and concurrent users add to the requirement. Use our hardware calculator for a precise figure.
The Four Build Tiers
Tier 1: Entry (£500 - £800)
GPU: NVIDIA RTX 4060 (8 GB) or RTX 3060 12 GB
CPU: AMD Ryzen 5 or Intel i5
RAM: 32 GB DDR5
Storage: 1 TB NVMe SSD
What you can run: 7B-8B models at Q4/Q5 quantisation. That gives you a solid local chatbot, code assistant, or document summariser. Models like Llama 3.1 8B and Mistral 7B run comfortably at usable speeds.
Best for: Individual developers, small experiments, proof-of-concept testing.
Tier 2: Enthusiast (£1,200 - £2,000)
GPU: NVIDIA RTX 4070 Ti Super (16 GB) or RTX 4080 (16 GB)
CPU: AMD Ryzen 7 or Intel i7
RAM: 64 GB DDR5
Storage: 2 TB NVMe SSD
What you can run: 13B models at full quality, 30B models at Q4. This is the sweet spot for serious local AI work. Qwen 2.5 32B at Q4 is genuinely impressive on this hardware.
Best for: Small teams, production-ready single-user deployments, development workstations.
Tier 3: Prosumer (£3,000 - £5,000)
GPU: NVIDIA RTX 4090 (24 GB) or 2× RTX 4070 Ti Super
CPU: AMD Ryzen 9 or Intel i9
RAM: 128 GB DDR5
Storage: 4 TB NVMe SSD
PSU: 1000W+
What you can run: 70B models at Q4/Q5 quantisation. Note that a 70B model at Q4 needs roughly 35 GB, so on a single 24 GB 4090 some layers offload to system RAM, which costs speed — but the output quality is genuinely competitive with cloud APIs for many tasks. You can also run multiple smaller models simultaneously.
Best for: Serious business deployments, multi-user inference, production workloads.
Tier 4: Mac Studio (£2,000 - £5,000+)
Hardware: Apple Mac Studio with M2 Ultra (up to 192 GB unified memory)
Why it's different: Apple's unified memory architecture means the GPU can access all system RAM. A 192 GB Mac Studio can run 70B models at high-quality quantisation (Q8, or even unquantised FP16) without any GPU memory bottleneck.
Trade-offs: Slower tokens-per-second than NVIDIA GPUs at the same price point, but much simpler setup, lower power consumption, and silent operation. For a shared office environment, this matters.
Best for: Teams that value simplicity, quiet operation, and macOS integration. Excellent for running larger models where VRAM is the bottleneck.
The Software Stack
Hardware is only half the story. Here's what we recommend for running models:
- Ollama: The easiest way to get started. One-command model downloads, built-in API server. Start here.
- llama.cpp: The engine behind most local inference. Maximum performance and flexibility for advanced users.
- vLLM: Production-grade serving with batching and OpenAI-compatible API. Best for multi-user deployments.
- Open WebUI: Beautiful chat interface for Ollama. Gives you a ChatGPT-like experience with local models.
- LM Studio: GUI-based model manager for Mac and Windows. Great for non-technical users.
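Once Ollama is running, it exposes a local HTTP API on port 11434. A minimal non-streaming request using only the Python standard library — the `llama3.2` model name is an example; substitute whatever you have pulled:

```python
import json
import urllib.request

def build_request(model: str, prompt: str,
                  host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama server and return the reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires `ollama serve` running and the model pulled,
    # e.g. `ollama pull llama3.2`.
    print(generate("llama3.2", "Summarise VRAM in one sentence."))
```

Because the endpoint is OpenAI-style JSON over localhost, the same pattern works from any language — no SDK required.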
Practical Tips
- Start with Q4 quantisation. The quality loss is minimal for most business tasks, and the VRAM savings are huge.
- Buy more VRAM, not a faster GPU. For LLM inference, memory capacity matters more than compute speed.
- Consider used enterprise GPUs. Datacentre cards like the NVIDIA A100 40 GB sometimes appear second-hand well below list price, though pricing fluctuates — compare current listings against new consumer cards with similar VRAM before committing.
- Don't forget cooling. GPUs under sustained AI inference load run hot. Ensure your case has adequate airflow.
- Budget for power. A dual-GPU setup can draw 600-800W under load. Factor electricity costs into your ROI calculation.
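The power point is worth quantifying. A quick running-cost sketch — the 700 W draw, 8-hour duty cycle, and £0.28/kWh tariff are illustrative assumptions; substitute your own figures:

```python
def annual_power_cost(watts: float, hours_per_day: float,
                      price_per_kwh: float) -> float:
    """Annual electricity cost for a rig at a given sustained draw."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return round(kwh_per_year * price_per_kwh, 2)

# Example: dual-GPU box drawing 700 W for 8 h/day at £0.28 per kWh.
print(annual_power_cost(700, 8, 0.28))  # 572.32
```

A few hundred pounds a year is usually still far cheaper than equivalent API usage at volume, but it belongs in the ROI calculation rather than as a surprise on the first bill.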
Try Our Hardware Calculator
We've built a free AI Hardware Calculator that lets you input the model size, quantisation level, context length, and number of users — and get a precise VRAM estimate, GPU recommendation, and full build spec with estimated cost.
It's the fastest way to answer "what do I need?" for your specific use case.
When to Call Us
If you're planning a local AI deployment for your business, we can help with hardware specification, model selection, deployment, and ongoing support. We've deployed local AI infrastructure for healthcare, legal, and financial services clients across the UK.
Book a free strategy call and we'll help you spec the right build for your needs.