Quantisation
Quantisation is a technique that reduces the numerical precision of an AI model's parameters, typically from 32-bit floating point down to 8-bit or 4-bit integers. This dramatically shrinks model size and speeds up inference, usually with minimal impact on output quality.
Frequently asked questions
Does quantisation reduce model quality?
Quantisation does introduce some quality loss, but it is typically minimal. 8-bit quantisation is nearly indistinguishable from full precision on most tasks. 4-bit quantisation shows more degradation but remains suitable for many practical applications. The trade-off between size and quality is well understood and manageable.
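The size/quality trade-off can be seen directly in the rounding error. The sketch below (an illustrative toy, using the same simple symmetric scheme rather than any production quantiser) compares the worst-case round-trip error at 8 bits and 4 bits: halving the bit width leaves far fewer integer levels, so each level must cover a much wider range of values.

```python
def max_quant_error(values, bits):
    """Worst-case round-trip error for symmetric linear quantisation."""
    qmax = 2 ** (bits - 1) - 1          # 127 levels at 8-bit, only 7 at 4-bit
    scale = max(abs(v) for v in values) / qmax
    return max(abs(round(v / scale) * scale - v) for v in values)

weights = [0.12, -0.53, 0.91, -0.07, 0.33]   # toy example values
err8 = max_quant_error(weights, 8)
err4 = max_quant_error(weights, 4)
# err4 is roughly an order of magnitude larger than err8, which is
# the arithmetic behind "8-bit is near-lossless, 4-bit degrades more".
```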
What hardware can run quantised models?
Quantised models can run on consumer GPUs, Apple Silicon Macs, and even CPUs. A 4-bit quantised 7B model requires roughly 4GB of RAM, making it accessible on most modern laptops. Larger quantised models (30B-70B) typically need GPUs with 16-48GB of VRAM.
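These RAM figures follow from simple arithmetic: parameter count times bits per parameter, divided by 8 to get bytes. A rough estimator (the 20% overhead factor for activations and the KV cache is an assumed illustrative value, not a measured constant):

```python
def model_memory_gb(n_params_billion, bits, overhead=1.2):
    """Rough RAM needed to hold a model's weights.

    overhead (assumed ~20%) covers activations and KV cache;
    real usage varies with context length and runtime.
    """
    weight_bytes = n_params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

# A 7B model: ~33.6 GB at 32-bit, but only ~4.2 GB at 4-bit,
# consistent with the "roughly 4GB" figure above.
```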
Which quantisation format should I use?
For GPU inference, GPTQ and AWQ are popular choices. For CPU or mixed CPU/GPU inference, GGUF (used by llama.cpp and Ollama) is the standard. The best choice depends on your hardware, your deployment tool, and whether you prioritise speed or quality.