GroveAI
Glossary

Quantisation

Quantisation is a technique that reduces the numerical precision of an AI model's parameters (for example, from 32-bit floating point down to 8-bit or 4-bit integers), dramatically decreasing model size and increasing inference speed with minimal impact on quality.

What is Quantisation?

Quantisation is the process of reducing the precision of the numbers used to represent a neural network's parameters. A standard model stores each parameter as a 32-bit floating-point number, which provides high precision but requires significant memory and compute. Quantisation converts these to lower-precision formats (16-bit, 8-bit, or even 4-bit), reducing the model's memory footprint proportionally.

The key insight behind quantisation is that neural networks are remarkably tolerant of reduced precision. A model that is four times smaller after 8-bit quantisation typically retains 95-99% of its original quality, which makes quantisation one of the most practical techniques for making large AI models accessible on consumer hardware.
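The core idea can be shown in a few lines. The sketch below implements affine (asymmetric) 8-bit quantisation of a single weight tensor; it is a minimal illustration, not a production scheme, and real libraries quantise per-channel with calibrated ranges.

```python
# Minimal sketch of affine (asymmetric) 8-bit quantisation.
# Each float weight is mapped to an integer in [-128, 127] that
# needs 1 byte of storage instead of 4.

def quantise_int8(weights):
    """Map float weights onto the integer range [-128, 127]."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255           # float step per integer step
    zero_point = round(-w_min / scale) - 128
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.62, -0.11, 0.0, 0.34, 0.97]   # toy weight values
q, scale, zp = quantise_int8(weights)
recovered = dequantise(q, scale, zp)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(q)         # integers stored in one byte each
print(max_err)   # reconstruction error is on the order of half the scale
```

The rounding error per weight is bounded by roughly half the scale, which is why modest precision reductions barely affect model behaviour.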

How Quantisation Works

There are several approaches to quantisation. Post-training quantisation (PTQ) converts a pre-trained model's weights to lower precision after training is complete; it is the simplest approach and requires no retraining. Quantisation-aware training (QAT) simulates reduced precision during the training process itself, often producing better results but requiring more effort.

Common quantisation formats include GPTQ (GPU-optimised quantisation), GGUF (used by llama.cpp for CPU and mixed inference), AWQ (activation-aware weight quantisation), and bitsandbytes (integrated with popular training frameworks). Each format makes different trade-offs in speed, quality, and hardware compatibility.

The quality impact varies by model and quantisation level. 8-bit quantisation is nearly lossless for most models. 4-bit quantisation introduces more noticeable degradation but is often acceptable for many practical applications. Below 4 bits, quality drops more sharply.
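The relationship between bit width and quality loss can be sketched numerically. Below is a toy PTQ routine using symmetric per-tensor rounding; real formats such as GPTQ, AWQ, and GGUF add calibration, grouping, and outlier handling on top of this basic idea.

```python
# Sketch: post-training quantisation (PTQ) at different bit widths.
# Symmetric per-tensor rounding for illustration only.

def ptq_symmetric(weights, bits):
    """Round weights onto a symmetric integer grid of the given width."""
    q_max = 2 ** (bits - 1) - 1             # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / q_max
    q = [round(w / scale) for w in weights]
    return [qi * scale for qi in q]         # dequantised approximation

weights = [0.8, -0.45, 0.12, -0.03, 0.3, 0.66]
mses = {}
for bits in (8, 4, 2):
    approx = ptq_symmetric(weights, bits)
    mses[bits] = sum((w - a) ** 2 for w, a in zip(weights, approx)) / len(weights)
    print(bits, mses[bits])                 # error grows as precision drops
```

Running this shows the pattern described above: the 8-bit error is tiny, the 4-bit error is noticeably larger, and 2-bit rounding is coarser by orders of magnitude.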

Why Quantisation Matters for Business

Quantisation is the enabling technology for running powerful AI models on affordable hardware. A 70-billion-parameter model that normally requires multiple expensive GPUs can run on a single GPU when quantised to 4-bit precision. This dramatically changes the economics of local AI deployment.

For organisations concerned about data privacy, quantisation makes it feasible to run capable models entirely on-premises without sending data to cloud APIs. For cost-conscious deployments, it reduces the GPU memory required, allowing smaller and cheaper hardware to serve the same models. Quantisation also improves inference speed: moving weights between memory and compute units is usually the bottleneck during inference, and smaller weights mean less data to move, so end users see faster response times.
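The memory arithmetic behind the 70B example is straightforward. This back-of-envelope sketch counts weight storage only; real deployments add KV-cache and activation overhead on top.

```python
# Back-of-envelope weight memory for a 70B-parameter model
# at common precision levels.

PARAMS = 70e9          # 70 billion parameters
GIB = 1024 ** 3        # bytes per GiB

sizes = {}
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    sizes[name] = PARAMS * bits / 8 / GIB
    print(f"{name}: {sizes[name]:.1f} GiB of weights")
```

At 32-bit the weights alone exceed 260 GiB, far beyond any single GPU, while at 4-bit they drop to roughly 33 GiB, which fits on one 40-48GB card.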

Practical Applications

Quantisation is widely used in local AI deployments, where organisations run open-source models like Llama or Mistral on their own hardware. Tools like llama.cpp, Ollama, and vLLM support quantised models natively, making deployment straightforward. Edge AI applications — running models on devices like phones, IoT devices, or embedded systems — rely heavily on quantisation to fit useful models into limited memory. In cloud deployments, quantisation reduces the number of GPUs required, lowering infrastructure costs. Many production systems use quantised models for initial responses and fall back to full-precision models only when higher accuracy is needed.


Frequently asked questions

Does quantisation reduce model quality?

Quantisation does introduce some quality loss, but it is typically minimal. 8-bit quantisation is nearly indistinguishable from full precision for most tasks. 4-bit quantisation shows more degradation but remains suitable for many practical applications. The trade-off between size and quality is well understood and manageable.

What hardware do I need to run quantised models?

Quantised models can run on consumer GPUs, Apple Silicon Macs, and even CPUs. A 4-bit quantised 7B model requires roughly 4GB of RAM, making it accessible on most modern laptops. Larger quantised models (30B-70B) typically need GPUs with 16-48GB of VRAM.
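These sizing figures follow from a simple rule of thumb, sketched below. The 20% overhead factor is an illustrative assumption; actual usage depends on context length and the inference engine.

```python
# Rule-of-thumb check: does a quantised model fit in a given memory budget?
# Assumes weights dominate, with ~20% headroom for KV cache and buffers
# (an illustrative assumption, not a guarantee).

def fits(params_billions, bits, mem_gib, overhead=1.2):
    weights_gib = params_billions * 1e9 * bits / 8 / 1024 ** 3
    return weights_gib * overhead <= mem_gib

print(fits(7, 4, 8))     # 4-bit 7B on an 8 GiB laptop  -> True
print(fits(70, 4, 24))   # 4-bit 70B on a 24 GiB GPU    -> False
```

A 4-bit 7B model needs about 3.3 GiB of weights plus headroom, matching the "roughly 4GB" figure above, while a 4-bit 70B model still overflows a single 24 GiB consumer GPU.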

Which quantisation format should I use?

For GPU inference, GPTQ and AWQ are popular choices. For CPU or mixed CPU/GPU inference, GGUF (used by llama.cpp and Ollama) is the standard. The best choice depends on your hardware, deployment tool, and whether you prioritise speed or quality.

Need help implementing this?

Our team can help you apply these concepts to your business. Book a free strategy call.