
Knowledge Distillation

Knowledge distillation is a technique that transfers the knowledge of a large, powerful AI model (the teacher) to a smaller, faster model (the student), enabling efficient deployment without a proportional loss in quality.

What is Knowledge Distillation?

Knowledge distillation is a model compression technique in which a smaller 'student' model is trained to mimic the behaviour of a larger 'teacher' model. Rather than training the student on raw data alone, it learns from the teacher's outputs, which contain richer information than simple labels.

The key insight is that the teacher's 'soft labels' (probability distributions over all possible outputs) carry more information than 'hard labels' (the single correct answer). For instance, when classifying an image of a cat, the teacher might output 80% cat, 15% dog, 5% other, revealing that cats and dogs are visually similar. This nuance helps the student learn more effectively.

Distillation can be applied at several levels: output-level (matching final predictions), feature-level (matching internal representations), and attention-level (matching attention patterns). It can also be combined with other compression techniques, such as quantisation and pruning, for even greater efficiency gains.
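To make the soft-label idea concrete, here is a minimal pure-Python sketch of output-level distillation: the teacher's logits are softened with a temperature, and the student is penalised by the KL divergence between the two distributions. The logits, function names, and temperature value are illustrative assumptions, not a specific library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens
    the distribution, exposing more of the teacher's 'dark knowledge'."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student outputs.
    Zero when the student exactly matches the teacher."""
    p = softmax(teacher_logits, temperature)  # soft labels from the teacher
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative logits over (cat, dog, other): the teacher is confident
# it sees a cat; the student has not yet learned that.
teacher = [4.0, 2.0, 0.5]
student = [2.5, 2.0, 1.0]
loss = distillation_loss(teacher, student)
```

In practice this KL term is usually mixed with the ordinary hard-label loss, and the temperature is tuned so the teacher's near-miss classes (dog, in the example above) contribute a useful training signal.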

Why Distillation Matters for Business

Distillation enables businesses to deploy high-quality AI models in resource-constrained environments. A distilled model might achieve 95% of the quality of a large model while being 10x faster and 50x cheaper to run. This makes AI viable for edge devices, real-time applications, and high-volume services.

The economics are compelling. Running a large language model for every customer query can be prohibitively expensive at scale. A distilled model trained to handle common queries can serve the majority of requests at a fraction of the cost, with the larger model reserved for complex cases.

Distillation is also used by AI providers to create model families at different price and capability points. Understanding this helps businesses select the right model tier: distilled models for routine tasks, and larger models for tasks requiring maximum capability.
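The "distilled model for common queries, large model for complex cases" pattern is often implemented as a confidence-based cascade. The sketch below shows the routing logic only; both model functions, the canned answers, and the 0.75 threshold are hypothetical stand-ins, not real APIs.

```python
def student_model(query):
    """Stand-in for a cheap distilled model: returns (answer, confidence)."""
    if "refund" in query.lower():
        return "Refunds are processed within 5 business days.", 0.92
    return "I'm not sure.", 0.30  # low confidence on unfamiliar queries

def teacher_model(query):
    """Stand-in for the large model: slower and costlier, but more capable."""
    return f"Detailed answer to: {query}", 0.99

def answer(query, threshold=0.75):
    """Serve from the distilled model when it is confident; otherwise
    escalate the query to the large model."""
    reply, confidence = student_model(query)
    if confidence >= threshold:
        return reply, "student"      # served at a fraction of the cost
    reply, _ = teacher_model(query)  # complex case: pay for the big model
    return reply, "teacher"
```

With this shape, the cost per query is dominated by how often the student's confidence clears the threshold, which is why distilling on the most frequent query types pays off first.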

FAQ


How much quality is lost in distillation?

Quality loss varies by task and compression ratio. A well-distilled model typically retains 90-98% of the teacher's performance. For specific, well-defined tasks, distilled models can sometimes match the teacher's quality because they learn to specialise.

Can I distil a commercial provider's model?

This depends on the provider's terms of service. Many providers prohibit using their model outputs to train competing models. However, some allow distillation for internal use or offer distillation as a service. Always check the licence terms.

How does distillation differ from quantisation?

Quantisation reduces the precision of an existing model's weights (simpler, smaller gains). Distillation creates an entirely new, smaller model (more complex, larger gains). They can be combined: distil first, then quantise the resulting model for maximum efficiency.
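To illustrate the difference in kind, quantisation is a purely numerical transformation of weights that already exist. A minimal sketch of symmetric per-tensor int8 quantisation, using illustrative weight values:

```python
def quantise_int8(weights):
    """Map float weights onto the int8 range [-127, 127] using a single
    per-tensor scale factor (symmetric quantisation)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]

weights = [0.51, -0.82, 0.03, 1.27]       # illustrative float weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)           # close to the originals
```

The model's architecture is untouched; only the storage precision changes. Distillation, by contrast, trains a new network, which is why the two techniques stack: distil to shrink the architecture, then quantise the result.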
