Inference
Inference is the process of using a trained AI model to generate predictions or outputs from new input data — it is the production phase where the model does its actual work for end users.
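The idea above can be sketched in a few lines of Python. This is a minimal, hypothetical stand-in for a real model: the weights are assumed to be already trained and frozen, and inference is just a forward pass over new input.

```python
# Minimal sketch: inference is a forward pass through fixed, already-trained
# parameters. No learning happens here; only the input changes per request.
def forward(x, weights, bias):
    # Weighted sum plus threshold: the "prediction" step.
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

# Hypothetical values standing in for parameters produced by training.
trained_weights = [0.8, -0.5, 0.3]
trained_bias = -0.1

# Each production request runs the same forward pass on fresh input.
prediction = forward([1.0, 0.2, 0.5], trained_weights, trained_bias)
```

Real models apply the same principle at vastly larger scale: billions of parameters instead of three, but still fixed weights applied to new inputs.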
What is Inference?
How Inference Works
Why Inference Matters for Business
Optimisation Strategies
Related Terms
Explore further
FAQ

What is the difference between training and inference?
Training is the process of teaching a model by adjusting its parameters on large datasets — this happens once or periodically. Inference is using the trained model to generate outputs from new inputs — this happens continuously in production. Training requires more compute but happens less frequently; inference is less intensive per query but runs at scale.
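The split can be illustrated with a toy one-parameter model. This is a hedged sketch, not a real training pipeline: `train` adjusts the parameter with gradient steps on made-up data (the part that happens once), while `infer` reuses the learned parameter unchanged for every request.

```python
# Toy illustration of the training/inference split with a single parameter.
def train(samples, lr=0.1, epochs=50):
    w = 0.0
    for _ in range(epochs):               # runs once (or periodically)
        for x, y in samples:
            w += lr * (y - w * x) * x     # gradient step on squared error
    return w

def infer(w, x):
    return w * x                          # runs continuously in production

w = train([(1.0, 2.0), (2.0, 4.0)])      # learn y ≈ 2x from toy data
pred = infer(w, 3.0)                      # no parameter updates at serve time
```

The asymmetry in the answer above shows up even here: `train` loops over the data many times, while `infer` is a single cheap computation repeated per query.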
Why is LLM inference expensive?
LLMs generate text one token at a time, and each token requires a forward pass through billions of parameters. For a long response, that means repeating billions of calculations hundreds of times. At scale, these costs add up quickly. Techniques like quantisation and caching help reduce them significantly.
Can inference run on a CPU?
Yes, particularly for smaller or quantised models. Frameworks like llama.cpp enable efficient CPU inference, and some models are specifically optimised for CPU deployment. GPUs remain faster for larger models, but CPU inference is a viable and cost-effective option for many use cases.
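Quantisation is a large part of what makes CPU inference practical. The sketch below shows the basic idea of 8-bit quantisation under simple assumptions (symmetric range, one scale per weight group): store weights as small integers plus a scale factor, then approximately recover them at compute time. Real frameworks use more sophisticated schemes, but the compression principle is the same.

```python
# Hedged sketch of 8-bit quantisation: store each weight as an int8 plus a
# shared float scale, cutting memory roughly 4x versus 32-bit floats.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127  # map the range onto int8
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.8, -0.5, 0.3, 0.01]
q, scale = quantize(weights)      # small integers + one scale factor
approx = dequantize(q, scale)     # close to the originals, much smaller
```

The reconstructed values differ from the originals by at most half the scale step, which is why well-quantised models lose little accuracy while fitting comfortably in CPU memory.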
Need help implementing this?
Our team can help you apply these concepts to your business. Book a free strategy call.