GroveAI
Glossary

Inference

Inference is the process of using a trained AI model to generate predictions or outputs from new input data — it is the production phase where the model does its actual work for end users.

What is Inference?

In machine learning, inference refers to the process of running a trained model on new data to produce outputs — whether that is generating text, classifying images, translating languages, or making predictions. If training is where a model learns, inference is where it applies what it has learned. Every time you interact with a chatbot, use AI-powered search, or receive a recommendation from an AI system, you are triggering an inference. The model processes your input through its learned parameters and generates a response. This is the phase that end users experience directly, making inference performance a critical concern for production AI systems.
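The training/inference split can be sketched with a toy model. This is a minimal illustration, not any real system: the weights below stand in for parameters already produced by a prior training phase, and the numbers are invented for the example.

```python
# Minimal sketch: inference is applying already-learned parameters to new input.
# The weights and bias stand in for the output of a training phase; the values
# are illustrative only.

def predict(features, weights, bias):
    """Run 'inference': one forward pass through a tiny linear classifier."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if score > 0 else 0  # binary classification output

# Parameters are fixed at training time; inference never changes them.
trained_weights = [0.8, -0.5, 0.3]
trained_bias = -0.1

label = predict([1.0, 0.2, 0.5], trained_weights, trained_bias)
```

Note that `predict` only reads the parameters; nothing is updated. That read-only, per-request application of learned parameters is what distinguishes inference from training.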

How Inference Works

During inference, input data is converted into the model's expected format (such as tokens for language models), passed through the model's neural network layers, and transformed into an output. For language models, this process happens one token at a time: the model generates each word or sub-word sequentially, with each generation step considering all previous tokens. Inference speed is measured in tokens per second for language models, or in latency (time to first response) and throughput (total requests handled per second) for systems more broadly. These metrics directly impact user experience and system cost.

Hardware plays a significant role in inference performance. GPUs and specialised AI accelerators (such as Google's TPUs or Apple's Neural Engine), often paired with optimised inference software like NVIDIA's TensorRT, are typically used to run inference at production speeds. The choice between cloud-hosted and on-premises inference infrastructure involves trade-offs between cost, latency, data privacy, and scalability.

Why Inference Matters for Business

Inference costs represent the largest ongoing expense for most AI deployments. While training is a one-time (or periodic) cost, inference runs continuously every time a user interacts with the system. For high-volume applications like customer support chatbots or search engines, inference costs can scale rapidly. Optimising inference — through techniques like quantisation, batching, caching, and model distillation — can dramatically reduce per-query costs without significantly impacting quality. The difference between an optimised and unoptimised inference pipeline can be an order of magnitude in cost. Inference latency also directly affects user experience. Users expect responses within seconds, and any delay impacts engagement and satisfaction. Balancing response quality against speed is a key design decision for production AI applications.
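The order-of-magnitude claim above can be made concrete with a back-of-envelope cost calculation. The GPU price and throughput figures below are illustrative assumptions, not vendor quotes.

```python
# Toy per-query cost estimate: GPU rental cost divided across the time each
# response occupies the hardware. All figures are illustrative assumptions.

def cost_per_query(gpu_dollars_per_hour, tokens_per_second, tokens_per_response):
    seconds_per_response = tokens_per_response / tokens_per_second
    return gpu_dollars_per_hour / 3600 * seconds_per_response

# Unoptimised pipeline: a $4/hr GPU serving 50 tokens/s.
baseline = cost_per_query(4.0, 50, 500)    # roughly $0.011 per response
# After quantisation and batching raise effective throughput 10x:
optimised = cost_per_query(4.0, 500, 500)  # roughly $0.0011 per response
```

At a million queries per day, that gap is the difference between about $11,000 and about $1,100 in daily serving cost, which is why inference optimisation dominates the economics of high-volume deployments.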

Optimisation Strategies

Several techniques are used to optimise inference in production. Quantisation reduces model precision from 32-bit to 16-bit, 8-bit, or even 4-bit numbers, dramatically reducing memory requirements and increasing speed with minimal quality loss. Model batching processes multiple requests simultaneously to maximise GPU utilisation. Caching stores responses to common queries so they can be returned instantly without running the model. Speculative decoding uses a smaller, faster model to draft responses that a larger model then verifies, combining speed with quality. Each technique offers different trade-offs and can be combined for maximum benefit.
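Of the techniques above, caching is the simplest to sketch. `run_model` below is a hypothetical stand-in for an expensive inference call; the point is that an identical repeated query skips the model entirely.

```python
# Sketch of response caching: identical queries are served from a lookup
# table instead of re-running the model. `run_model` is a toy stand-in for
# an expensive forward pass.

calls = 0

def run_model(prompt):
    global calls
    calls += 1                      # count expensive model invocations
    return f"answer to: {prompt}"

cache = {}

def cached_inference(prompt):
    if prompt not in cache:         # cache miss: pay for a real forward pass
        cache[prompt] = run_model(prompt)
    return cache[prompt]            # cache hit: returned instantly

first = cached_inference("What is inference?")
second = cached_inference("What is inference?")  # served from cache
```

Real deployments add eviction policies and, for generative models, semantic matching of near-identical prompts, but the core trade of memory for compute is the same.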

FAQ


What is the difference between training and inference?

Training is the process of teaching a model by adjusting its parameters on large datasets; it happens once or periodically. Inference is using the trained model to generate outputs from new inputs; it runs continuously in production. Training requires more compute but happens less frequently; inference is less intensive per query but runs at scale.

Why is LLM inference expensive?

LLMs generate text one token at a time, and each token requires a forward pass through billions of parameters. For long responses, this means billions of calculations repeated hundreds of times. At scale, these costs add up. Techniques like quantisation and caching help reduce costs significantly.

Can inference run on CPUs instead of GPUs?

Yes, particularly for smaller or quantised models. Frameworks like llama.cpp enable efficient CPU inference, and some models are specifically optimised for CPU deployment. GPUs remain faster for larger models, but CPU inference is a viable and cost-effective option for many use cases.

Need help implementing this?

Our team can help you apply these concepts to your business. Book a free strategy call.