Load Balancing
Load balancing distributes incoming AI requests across multiple model instances or servers to optimise resource utilisation, minimise latency, and ensure high availability.
Frequently asked questions
Which load balancing strategy works best for LLM workloads?
Least-connections or least-latency strategies typically work best, because LLM request processing times vary widely with prompt and output length. Round-robin can leave load uneven when some requests take much longer than others.
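As a rough sketch of the least-connections idea (the class and instance names here are illustrative, not any specific gateway's API), the balancer only needs to track in-flight requests per instance and route each new request to the least busy one:

```python
class LeastConnectionsBalancer:
    """Route each request to the instance with the fewest in-flight requests."""

    def __init__(self, instances):
        # In-flight request count per model instance.
        self.active = {instance: 0 for instance in instances}

    def acquire(self):
        # Pick the instance with the fewest active requests.
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1
        return instance

    def release(self, instance):
        # Call when the request (or stream) completes.
        self.active[instance] -= 1


balancer = LeastConnectionsBalancer(["gpu-node-1", "gpu-node-2"])
first = balancer.acquire()   # both idle, picks one
second = balancer.acquire()  # picks the other, still-idle node
```

Because counts only drop on `release`, a long-running request keeps its instance "busy" for its whole duration, which is exactly what round-robin fails to account for.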
How does load balancing handle streaming responses?
Streaming responses require persistent connections, so the load balancer must support WebSocket or Server-Sent Events (SSE) protocols. Once a streaming connection is established, it stays pinned to the same instance for the duration of the response.
Can requests be load balanced across multiple AI providers?
Yes. AI gateways and custom routing layers can distribute requests across providers such as OpenAI, Anthropic, and self-hosted models. This enables cost optimisation (routing to cheaper models when they suffice) and resilience (failing over between providers during outages).
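A hedged sketch of cross-provider failover: providers are tried in cost order and the router falls back on failure. The provider names and call functions below are placeholders, not real SDK clients:

```python
class ProviderFailoverRouter:
    """Try providers cheapest-first, falling back when one fails."""

    def __init__(self, providers):
        # providers: list of (name, call_fn) tuples, ordered by cost.
        self.providers = providers

    def complete(self, prompt):
        errors = []
        for name, call in self.providers:
            try:
                return name, call(prompt)
            except Exception as exc:
                # Record the failure and try the next provider.
                errors.append((name, repr(exc)))
        raise RuntimeError(f"all providers failed: {errors}")


def flaky_primary(prompt):
    # Stand-in for a cheap provider that is currently unavailable.
    raise TimeoutError("primary unavailable")

def backup(prompt):
    # Stand-in for a pricier but reliable fallback provider.
    return f"response to: {prompt}"

router = ProviderFailoverRouter([("cheap-llm", flaky_primary),
                                 ("fallback-llm", backup)])
name, text = router.complete("hello")  # primary fails, falls back
```

Production routers layer health checks and circuit breakers on top of this so a failing provider is skipped proactively rather than timed out on every request.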