Auto-scaling
Auto-scaling automatically adjusts the number of AI model instances or compute resources based on real-time demand, scaling up during peak traffic and down during quiet periods to optimise cost and performance.
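The core calculation behind scaling up during peaks and down during quiet periods can be sketched in a few lines. This is a minimal illustration, not any particular platform's implementation; the function name, capacity figure, and bounds are all assumptions chosen for the example.

```python
import math

def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Target instance count for the current load, clamped to configured bounds.
    (Illustrative only: real autoscalers also smooth the signal over time.)"""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))

# Scale up under peak traffic, down to the floor when quiet.
print(desired_instances(450, 100))  # peak: 5 instances
print(desired_instances(20, 100))   # quiet: 1 instance (the configured minimum)
```

The clamp to `min_instances` and `max_instances` is what keeps a quiet period from dropping capacity to nothing and a traffic spike from running up an unbounded bill.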
Frequently asked questions
How long does auto-scaling take to respond?
GPU-based services typically take 2-5 minutes to scale up due to instance provisioning and model loading time; pre-warmed instances can respond faster. CPU-based services scale in seconds. Plan for this scaling latency in your architecture.
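One way to plan for that latency is to hold spare capacity proportional to how fast traffic can grow during the scale-up window. A rough back-of-envelope sketch, with the growth rate and window purely illustrative:

```python
def headroom_fraction(growth_rate_per_min: float, scale_up_minutes: float) -> float:
    """Fraction of extra capacity to hold so that traffic growth during the
    scale-up window does not saturate the instances already running.

    growth_rate_per_min: expected relative traffic growth per minute (0.10 = 10%).
    scale_up_minutes: time for a new instance to become ready.
    """
    return growth_rate_per_min * scale_up_minutes

# With a 5-minute GPU scale-up and traffic growing 10% per minute,
# keep roughly 50% spare capacity to ride out the provisioning delay.
print(headroom_fraction(0.10, 5))
```

The same arithmetic explains why CPU-backed services, with scale-up measured in seconds, can run much closer to full utilisation.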
When is scale-to-zero appropriate?
Scale-to-zero eliminates costs during idle periods but introduces cold-start latency when the first request arrives. It suits internal tools with infrequent use; for customer-facing services, maintaining a minimum number of instances is usually preferred.
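The cost side of that tradeoff is simple to quantify. A hypothetical example (the hourly rate and idle hours are assumptions, not benchmarks):

```python
def idle_cost_saved(hourly_rate: float, idle_hours_per_day: float, days: int = 30) -> float:
    """Monthly spend avoided by scaling an idle instance to zero (illustrative)."""
    return hourly_rate * idle_hours_per_day * days

# A $2/hour GPU instance that sits idle 16 hours a day:
print(idle_cost_saved(2.0, 16))  # 960.0 saved per month, paid for with a
                                 # cold start on the first request each morning
```

Weighing that monthly saving against the cold-start penalty on the first request is what makes scale-to-zero attractive for internal tools but risky for customer-facing traffic.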
Which metrics should trigger scaling?
Common triggers include request queue depth (most responsive), GPU utilisation (resource-based), and response latency (user-experience-based). Combining multiple metrics provides the most reliable scaling behaviour.
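Combining those triggers can be as simple as scaling up when any one of them crosses its limit, so that load a single signal would miss (for example, a queue filling before GPU utilisation climbs) still fires. A minimal sketch; all threshold values here are illustrative assumptions:

```python
def should_scale_up(queue_depth: int,
                    gpu_utilisation: float,
                    p95_latency_ms: float,
                    *,
                    queue_limit: int = 20,
                    gpu_limit: float = 0.85,
                    latency_limit_ms: float = 500.0) -> bool:
    """Scale up when ANY trigger fires: queue depth (responsiveness),
    GPU utilisation (resources), or p95 latency (user experience)."""
    return (queue_depth > queue_limit
            or gpu_utilisation > gpu_limit
            or p95_latency_ms > latency_limit_ms)

print(should_scale_up(35, 0.60, 220))  # True: queue depth fired first
print(should_scale_up(5, 0.70, 300))   # False: all metrics within limits
```

Production autoscalers typically also smooth these signals and enforce cooldown periods so one noisy sample does not flap capacity up and down.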