Batch Size

Batch size is a training hyperparameter that determines how many data samples are processed before the model's weights are updated, affecting training speed, memory usage, and model quality.

What is Batch Size?

Batch size refers to the number of training examples used in one iteration of model weight updates during neural network training. Rather than updating weights after every single example (stochastic gradient descent) or after processing the entire dataset (full-batch gradient descent), most modern training uses mini-batches, processing a fixed number of examples before performing a weight update.

For example, with a training dataset of 10,000 examples and a batch size of 32, each epoch (one complete pass through the data) consists of approximately 313 weight updates, each based on the average gradient computed across 32 examples.

Batch size is a fundamental hyperparameter that affects multiple aspects of training. Larger batches provide more stable gradient estimates but require more memory and can sometimes lead to poorer generalisation. Smaller batches introduce more noise into the gradient estimates, which can actually help the model escape poor local minima and find better solutions.
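The mini-batch loop described above can be sketched as follows. This is a toy illustration only, assuming a simple linear model with mean-squared-error loss and NumPy; the function name `train_minibatch_sgd` is hypothetical, not a library API.

```python
import numpy as np

def train_minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=20, seed=0):
    """Toy mini-batch SGD for a linear model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of MSE loss, averaged over this batch
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad                      # one weight update per batch
    return w
```

Note that the inner loop performs one weight update per batch, so a smaller batch size means more (noisier) updates per epoch.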

Why Batch Size Matters for Business

Batch size directly impacts the cost, speed, and quality of model training. For organisations fine-tuning or training custom models, understanding this trade-off is essential for managing compute budgets and achieving good results.

Larger batch sizes make better use of GPU parallelism, potentially speeding up training. However, they require more GPU memory, which may necessitate more expensive hardware, and they can also require more careful tuning of the learning rate. Smaller batch sizes use less memory and can sometimes produce better-performing models, but may take longer to converge.

In practice, batch size is often determined by hardware constraints: teams choose the largest batch size that fits in available GPU memory, then adjust the learning rate accordingly. Techniques like gradient accumulation allow teams to simulate larger effective batch sizes on limited hardware by accumulating gradients across multiple forward passes before updating weights.
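One common heuristic for the learning-rate adjustment mentioned above is the linear scaling rule: scale the learning rate proportionally to the batch size. This is a rule of thumb, not something the glossary prescribes, and the function name here is hypothetical:

```python
def scaled_learning_rate(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: lr grows in proportion to batch size.

    A starting point, not a guarantee -- very large batches often also
    need warmup and per-model tuning.
    """
    return base_lr * (new_batch / base_batch)
```

For example, a model tuned with lr 0.1 at batch size 256 would start from lr 0.4 when the batch size is raised to 1024.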

FAQ

What is a typical batch size?

Common batch sizes are powers of 2 (32, 64, 128, 256), chosen for hardware efficiency. The optimal choice depends on dataset size, model architecture, and available memory. Starting with 32 or 64 and adjusting based on training behaviour is a reasonable approach.
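How these sizes translate into updates per epoch is simple ceiling division (the last batch may be smaller than the rest). A quick sketch, using the glossary's 10,000-example dataset:

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # The final batch may be partial, hence ceiling division
    return math.ceil(num_examples / batch_size)

# Weight updates per epoch for common power-of-2 batch sizes
counts = {bs: steps_per_epoch(10_000, bs) for bs in (32, 64, 128, 256)}
```

Doubling the batch size roughly halves the number of updates per epoch, which is why larger batches often also call for a larger learning rate.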

Does a larger batch size make training faster?

Not necessarily. Larger batches process more data per step and can better utilise GPU parallelism, but they may require more steps to converge and careful learning-rate adjustment. The relationship between batch size and wall-clock training time is therefore not always straightforward.

What is gradient accumulation?

Gradient accumulation is a technique that simulates larger batch sizes by processing multiple smaller batches and accumulating their gradients before performing a weight update. This allows training with effectively large batches on hardware with limited memory.
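The idea can be shown in a few lines. This is a minimal NumPy sketch (same toy linear model and MSE loss as assumed above, hypothetical function name): gradients are summed across micro-batches, then a single averaged update is applied, matching what one large-batch step would do.

```python
import numpy as np

def accumulated_update(w, micro_batches, lr=0.1):
    """One effective large-batch update built from several micro-batches."""
    grad_sum = np.zeros_like(w)
    total = 0
    for Xb, yb in micro_batches:
        # Accumulate the (unaveraged) MSE gradient for this micro-batch
        grad_sum += 2 * Xb.T @ (Xb @ w - yb)
        total += len(yb)
    # Single weight update, averaged over the full effective batch
    return w - lr * grad_sum / total
```

Because only one micro-batch is in memory at a time, the peak memory footprint is that of the small batch, while the update itself behaves like a large-batch step.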

Need help implementing this?

Our team can help you apply these concepts to your business. Book a free strategy call.