GroveAI
Glossary

LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning technique that trains a small set of adapter weights instead of modifying all model parameters, making it possible to customise large AI models quickly and affordably.

What is LoRA?

LoRA (Low-Rank Adaptation) is a fine-tuning technique introduced by Microsoft researchers in 2021 that dramatically reduces the computational resources needed to adapt large language models. Instead of updating all of a model's billions of parameters during fine-tuning, LoRA freezes the original model and trains a small number of additional parameters — typically less than 1% of the original model's size. These additional parameters, called adapters, are structured as low-rank matrices that modify the model's behaviour. The result is a lightweight set of adapter weights (often just a few megabytes) that can be loaded on top of the base model to change its behaviour for a specific task, without altering the original model itself.

How LoRA Works

LoRA works by decomposing the weight updates during fine-tuning into two smaller matrices (a low-rank decomposition). Instead of learning a full update matrix with millions of parameters, LoRA learns two much smaller matrices whose product approximates the full update. This dramatically reduces the number of trainable parameters while preserving most of the benefit of full fine-tuning. The rank parameter (r) controls the adapter's capacity — higher ranks can capture more complex adaptations but require more memory and compute. In practice, ranks of 8-64 work well for most tasks, representing a tiny fraction of the original model's parameters. QLoRA extends this approach by quantising the base model to 4-bit precision before applying LoRA adapters. In the original QLoRA work, this made it possible to fine-tune a 65-billion-parameter model on a single 48GB GPU, a task that would otherwise require a cluster of expensive hardware.
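The decomposition above can be sketched in a few lines of numpy. This is a minimal illustration of a single adapted layer, not any real model: the layer dimensions, rank, and alpha value are arbitrary, and B is initialised to zero so the adapted layer starts out identical to the frozen base layer (the common LoRA initialisation).

```python
import numpy as np

# Minimal sketch of one LoRA-adapted dense layer.
# Dimensions, rank, and alpha are illustrative, not from any real model.
d_out, d_in, r = 4096, 4096, 8   # layer size and adapter rank
alpha = 16                       # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))       # frozen base weights (not trained)

# Trainable adapter factors. B starts at zero, so initially the
# adapted layer behaves exactly like the base layer.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def adapted_forward(x):
    # Base output plus the low-rank correction, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

# Full fine-tuning would train d_out * d_in parameters for this layer;
# LoRA trains only the two small factors.
full_update_params = d_out * d_in
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_update_params:.4%}")
```

With these illustrative shapes the adapter is under half a percent of the layer's parameters, which is where the "typically less than 1%" figure comes from when summed across the adapted layers of a real model.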

Why LoRA Matters for Business

LoRA democratised model customisation. Before LoRA, fine-tuning a large language model required expensive GPU clusters and significant engineering expertise. With LoRA, organisations can create custom models on modest hardware in hours rather than days, at a fraction of the cost. This has practical implications for businesses that need domain-specific AI. A law firm can fine-tune a model to understand legal terminology and conventions. A healthcare provider can adapt a model to their clinical workflows. A software company can create models that follow their coding standards — all without massive infrastructure investments. LoRA also enables organisations to maintain multiple task-specific adaptations of the same base model. Since adapters are small, a single base model can be paired with different adapters for different tasks, reducing the total infrastructure required.

Practical Applications

LoRA is the most widely used fine-tuning technique for large language models. It is used to adapt models for specific output formats (structured JSON, medical reports, legal briefs), teach domain-specific terminology and reasoning patterns, adjust model tone and style to match brand guidelines, and improve performance on specialised tasks like code generation in specific languages. The adapter-based approach also enables model personalisation — different users or departments can have different LoRA adapters loaded for the same base model, each tailored to their specific needs. This is more efficient than running multiple copies of a full model.
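The adapter-switching idea in the last paragraph can be sketched directly: one frozen base weight matrix shared across tasks, with a small (B, A) pair swapped in per request. Everything here is hypothetical (the adapter names, shapes, and random values are illustrative), but the structure mirrors what adapter-serving setups do.

```python
import numpy as np

# Sketch of serving one frozen base layer with per-task LoRA adapters.
# Names, shapes, and values are illustrative.
d, r, alpha = 512, 8, 16
rng = np.random.default_rng(1)
W_base = rng.standard_normal((d, d))   # shared, frozen base weights

def make_adapter(seed):
    # Stand-in for a trained adapter: a small (B, A) factor pair.
    g = np.random.default_rng(seed)
    return (g.standard_normal((d, r)) * 0.01, g.standard_normal((r, d)) * 0.01)

# Hypothetical task-specific adapters, e.g. one per department.
adapters = {"legal": make_adapter(2), "clinical": make_adapter(3)}

def forward(x, adapter_name=None):
    y = W_base @ x
    if adapter_name is not None:
        B, A = adapters[adapter_name]
        y = y + (alpha / r) * (B @ (A @ x))
    return y

x = rng.standard_normal(d)
y_legal = forward(x, "legal")        # same base weights,
y_clinical = forward(x, "clinical")  # different behaviour per adapter
```

Because only the small factor pairs differ between tasks, the memory cost of adding a task is the adapter, not another copy of the model.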

FAQ

How much does LoRA fine-tuning cost?

LoRA fine-tuning can be done on a single consumer GPU (16-24GB VRAM) in a few hours for most tasks. Cloud compute costs are typically under 100 USD for a training run. The main investment is in data preparation rather than compute.

What is the difference between LoRA and QLoRA?

LoRA trains adapter weights on a full-precision base model. QLoRA quantises the base model to 4-bit precision first, then applies LoRA adapters. QLoRA uses significantly less memory, making it possible to fine-tune larger models on smaller GPUs with comparable results.
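The memory saving is easy to see with back-of-envelope arithmetic for the base weights alone (this deliberately ignores activations, gradients, and optimiser state, which QLoRA keeps small because the quantised base is frozen; the 65B figure is just an example size):

```python
# Rough memory needed to hold base-model weights at a given precision.
# Weights only; activations, gradients, and optimiser state excluded.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

n = 65e9                   # e.g. a 65-billion-parameter model
print(weight_gb(n, 16))    # fp16: ~130 GB, needs multiple GPUs
print(weight_gb(n, 4))     # 4-bit: ~32.5 GB, within reach of one large GPU
```

Quartering the bits per weight quarters the footprint of the frozen base, which is what moves large-model fine-tuning from a cluster to a single card.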

Can multiple LoRA adapters be combined?

Yes. Multiple LoRA adapters can be merged into the base weights or switched between at inference time. This makes it practical to maintain a library of task-specific adapters that can be applied to a single base model as needed, without reloading the entire model.
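Merging works because the low-rank product is just another weight matrix, so it can be folded into the base weights once and served with no extra matmul at inference. A small numpy sketch (all shapes and values illustrative):

```python
import numpy as np

# Sketch of merging a LoRA adapter into the base weights.
# Shapes and values are illustrative.
d_out, d_in, r, alpha = 256, 256, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))       # frozen base weights
B = rng.standard_normal((d_out, r)) * 0.01   # stand-in for trained factors
A = rng.standard_normal((r, d_in)) * 0.01

# Fold the low-rank update into a single dense matrix.
W_merged = W + (alpha / r) * (B @ A)

# The merged layer matches base-plus-adapter in one matmul.
x = rng.standard_normal(d_in)
base_plus_adapter = W @ x + (alpha / r) * (B @ (A @ x))
```

The trade-off: a merged matrix is fast to serve but fixed, whereas keeping adapters separate preserves the ability to switch tasks on the shared base.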
