GroveAI
Glossary

Mixture of Experts (MoE)

Mixture of Experts is a neural network architecture that divides the model into specialised sub-networks (experts) and uses a routing mechanism to activate only the most relevant experts for each input, achieving high capability with lower computational cost.

What is Mixture of Experts?

Mixture of Experts (MoE) is an architecture in which a model contains multiple specialised sub-networks ("experts") rather than a single monolithic network. For each input, a learned routing mechanism selects a small subset of experts to process the data, while the remaining experts stay inactive, so only a fraction of the model's total parameters are used for any given input. The result is a model with the knowledge capacity of a very large network but the per-token compute cost of a much smaller one. A MoE model with 100 billion total parameters might activate only 15 billion per token, approaching the capability of a dense 100B model at roughly the per-token compute cost of a 15B one.
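As a back-of-envelope illustration, here is the compute gap implied by the 100B-total / 15B-active figures above, using the common rough approximation of about 2 FLOPs per parameter per token for a forward pass (a simplification, not an exact cost model):

```python
# Rough per-token compute: ~2 FLOPs per parameter actually used.
FLOPS_PER_PARAM = 2

def per_token_flops(active_params: float) -> float:
    """Approximate forward-pass FLOPs for a single token."""
    return FLOPS_PER_PARAM * active_params

dense = per_token_flops(100e9)  # dense model: all 100B parameters used
moe = per_token_flops(15e9)     # MoE: only the 15B active parameters used

print(f"MoE uses ~{moe / dense:.0%} of the dense model's per-token compute")
```

The memory story is different: all 100B parameters must still be stored, as discussed under Practical Implications.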

How MoE Works

In a MoE transformer, certain layers (typically the feed-forward layers) are replaced with a set of expert networks and a gating (routing) mechanism. When a token is processed, the router scores each expert's relevance and directs the token to the top-k experts (usually 1 or 2 out of 8-16 per layer). The router is trained alongside the experts, learning to specialise different experts for different types of inputs: some may become specialists for code, others for mathematics, others for different languages or domains. This emergent specialisation allows the model to develop deep expertise across many areas without activating all of that knowledge for every query.

Notable MoE models include Mixtral (by Mistral AI), which uses 8 experts per layer and activates 2 per token, and GPT-4, which is widely believed to use a MoE architecture. The approach has become increasingly popular as a way to scale model capability without proportionally scaling inference cost.
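The routing step described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not any production implementation: the "experts" are plain functions, the router's scores are supplied directly, and softmax gating over the selected top-k experts is one common design choice:

```python
import math

def route(logits, k=2):
    """Pick the top-k experts by router score and renormalise
    their gate weights with a softmax over just those k."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

def moe_layer(token, logits, experts, k=2):
    """Weighted sum of the top-k experts' outputs.
    Experts outside the top-k are never evaluated."""
    return sum(w * experts[i](token) for i, w in route(logits, k))

# Toy demo: 4 "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 1.5]  # router scores for one token
print(moe_layer(10.0, logits, experts, k=2))  # ≈ 27.55
```

Only experts 1 and 3 (the two highest-scoring) run here; the other two contribute no compute, which is the source of the efficiency described above.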

Why MoE Matters for Business

MoE architectures offer a compelling cost-performance trade-off. Organisations get access to models with the knowledge and capability of very large networks while paying the inference cost of much smaller ones. This makes high-capability AI more accessible and affordable. For businesses evaluating AI models, understanding MoE is important because it explains why some models offer surprisingly good performance relative to their active parameter count. A MoE model with 47 billion active parameters (out of 141 billion total) can outperform dense models several times its active size. However, MoE models have trade-offs. They require more total memory (to store all experts, even if only some are active) and can be more complex to deploy and optimise. Understanding these trade-offs helps in making informed infrastructure and model selection decisions.

Practical Implications

When deploying MoE models, organisations should be aware that memory requirements are based on total parameters, not active parameters. A model with 141B total parameters needs the memory to store all of those parameters, even though only 47B are active per inference. This can be mitigated through quantisation and expert-offloading techniques. MoE models also tend to have different latency characteristics from dense models: because tokens in a batch are routed to different experts, throughput depends on how evenly the routing load is balanced, and performance can vary more across input types than it does for a dense model. Testing with representative workloads is therefore important for capacity planning.
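To make the memory point concrete, here is a minimal sketch of the weight-memory arithmetic, using the illustrative 141B figure from the text. This counts weights only; real deployments also need memory for the KV cache, activations, and runtime overhead:

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB.
    For MoE, total_params is ALL experts, not only the active ones."""
    return total_params * bytes_per_param / 1e9

TOTAL = 141e9  # illustrative total parameter count from the text

# Common precisions: fp16 (2 bytes/param), int8 (1), int4 (0.5)
for label, bytes_pp in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: ~{weight_memory_gb(TOTAL, bytes_pp):.0f} GB")
```

At fp16 this is roughly 282 GB for weights alone, which is why quantisation and expert offloading matter so much for MoE serving.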

Frequently asked questions

Does GPT-4 use a Mixture of Experts architecture?

While OpenAI has not officially confirmed it, multiple credible reports and analyses indicate that GPT-4 uses a MoE architecture. This would explain its high capability combined with relatively manageable inference costs compared to what a dense model of equivalent performance would require.

Do MoE models need more memory than their active parameter count suggests?

Yes. All expert parameters must be stored in memory, even though only a subset is used per input. A 141B MoE model needs memory for all 141B parameters, not just the 47B that are active. However, the compute cost per inference is based on the active parameters, which is the source of MoE's efficiency advantage.

Can MoE models be fine-tuned?

Yes, though it requires some additional considerations. LoRA and other parameter-efficient techniques work with MoE models. Some approaches fine-tune only the router or specific experts rather than all experts, which can be more efficient and targeted.

Need help implementing this?

Our team can help you apply these concepts to your business. Book a free strategy call.