Mixture of Experts (MoE)
Mixture of Experts is a neural network architecture that divides the model into specialised sub-networks (experts) and uses a routing mechanism to activate only the most relevant experts for each input, achieving high capability at a lower computational cost than a dense model of comparable performance.
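The routing mechanism can be sketched in a few lines of plain Python. This is an illustrative toy, not a production implementation: the router scores, the choice of top-2 selection, and the four toy "experts" are all invented for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over the router's per-expert scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k=2):
    # Pick the k highest-scoring experts and renormalise their weights
    # so they sum to 1. Only these k experts run on this input; the
    # rest are skipped, which is where the compute savings come from.
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / weight_sum) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    # Combine only the selected experts' outputs, weighted by the router.
    selection = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in selection)

# Toy example: 4 "experts", each a simple function; the router scores
# say experts 2 and 0 are the most relevant for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x / 2]
scores = [1.5, 0.2, 2.0, -1.0]
result = moe_forward(10.0, experts, scores)
```

In a real MoE layer the experts are feed-forward networks and the router is itself a learned linear layer, but the control flow is the same: score, select a few, run only those, and mix the results.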
Frequently asked questions
Does GPT-4 use a Mixture of Experts architecture?
While OpenAI has not officially confirmed it, multiple credible reports and analyses indicate that GPT-4 uses a MoE architecture. This would explain its high capability combined with relatively manageable inference costs compared to what a dense model of equivalent performance would require.
Does an MoE model need memory for all of its parameters?
Yes. All expert parameters must be stored in memory, even though only a subset is used per input. A 141B-parameter MoE model needs memory for all 141B parameters, not just the 47B that are active. The compute cost per inference, however, scales with the active parameters, which is the source of MoE's efficiency advantage.
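The arithmetic behind this answer is easy to make concrete. The sketch below reuses the 141B-total / 47B-active figures from the answer and assumes 2-byte (fp16/bf16) weights plus the common rule of thumb of roughly 2 FLOPs per active parameter per token; both are simplifying assumptions, and real deployments vary with quantisation and architecture.

```python
BYTES_PER_PARAM = 2  # assumed fp16/bf16 weights; quantisation would shrink this

def memory_gb(total_params_billions):
    # Memory must hold *all* experts, active or not.
    return total_params_billions * 1e9 * BYTES_PER_PARAM / 1e9

def flops_per_token(active_params_billions):
    # Rough rule of thumb: ~2 FLOPs per active parameter per token.
    return 2 * active_params_billions * 1e9

print(f"Memory needed:  ~{memory_gb(141):.0f} GB")        # driven by all 141B params
print(f"Compute/token:  ~{flops_per_token(47):.1e} FLOPs")  # driven by the 47B active
```

So the model pays a dense model's memory bill (~282 GB at fp16) while paying only a 47B model's compute bill per token.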
Can MoE models be fine-tuned?
Yes, though it requires some additional considerations. LoRA and other parameter-efficient techniques work with MoE models. Some approaches fine-tune only the router or specific experts rather than all experts, which can be more efficient and targeted.
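One of the targeted approaches mentioned above, router-only fine-tuning, can be sketched abstractly. Everything here (the `Param` and `MoELayer` classes and the parameter counts) is hypothetical scaffolding to illustrate the idea of freezing experts and counting trainable parameters; it is not the API of any real library.

```python
from dataclasses import dataclass

@dataclass
class Param:
    size: int            # number of weights in this parameter group
    trainable: bool = True

@dataclass
class MoELayer:
    router: Param
    experts: list        # list of Param, one per expert

    def freeze_experts(self):
        # Router-only fine-tuning: adapt how inputs are dispatched,
        # while leaving the experts' learned knowledge untouched.
        for p in self.experts:
            p.trainable = False

    def trainable_params(self):
        groups = [self.router] + self.experts
        return sum(p.size for p in groups if p.trainable)

# Hypothetical layer: a tiny router in front of 8 one-million-weight experts.
layer = MoELayer(router=Param(4_096), experts=[Param(1_000_000) for _ in range(8)])
before = layer.trainable_params()   # 8_004_096: router + all experts
layer.freeze_experts()
after = layer.trainable_params()    # 4_096: only the router still trains
```

The same freezing pattern underlies expert-specific fine-tuning, just with the roles reversed: freeze the router and all but the chosen experts.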