The sudden rise of Chinese AI start-up DeepSeek has taken the AI industry by surprise. Despite being just two years old, the company's large language models (LLMs) are on par with those of AI giants like OpenAI, Google DeepMind, xAI, and others. DeepSeek's R1 and V3 models have matched or outperformed OpenAI's GPT-4o and o1-preview, Google's Gemini 1.5 Pro and Flash, and Anthropic's Claude 3.5 Sonnet across various benchmarks.
Beyond being an outsider to the Bay Area in the race to build a strong foundation model, what makes DeepSeek's achievement even more remarkable is that its models cost significantly less to train than those of OpenAI or Google.
DeepSeek's ability to stay economical without sacrificing accuracy is largely due to its use of the Mixture of Experts (MoE) architecture. The R1 and V3 models are built on an MoE architecture with 671 billion total parameters, of which only around 37 billion are activated for any given token, allowing them to deliver strong results while significantly reducing computational training costs.
What Is MoE?
MoE is a machine learning technique that improves the system’s efficiency by distributing tasks among multiple specialised models (or 'experts'). Instead of having one large model handle everything, MoE selects the most relevant experts for each input using a 'gating network,' which optimises performance and reduces computational costs.
Imagine this: You have a panel of experts in front of you, each specialising in a particular domain. A moderator coordinates the panel, acting as a bridge between you and them. Based on your question or prompt, the moderator selects the expert best suited to respond according to their specialisation.
For instance, if you have a question about medicine, the moderator directs it to the medical expert, who then provides you with a detailed and accurate response. This approach is far more effective than relying on a single individual with generalised knowledge across all domains.
This is essentially how the MoE architecture works. The 'experts' are smaller sub-networks that come to specialise in particular kinds of input, while the moderator is the gating network, which is responsible for selecting the best experts for a given input.
It is important to note that the MoE architecture is not a new concept. It was first proposed in the early 1990s and has since been deployed in various AI applications. Google, for example, has used MoE in models such as the Switch Transformer to improve scalability and efficiency.
How Does MoE Work?
MoE operates in two phases: training and inference. In the training phase, the experts and the gating network are trained together: the gating network learns to select the most relevant experts for each input, while the experts gradually specialise in different aspects of the data.
Because both parts are optimised jointly, better routing and better experts reinforce one another, improving the overall performance of the system.
In the inference phase, the gating network directs inputs to the most suitable experts based on learned probability distributions. To ensure efficiency, only a few experts are activated per input. Their outputs are then merged, typically through weighted averaging, to produce a final result.
This selective activation allows MoE to scale efficiently while maintaining accuracy, making it a viable solution for managing large AI models.
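To make the mechanics concrete, here is a minimal sketch of a single MoE layer in Python with NumPy. It is purely illustrative, not DeepSeek's implementation: the expert and gate weights are random stand-ins for learned parameters, and real systems route whole batches of tokens inside transformer layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only.
D_MODEL = 8        # input/output feature size
N_EXPERTS = 4      # number of experts in the layer
TOP_K = 2          # experts activated per input

# Each 'expert' here is a random linear map standing in for a feed-forward sub-network.
expert_weights = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]

# The gating network is a learned linear layer followed by a softmax over experts.
gate_weights = rng.normal(size=(D_MODEL, N_EXPERTS))


def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one input vector through its top-k experts and merge their outputs."""
    # 1. Gating: score every expert for this input (softmax over expert logits).
    logits = x @ gate_weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # 2. Selective activation: keep only the top-k scoring experts.
    top_idx = np.argsort(probs)[-TOP_K:]
    top_probs = probs[top_idx] / probs[top_idx].sum()  # renormalise the kept weights

    # 3. Weighted averaging of the selected experts' outputs.
    output = np.zeros(D_MODEL)
    for weight, idx in zip(top_probs, top_idx):
        output += weight * (x @ expert_weights[idx])
    return output


x = rng.normal(size=D_MODEL)
print(moe_layer(x))
```

Only two of the four experts run for this input; the other two contribute no compute at all, which is where the savings come from.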
Difference Between MoE and Monolithic Architecture
The key difference between Mixture of Experts (MoE) architecture and monolithic architecture in AI lies in how they distribute computational workload and model parameters.
MoE consists of multiple specialised models, known as 'experts,' with a gating mechanism that selects the most relevant experts for each input. In contrast, monolithic architectures rely on a single, unified model where the same set of parameters is used for all inputs, regardless of their complexity.
MoE architectures enhance efficiency by activating only a subset of experts for each input, reducing computational costs while maintaining high parameter capacity. In contrast, monolithic architectures engage all parameters at once, making them simpler but increasingly expensive as model size grows.
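A rough back-of-the-envelope calculation shows why this matters. The sketch below uses the figures DeepSeek reports for V3 (671 billion total parameters, roughly 37 billion activated per token); a monolithic model of the same size would have to engage all of its parameters for every token.

```python
# Rough illustration of why sparse activation cuts per-token compute.
TOTAL_PARAMS = 671e9    # total parameters across all experts (DeepSeek-V3)
ACTIVE_PARAMS = 37e9    # parameters activated per token (as reported by DeepSeek)

# A dense (monolithic) model of the same size uses every parameter for every token,
# so the per-token compute ratio is roughly active / total.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Parameters used per token: {active_fraction:.1%}")                      # ~5.5%
print(f"Per-token compute vs an equally large dense model: ~{1 / active_fraction:.0f}x less")
```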
MoE improves performance by allowing experts to specialise in different types of data, making it ideal for multi-task and multimodal learning. Monolithic models, however, learn all patterns within a single structure, which can limit scalability and efficiency in handling multiple objectives.
Why Is MoE Not Widely Adopted?
Even though the MoE architecture generally offers greater flexibility and scalability than a monolithic architecture, most start-ups initially choose the monolithic approach. It allows for faster development and simpler deployment, and it suits smaller, early-stage projects where managing multiple expert models would be unnecessarily complex.
AI start-ups often avoid MoE due to its complicated implementation, which requires specialised infrastructure and advanced training techniques that many lack the resources to develop. While MoE reduces inference costs, it increases training complexity and expenses, demanding large-scale infrastructure and engineering efforts.
Additionally, optimisation and debugging are more challenging: issues such as expert under-usage and routing collapse, where the gate keeps sending most inputs to the same few experts, can make training unpredictable. MoE can also add inference latency, since selecting and routing inputs to experts introduces overhead, making it less suitable for latency-sensitive, real-time applications. With limited pretrained models and tooling available, start-ups tend to favour monolithic architectures for their simplicity, though some transition to MoE as they scale.
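One concrete example of the extra engineering effort is load balancing. Without it, the gating network can collapse onto a few favoured experts while the rest sit idle. The sketch below computes an auxiliary balance loss in the spirit of Google's Switch Transformer (the number of experts times the sum, over experts, of the fraction of tokens routed to each expert multiplied by its mean gate probability); the gate probabilities are random placeholders rather than real model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOKENS, N_EXPERTS = 1024, 8

# Placeholder gate probabilities for a batch of tokens (each row sums to 1).
logits = rng.normal(size=(N_TOKENS, N_EXPERTS))
gate_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Top-1 routing: each token is dispatched to its highest-scoring expert.
assignments = gate_probs.argmax(axis=1)

# f_i: fraction of tokens dispatched to expert i.
tokens_per_expert = np.bincount(assignments, minlength=N_EXPERTS) / N_TOKENS
# P_i: mean gate probability assigned to expert i.
mean_gate_prob = gate_probs.mean(axis=0)

# Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.
# It is minimised when routing is uniform, penalising collapse onto a few experts.
aux_loss = N_EXPERTS * np.sum(tokens_per_expert * mean_gate_prob)
print(f"Load-balancing auxiliary loss: {aux_loss:.3f}  (1.0 = perfectly balanced)")
```

Terms like this have to be added to the training objective and tuned, which is part of why MoE training is harder to get right than training a single dense model.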