Mixture of Experts (MoE): Gain effective results from LLMs without additional training
👷♂️ Software Architecture Series — Part 26.
💹 Struggling with the cost of additional training of your LLMs?
🕵♂️ Try Mixture of Experts (MoE).
Background:
With new LLMs with ever-larger parameter counts arriving every other day, computational cost keeps rising. More parameters mean more computation, both to adjust the weights during training and to run the model at inference time. They also increase the memory needed to store model weights and activations.
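To make the memory point concrete, here is a rough back-of-the-envelope sketch. It counts only the memory for the weights themselves and assumes 16-bit weights (2 bytes per parameter). The function name and the model sizes are illustrative assumptions, not figures from any specific model.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Hypothetical model sizes with 16-bit weights (2 bytes per parameter).
for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{weight_memory_gb(n):.0f} GB of weights")
```

Training needs considerably more than this, since gradients, optimizer state and activations also have to be stored.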
When training LLMs, distributed training techniques such as data parallelism are used to train a model across multiple processing units. However, as model size increases, the overhead of synchronizing updates between those units also increases, which in turn adds to the overall training cost.
Why does MoE work?
LLMs such as Grok-1 and Mixtral 8x7B use the Mixture of Experts technique, in which the computation is distributed across multiple experts. This work can be parallelized across different processing units, which can speed up training. But what really makes the technique effective at reducing computational cost is the way the experts are used.
Rather than using dense feed-forward network (FFN) layers, these models use sparse layers made up of several similarly structured neural networks, where each expert is simply a feed-forward network with its own independent set of parameters.
The basic idea is pretty simple. The different experts specialize in different aspects of the data or different parts of the input space. By combining the predictions of these experts, the model can achieve better overall performance than any single expert alone.
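In symbols, one common way to write this combination is a weighted sum of the expert outputs. The notation here is an assumption chosen for illustration: E_i is the i-th of N experts, x is the input, and g_i(x) is the weight the gating function assigns to expert i.

```latex
y = \sum_{i=1}^{N} g_i(x)\, E_i(x),
\qquad g_i(x) \ge 0,
\qquad \sum_{i=1}^{N} g_i(x) = 1
```

When most of the g_i(x) are exactly zero, the corresponding experts never need to be evaluated, which is where the savings come from.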
However, for a given input, the opinions of all the experts may not be required.
For example, if the query is about sports, and tennis in particular, a sports LLM does not need to call on its football expert to help formulate an answer to a question that is specific to tennis.
Hence, for a given input, a routing mechanism sparsely selects the set of experts to which each token is sent. The router that makes this decision is composed of learned parameters and is trained together with the rest of the model. During pretraining, the router's parameters are initialized randomly and then updated through backpropagation on the training data, which lets the router learn to route tokens to experts based on patterns in the data. Dynamic routing strategies also allow the model to allocate resources adaptively based on the input data and the current state of the model. Routing algorithms in use include k-means clustering, hashing, and linear assignment to maximize token-expert affinities.
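The routing step can be sketched in a few lines of plain Python. This is a minimal illustration of one common choice, top-k routing with a softmax over the selected experts' scores, and not the implementation of any particular model. The names route_top_k and moe_layer, the random router weights, and the toy "experts" that merely scale their input are all assumptions made for the example. In a real model the router weights are learned and each expert is a feed-forward network.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k best-scoring experts.

    router_weights[i] is the score vector for expert i; an expert's score is
    the dot product of that vector with the token vector.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # normalize over chosen experts only
    return list(zip(top, gates))


def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(token, router_weights, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 4 experts over 3-dimensional tokens (hypothetical values).
random.seed(0)
dim, n_experts = 3, 4
router_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
# Stand-in experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(n_experts)]

token = [0.5, -1.0, 2.0]
print(route_top_k(token, router_weights))
print(moe_layer(token, router_weights, experts))
```

With k=2 and 4 experts, only half of the expert networks run for each token. The other two are never evaluated.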
By using only a subset of parameters for each token during inference (sparsity), MoE-backed models can significantly reduce computational cost with little or no loss in prediction accuracy. This approach is particularly useful for deploying large-scale models like Grok-1, which has a massive number of parameters.
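The saving is easy to see with some simple arithmetic. The sketch below separates the parameters every token uses (attention, embeddings, the router) from the parameters that live inside the experts. The function name and all the numbers are hypothetical, chosen only to illustrate the calculation. They are not the figures of any released model.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, k_active):
    """Return (total, active_per_token) parameter counts for a sparse MoE model.

    shared_params: parameters every token uses (attention, embeddings, router).
    params_per_expert: parameters in one expert, summed over all MoE layers.
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k_active * params_per_expert
    return total, active


# Hypothetical numbers for illustration only.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    n_experts=8,
    k_active=2,
)
print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
print(f"fraction of parameters used per token: {active / total:.0%}")
```

With these made-up numbers, 12B of the 42B parameters take part in computing any one token. Compute per token follows the active count, while memory still follows the total, since every expert has to be loaded.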
Moreover, in recent work, researchers have been exploring strategies for sharing parameters (information) between experts at a lower level. Instead of each expert having an entirely separate set of parameters, the experts share a common set, which can yield significant memory savings, especially in large-scale models with many experts. Encouraging experts to share parameters can also help the LLM learn more robust and generalizable representations that capture common patterns in the data rather than memorizing specific instances.
It is important to note that although MoE models are compute-efficient during pretraining, because they distribute computation across multiple experts, they have historically struggled to generalize during fine-tuning, which can lead to overfitting. Fine-tuning adapts the pretrained model to a specific task or dataset by training it further on task-specific data. MoE models may have difficulty capturing task-specific patterns or nuances, particularly if the distribution of expertise required for the target task differs from that of the pretraining data. Several techniques based on data augmentation and regularization are being researched to mitigate this issue, but no general consensus on a solution has emerged so far.
So, this leads to the question: when should you use MoE models rather than regular dense models?
Sparse MoE models are beneficial when high throughput is crucial and the training budget is fixed, such as large-scale deployment across many machines or serving a high volume of concurrent requests. Their ability to distribute computation across multiple experts and exploit parallel processing makes them well suited to handling large-scale tasks efficiently. In contrast, dense models may be preferable where throughput is less critical, such as research experiments or applications with low traffic. It is important to note that the available MoE models have higher memory requirements than dense models with comparable per-token compute, because every expert must be kept in memory even though only a few are active for any given token.
Additional Reading:
https://stackoverflow.blog/2024/04/04/how-do-mixture-of-experts-layers-affect-transformer-models/?utm_campaign=the-overflow-newsletter