large-language-model · mistral-7b · mixtral-8x7b · mixture-of-experts-model · lm-studio

What is the meaning of "Experts to Use" in a Mixture-of-Experts model?


I'm using Mixtral 8x7B, a Mixture-of-Experts model, to translate low-resource languages, and I'm getting decent results.

LM Studio gives the option to "use" 0-8 experts, and I'm unclear on the semantics of this option. When I use 2, I get great results; when I use 1 or 3 (or 8...), the results are worse. Quality doesn't improve linearly with the number of experts - 2 seems to be the sweet spot.

What are the semantics of the "Experts to Use" option, in this context, and what would explain 2 being an optimal number?


Solution

  • This reddit post should answer this

    https://www.reddit.com/r/LocalLLaMA/comments/18h5p1v/mixtral_still_works_well_when_using_1_expert_per/

    and to understand what mixture of experts is, you can look here
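    As for the mechanics: in each Mixtral MoE layer, a small router scores all 8 experts for every token and sends the token only through the top-k of them, mixing their outputs with the renormalised router weights. "Experts to Use" sets that k. Mixtral 8x7B was trained with k = 2, so matching it at inference generally works best: k = 1 drops one of the two experts the router wanted, while a larger k mixes in lower-ranked experts the model never learned to combine, which can hurt quality. Below is a minimal, hypothetical sketch of top-k routing in PyTorch - not Mixtral's actual implementation (the class name, layer sizes, and expert structure are made up), just an illustration of what the knob controls.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Simplified Mixture-of-Experts layer with top-k routing (illustrative only)."""

        def __init__(self, hidden=16, num_experts=8, experts_to_use=2):
            super().__init__()
            self.gate = nn.Linear(hidden, num_experts, bias=False)   # router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
                for _ in range(num_experts)
            )
            self.k = experts_to_use  # the "Experts to Use" knob

        def forward(self, x):  # x: (tokens, hidden)
            scores = self.gate(x)                                 # router score per expert, per token
            topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep only the k best experts per token
            weights = F.softmax(topk_scores, dim=-1)              # renormalise over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = topk_idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    tokens = torch.randn(5, 16)
    layer = TopKMoE(experts_to_use=2)   # each token is processed by 2 of the 8 experts
    print(layer(tokens).shape)          # torch.Size([5, 16])
    ```

    Note that each token still only activates k experts regardless of how many exist in total, so k = 2 costs roughly twice the expert compute of k = 1 per token; raising k beyond the value used in training adds compute without necessarily improving quality, which matches what you observed.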