I'm using Mixtral 8x7b, which is a Mixture of Experts model. I'm using it to translate low-resource languages, and getting decent results.
LM Studio exposes an option to "use" 0-8 experts, and I'm unclear on the semantics of this setting. With 2, I get great results; with 1 or 3 (or 8...), the results are noticeably worse. Quality doesn't improve linearly with the number of experts: 2 seems to be the sweet spot.
What are the semantics of the "Experts to Use" option in this context, and what would explain 2 being the optimal number?
This Reddit post should answer this, and for background on what a mixture of experts is, you can look here.

In short: each transformer layer in Mixtral contains 8 parallel feed-forward "expert" networks plus a small router that scores them per token. The "Experts to Use" setting controls k in top-k routing: for every token, the router picks the k highest-scoring experts and combines their outputs, weighted by the re-normalized router scores. Mixtral 8x7B was trained with k=2, so the routing weights and expert combinations the model learned assume exactly two active experts per token. With k=1 you drop a contribution the model expects; with k=3 or more you blend in lower-scoring experts it was never trained to combine. Either way you move off the training distribution, which is why 2 is the sweet spot and why using more experts doesn't mean better output.
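To make the mechanics concrete, here is a minimal numpy sketch of top-k gating for a single token. All names (`moe_layer`, `router_w`, the toy linear experts) are illustrative assumptions, not actual Mixtral or LM Studio code; the softmax-over-selected-experts step follows the Mixtral paper's description of its routing.

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Route one token through its top-k experts (illustrative top-k gating).

    x        : (d,) token activation
    router_w : (n_experts, d) router weights, one row per expert
    experts  : list of callables, each mapping (d,) -> (d,)
    k        : the "Experts to Use" value -- experts activated for this token
    """
    logits = router_w @ x                      # one routing score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Output = weighted sum of the chosen experts' outputs
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 8 random linear "experts" on a 16-dim activation
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda x, W=rng.normal(size=(d, d)) / np.sqrt(d): W @ x)
           for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
x = rng.normal(size=d)
print(moe_layer(x, router_w, experts, k=2))
```

Note that changing k only changes which experts contribute at inference time and how much compute is spent per token; the trained weights are fixed. That's why a k the router wasn't trained for tends to degrade quality rather than add capacity.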