machine-learning, statistics, classification, probability, multiclass-classification

Does the number of classes in the target variable affect the accuracy of a classification model?


From my own experience, I have noticed that the accuracy of a classification model tends to decrease as the number of classes in the target variable increases. That is, the more classes the dependent variable has, the lower the model's accuracy. I don't know whether that drop was caused by the number of classes itself or by the imbalance between them (although an oversampling technique did help improve the model's performance a bit). My assumption is that a larger number of classes leads to smaller differences between the predicted class probabilities, so it becomes harder for a model to "confidently" pick out the exact class.

Is there a more concrete theoretical basis to explain the observation above?


Solution

  • The easiest way to understand what accuracy "means" with respect to the number of classes is by considering a random baseline. Guessing uniformly at random gives you 1/K accuracy, where K is the number of classes. So 50% for 2 classes, but just 10% for 10, and only 1% for 100.

    This shows that "60%" accuracy "means more" if you have more classes: a binary classifier with 60% accuracy is barely better than random, but achieving 60% accuracy for 100 classes is godlike (assuming the classes are relatively balanced), as the short simulation sketch below illustrates.
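
    A minimal sketch of that baseline, assuming balanced classes and uniform random guessing (the sample size and seed here are arbitrary choices for illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000  # number of samples, chosen arbitrarily for the simulation

    for k in (2, 10, 100):
        # balanced true labels and uniformly random "predictions"
        y_true = rng.integers(0, k, size=n)
        y_guess = rng.integers(0, k, size=n)
        acc = np.mean(y_true == y_guess)
        print(f"K = {k:>3}: random-guess accuracy ~ {acc:.3f} (expected {1 / k:.3f})")
    ```

    The simulated accuracies land close to 1/K (about 0.5, 0.1, and 0.01), which is the baseline any real classifier should be compared against before reading too much into a raw accuracy number.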