I am attempting to perform an image classification task on a dataset with $L$ classes. The network I am using is divided into a feature extractor and a classifier. When an image is passed through the feature extractor, a feature vector is extracted, and when this vector is passed to the classifier, it produces the classification result.
Now, assuming the network is perfectly trained, I have heard that for an image belonging to class $a$, the output feature vector from the feature extractor is not represented by a Gaussian distribution with mean $\mu_a$ and variance $\sigma_a^2$, but rather by a mixture of $L$ different Gaussian distributions. I find this concept difficult to understand.
Why does the distribution of features obtained from the feature extractor follow a Gaussian mixture, not a single Gaussian distribution?
That is very subjective. There are many kinds of feature extractors.
But if the main objective is classification, what you expect from a feature extractor is to find features that maximize the amount of information kept in a single feature, and especially the part of the information that permits classification.
So, for this task, you can see an ideal feature as a "mini-classifier" of its own.
If you are trying to build an algorithm to classify animal images into L species, some ideal features the algorithm could rely on would be "number of wings", "presence of a beak", "number of legs", or even non-discrete ones such as "length of the tail". (Of course, real algorithms do not extract that kind of high-level feature, but that is how it is often described, with features that make sense on their own. In reality, features don't need to make sense on their own, but they still need to carry compact and discriminant information, like those do.)
Then from those features you can try to decide which animal it is (if the features were that high-level, with that much self-meaning, a decision tree would do, as in the sketch below).
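Just to make that concrete, here is a toy sketch (my own illustration, with made-up species and feature values, not something the question's network actually produces): if features really were self-explanatory things like "number of wings" or "tail length", a tiny decision tree would already separate the classes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical hand-crafted features: [number_of_wings, number_of_legs, tail_length_cm]
X = np.array([
    [2, 2,  4],   # bird
    [2, 2,  6],   # bird
    [0, 4, 20],   # cat
    [0, 4, 22],   # cat
    [0, 4,  5],   # rabbit
    [0, 4,  4],   # rabbit
])
y = np.array(["bird", "bird", "cat", "cat", "rabbit", "rabbit"])

# Depth 2 is enough: one split on wings, one on tail length.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[0, 4, 19]]))  # -> ['cat'] : no wings, 4 legs, long tail
```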
So, you would expect the distribution of these features to be quasi-binary (or at least discrete).
You would expect "number of wings" to be either 0 or 2.
The same goes for non-discrete features (my example features are mostly discrete because I picked ones that make sense from a human point of view; in reality, of course, they are more like quantitative computation results, such as what you get from a discriminant component analysis, a distance to a k-means cluster, etc.).

Take "tail length" for example: if you were to build an animal classifier by hand and decided to use "tail length" as a feature, it would be because you expect tail length to be a good discriminant between some animals. In other words, you expect the distribution of tail length to concentrate around several distinct values, to be "almost discrete".
For example, you may say that if a four-legged furry cute animal has a tail of 20 cm it is a cat, and if its tail is 5 cm, it is a rabbit. But that criterion works because, at least on a subset of the images (those of cats and rabbits), there are lots of 5 cm tails (with some variation around that mean) and lots of 20 cm tails (with some variation around that mean).
So the distribution is a mixture of 2 Gaussians: the Gaussian of the rabbit tail lengths, and the Gaussian of the cat tail lengths.
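Here is a minimal sketch of that idea (my own numbers, assuming rabbit tails around 5 cm and cat tails around 20 cm): each class on its own is a single Gaussian, but the pooled "tail length" feature over both classes is a two-component mixture, and a fitted mixture model recovers the two class means.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
rabbit_tails = rng.normal(loc=5.0, scale=1.0, size=500)    # per-class: ~N(5, 1)
cat_tails    = rng.normal(loc=20.0, scale=2.0, size=500)   # per-class: ~N(20, 4)

# Pooled feature over both classes: this is what the "tail length" feature looks like
# on the whole dataset.
tails = np.concatenate([rabbit_tails, cat_tails]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(tails)
print(gmm.means_.ravel())   # ~[5, 20]: the two class means reappear as mixture components
print(gmm.weights_)         # ~[0.5, 0.5]
```

Any threshold between the two modes (say 12 cm) then acts as the "mini-classifier" described above.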
On the contrary, a feature whose distribution is a single Gaussian would not help much to discriminate. It might be a good feature on some other level. It may, for example, be the principal component of a PCA and summarize half of the information on its own. But that information is not really helpful for a classifier.
For example, for images, the principal component is often just the average luminosity (because a well-lit photo tends to have all its pixels brighter, and a dark one all its pixels darker, so there is a huge correlation between all pixels that is just the manifestation of the common "lighting" factor in all of them). So whatever dataset you use, a PCA on it is very likely to have as its first component something that is more or less just the general lighting, an average of all pixels.
It is the principal component, but not at all a discriminant one: it doesn't help at all to classify your images. And even if the information were relevant, what threshold could you use to discriminate with a single Gaussian?
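A rough sketch of that effect (under the assumption that the dominant source of variance in the synthetic "images" is a global lighting offset, which is what I am claiming happens in practice): the first PCA component ends up essentially identical to the mean pixel value, and its distribution is a single unimodal Gaussian with no obvious split point.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_images, n_pixels = 1000, 64 * 64
lighting = rng.normal(loc=0.0, scale=1.0, size=(n_images, 1))    # per-image brightness factor
content  = rng.normal(loc=0.0, scale=0.2, size=(n_images, n_pixels))
images = lighting + content        # every pixel shifted by the same lighting factor

pca = PCA(n_components=1).fit(images)
scores = pca.transform(images).ravel()
mean_brightness = images.mean(axis=1)

# The first component is almost exactly the mean brightness (|correlation| close to 1),
# and its histogram is one big Gaussian bump: no threshold separates anything.
print(abs(np.corrcoef(scores, mean_brightness)[0, 1]))
```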
I am not saying that all features are "almost discrete", that is, Gaussian mixtures. Just that if you had to write a classifier without ML, you would probably rely on features that have more than one "attractor" (such as 5 cm and 20 cm in my "tail length" example; if all animals had a tail length of 15 cm with some standard deviation around that mean, it would be a much less useful feature).
So, in short, you expect the ideal features for a classifier to be mini-classifiers, or sub-classifiers of their own, able to "classify" animals into subsets of classes, such as "short-tailed", "long-tailed", and "no tail" categories.