I’m building a text-to-speech model. Should my training dataset be “realistically” distributed (i.e. follow the same distribution as the data the model will be used on), or should it be uniformly distributed to make sure the model performs well on all kinds of sentences?
I’d say this depends on the dataset size. With a very small dataset (common in some domains, rare in others), you’d want to make sure that all the “important kinds of data” (whatever that means for your task) are represented, even if they’re relatively rare in the wild. A realistic distribution is better once the dataset is large enough that all the key scenarios are adequately represented anyway.
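As a minimal sketch of the small-dataset case: duplicate examples from under-represented categories until each one reaches some floor. The category labels and counts here are hypothetical, and in practice you might stratify by sentence type, phoneme coverage, speaker, etc.

```python
import random
from collections import Counter

def oversample_rare(examples, label_fn, min_count, seed=0):
    """Duplicate examples from under-represented categories until each
    category has at least `min_count` items (naive oversampling)."""
    rng = random.Random(seed)
    by_label = {}
    for ex in examples:
        by_label.setdefault(label_fn(ex), []).append(ex)
    out = list(examples)
    for label, items in by_label.items():
        # Add random duplicates to cover the deficit, if any.
        for _ in range(min_count - len(items)):
            out.append(rng.choice(items))
    return out

# Hypothetical TTS corpus where questions are a rare sentence type.
corpus = [("statement", s) for s in range(18)] + [("question", q) for q in range(2)]
balanced = oversample_rare(corpus, label_fn=lambda ex: ex[0], min_count=6)
print(Counter(ex[0] for ex in balanced))  # questions oversampled from 2 up to 6
```

Duplicating items is the crudest option; sampling with per-category weights at batch time achieves the same effect without inflating the dataset on disk.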
Also, if mistakes on certain kinds of data matter more than on others (which is likely in some domains), it can make sense to overrepresent them: you’re then optimizing not for the average case of the real distribution, but for a weighted one that reflects the cost of errors.
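One way to get the same effect without resampling is to weight the loss per example, so that high-stakes cases pull the optimum toward themselves. The loss values and weights below are made up purely for illustration:

```python
def weighted_loss(per_example_losses, weights):
    """Weighted mean loss: upweighting high-stakes examples shifts the
    optimization target away from the plain average case."""
    total_w = sum(weights)
    return sum(l * w for l, w in zip(per_example_losses, weights)) / total_w

# Hypothetical per-example losses; the model is bad on the third,
# rare-but-critical case.
losses = [0.1, 0.1, 0.9]
uniform = weighted_loss(losses, [1, 1, 1])   # plain average, the bad case is diluted
weighted = weighted_loss(losses, [1, 1, 5])  # critical case now dominates the objective
```

Most frameworks support this directly (e.g. per-sample weights or reduction-free losses), so the choice between resampling the data and reweighting the loss is largely one of convenience.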
There’s also the case of targeted annotation: you look at the errors your model is actually making and annotate extra data specifically to overrepresent those cases. Some types of data happen to be both very common and trivial to solve, so adding extra training data for them takes effort but barely changes the results.
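The selection step of that loop can be sketched as ranking unlabeled candidates by model error and skipping the ones the model already handles. The error scores and threshold here are hypothetical; in practice the scores would come from an eval pass over held-out or unlabeled data:

```python
def select_for_annotation(candidates, error_fn, budget, easy_threshold=0.05):
    """Keep the `budget` candidates with the highest model error,
    dropping cases the model already solves (error below threshold) --
    annotating those costs effort without moving the metrics."""
    hard = [ex for ex in candidates if error_fn(ex) > easy_threshold]
    return sorted(hard, key=error_fn, reverse=True)[:budget]

# Hypothetical per-sentence error scores from an evaluation run.
errors = {"s1": 0.01, "s2": 0.40, "s3": 0.90, "s4": 0.02, "s5": 0.55}
picked = select_for_annotation(errors, error_fn=errors.get, budget=2)
print(picked)  # ['s3', 's5']
```

This is essentially uncertainty- or error-driven active learning in miniature: the annotation budget goes where the model is weak, not where data is easiest to collect.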