When you split an algorithm/function/whatever to run on separate threads, say I launch 8 threads, you don't know that each thread will run on a separate one of my 8 cores, since it's the scheduler's job to decide which threads go to which cores. Still, if I have 8 cores and I split a job into 8 threads, I pretty much expect that to happen: each of my 8 cores will take (roughly) one eighth of the workload. On Intel processors that have P and E cores (performance and efficiency cores), the P cores might be clocked at 5.4 GHz and the E cores at 4.2 GHz. Does this kind of processor, with two different types of cores, make multithreaded programming more unpredictable or less desirable? The two-tier design is also common in other devices, such as smartphones and Apple CPUs, and the same question applies. As a programmer, how are you supposed to account for the fact that work you run on a different thread, whether you spawn a new thread or hand a job to a thread waiting in a thread pool, may run on either a performance core or an efficiency core? Do you have any choice?
If you divide your workload into equal-size parts, the P-cores will finish first and then sit idle while the E-cores catch up. But if you divide it into smaller parts and have each thread grab another chunk when it's done with its previous one, like OpenMP schedule(dynamic) instead of schedule(static), you can keep all cores busy until all the work is done.
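To make the "grab another chunk when done" idea concrete, here is a minimal hand-rolled sketch using plain std::thread and a shared atomic counter. The function name sum_dynamic, the summing workload, and the chunk size are all made up for illustration; a real thread pool or OpenMP runtime does something along these lines, but not necessarily exactly this.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: sums data using n_threads workers that each claim
// `chunk` elements at a time from a shared atomic counter.
std::uint64_t sum_dynamic(const std::vector<std::uint32_t>& data,
                          unsigned n_threads, std::size_t chunk) {
    assert(n_threads > 0 && chunk > 0);
    std::atomic<std::size_t> next{0};              // start of the next unclaimed chunk
    std::vector<std::uint64_t> partial(n_threads, 0);

    auto worker = [&](unsigned id) {
        std::uint64_t local = 0;
        for (;;) {
            // Claim a chunk. A thread on a faster core just gets back here
            // sooner and ends up processing more chunks overall.
            std::size_t begin = next.fetch_add(chunk, std::memory_order_relaxed);
            if (begin >= data.size()) break;       // nothing left to claim
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) local += data[i];
        }
        partial[id] = local;                       // one write per thread at the end
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();               // join before reading partial[]

    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

No thread is assigned a fixed share up front, so it doesn't matter which threads land on P-cores or E-cores. The chunk size is a trade-off: too small and the threads contend on the counter, too large and you're back to some cores waiting on a straggler at the end.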
Alternatively, if there are lots of parallelizable tasks to be done, and later ones can start while some threads are still finishing the first, it's easy to just send work to a thread pool.
Dividing your work into 8 equal-sized parts for an 8-core CPU can be bad even on a homogeneous CPU if there's any other load: if some threads are descheduled for a while they won't finish as early as threads that ran the whole time. (Especially if the total time is on the same scale as a scheduling granule, e.g. 10 ms for Linux with HZ=100.)
So there's already reason to divide up work into moderate-size chunks for threads to consume, especially if you're using a sophisticated thread-pool system like OpenMP which can do that for you without having to write a lot of extra code.
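With OpenMP, that chunked hand-out can be sketched as a single pragma on an ordinary loop. The function name sum_openmp and the chunk size of 1024 iterations are arbitrary choices for illustration, not tuned values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper. Build with OpenMP enabled (e.g. -fopenmp for GCC/Clang);
// without it the pragma is ignored and the loop runs serially with the same result.
std::uint64_t sum_openmp(const std::vector<std::uint32_t>& data) {
    std::uint64_t total = 0;
    const long long n = static_cast<long long>(data.size());

    // schedule(dynamic, 1024): each thread takes 1024 iterations at a time and
    // comes back for more when done, instead of getting one fixed share.
    // reduction(+ : total): per-thread partial sums are combined at the end.
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        total += data[static_cast<std::size_t>(i)];
    }
    return total;
}
```

Swapping in schedule(static) would give each thread one equal-size share, which is the case where the slower or descheduled threads finish last while the others sit idle.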