performance cpu-architecture hyperthreading

How to judge if a given workload is hyper-thread friendly?


How do I tell if a certain workload is suitable for hyper-threading, other than just turning the hyper-threading option on/off and doing a direct performance test?

There are two main problems:

  1. Can the workload itself benefit from a machine with hyper-threading enabled?
  2. If the current workload and other workloads are deployed at the same time, what is the potential benefit if hyper-threading is enabled?

Can TMA or other hardware indicators be used to characterize how friendly the workload is to hyper-threading?


Solution

  • If you're considering running some parallel workload with more or fewer threads, by far the most reliable way is to just try it.


    SMT = Simultaneous Multithreading; Hyperthreading is Intel's trademark for its implementation of that idea. AMD, PowerPC, and others that have implemented it may use different names for their own versions. The computer architecture concept is the same. (But different microarchitectures have more or fewer execution units relative to the pipeline width, which can affect how well SMT typically works; see footnote 1.)

    Paul DeMone's article on realworldtech from 2000 about Alpha EV8 is a good introduction to SMT as a computer-architecture concept. (Alpha EV8 was cancelled before release; many of the engineers who worked on it were hired by Intel and went on to work on Pentium 4 with Hyperthreading.)


    One indicator is uops/clock significantly lower than the pipeline width (the issue/rename bottleneck is the narrowest part of all pipelines), unless the bottleneck is actually a single execution unit (like the imul/popcnt unit, or FP math execution units, or something). e.g. Skylake's pipeline is 4 uops wide. (Instructions per cycle is usually about the same as uops per cycle; most instructions are 1 uop, some are more. But cmp/jcc can macro-fuse into a single uop.)
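
    As a rough sketch of that indicator, the ratio can be computed from two counter totals. The function name and the default width of 4 are illustrative assumptions (4 matches the Skylake figure above). A low value only shows that issue slots are going unused; it can't tell you whether a single execution unit is the real bottleneck.

```python
def issue_slot_utilization(uops_issued, cycles, pipeline_width=4):
    """Fraction of issue/rename slots actually used over the measured interval.

    uops_issued and cycles are raw counter totals (e.g. from perf stat);
    pipeline_width is the number of uops the core can issue per cycle
    (assumed 4 here, as for Skylake).
    """
    if cycles <= 0 or pipeline_width <= 0:
        raise ValueError("cycles and pipeline_width must be positive")
    return uops_issued / (cycles * pipeline_width)


# Hypothetical counts: 2 uops/clock on a 4-wide pipeline leaves half the slots idle.
print(issue_slot_utilization(uops_issued=2_000_000, cycles=1_000_000))
```

    Values well below 1.0 suggest there is room for a second hardware thread to fill otherwise idle slots.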

    A high branch mispredict rate is also "good" (for SMT being able to help), since one thread can be doing work while the other is recovering. Code bottlenecked by the latency of long dependency chains would also benefit a lot, like an FP dot-product that's not unrolled with multiple accumulators (e.g. one add or FMA per 4 cycles, when the result of the previous one is ready, rather than 2 per cycle).
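
    The dot-product arithmetic can be made concrete with a deliberately simple model: each independent dependency chain can start one FMA every `latency_cycles` cycles, and the execution units cap the total at `peak_per_cycle`. The function name is hypothetical and the defaults (4-cycle latency, 2 per cycle) are just the example numbers from the paragraph above; front-end limits, loads, and cache effects are ignored.

```python
def fma_per_cycle(independent_chains, latency_cycles=4, peak_per_cycle=2):
    """Steady-state FMA throughput for a given number of independent
    dependency chains (accumulators), in a latency/throughput-only model."""
    if independent_chains < 1:
        raise ValueError("need at least one dependency chain")
    # Each chain contributes 1/latency FMAs per cycle until the
    # execution units saturate at peak_per_cycle.
    return min(independent_chains / latency_cycles, peak_per_cycle)


one_thread = fma_per_cycle(1)    # single accumulator: latency-bound
two_threads = fma_per_cycle(2)   # two SMT threads, one chain each
unrolled = fma_per_cycle(8)      # 8 accumulators in one thread: throughput-bound
print(one_thread, two_threads, unrolled)
```

    In this model a second hardware thread doubles throughput for the single-accumulator loop, while the fully unrolled version has nothing left for SMT to recover.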

    On Linux, with perf for hardware performance counters, the following command will count many of the relevant events. uops_issued.any exists on Skylake and probably other Intel CPUs; other microarchitectures will probably use different names. The other event names are generic ones that perf maps to the appropriate hardware event on whatever CPU it's running on:

    perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,uops_issued.any   ./a.out
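
    If you'd rather post-process the counts than read them by eye, perf stat can emit CSV with `-x,`. The sketch below assumes the usual field order for that mode (count first, event name third) and plain event names with no suffix such as `:u`; the function names and the sample text are made up for illustration, not real measurements.

```python
def parse_perf_csv(text):
    """Map event name -> count from `perf stat -x,` style output.

    Assumes each data line looks like: value,unit,event,...
    Lines whose first field is not a number (comments, "<not counted>",
    "<not supported>") are skipped.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue
        counts[fields[2]] = value
    return counts


def basic_ratios(counts):
    """Instructions per cycle and branch mispredict rate from parsed counts."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "branch_miss_rate": counts["branch-misses"] / counts["branches"],
    }


# Illustrative sample only; the numbers are invented.
sample = """\
1000,,cycles,500000,100.00,,
1500,,instructions,500000,100.00,1.50,insn per cycle
200,,branches,500000,100.00,,
10,,branch-misses,500000,100.00,5.00,of all branches
<not supported>,,uops_issued.any,0,100.00,,
"""
print(basic_ratios(parse_perf_csv(sample)))
```

    A low IPC together with a high branch mispredict rate would point toward the SMT-friendly cases described above.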
    

    Cache misses can go either way: two threads per core competing for the same L1d / L2 cache (and more threads competing for shared L3) can end up making things worse. But if there are a lot of stalls due to cache miss latency, SMT is good at hiding that.

    Counters like cycle_activity.stalls_l2_miss (execution stalls while an L2-miss demand load is outstanding) may be good indicators, although they can't predict whether you'll have a lot more cache misses with 2 threads per core. resource_stalls.any counts cycles where the back end is the problem: the front end has uops ready to issue/rename, but the back end can't accept them because the ROB (Reorder Buffer) or RS (Reservation Station = scheduler) is full.


    SMT can still help somewhat in workloads that already have fairly high throughput (in uops/clock). For example, video encoding with x265 runs over 3 uops/clock on my Skylake i7-6700k, but still gets about a 15% speedup from using 8 threads instead of 4.

    But limiting it to just 4 threads (the number of physical cores) is significantly better for interactive use of my Linux desktop while it's running: loading web pages in chromium feels sluggish when a video encode job is keeping all 8 logical cores busy. That might be due to it using quite a bit more memory bandwidth and having a larger L3 cache footprint (with different threads working on different frames, more 1080p frames in flight means a larger memory footprint).

    But it also might be due to scheduling of threads for the web browser and X server having to wait for a free core, instead of there always being a free logical core already when only 4 heavily-active threads are running. (Linux's scheduler is SMT-aware, so the 4 encode threads run on separate physical cores the majority of the time.) I haven't investigated by e.g. turning off hyperthreading in the BIOS or using taskset to limit a web browser (and X server?) to the same 4 cores.
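
    For anyone who wants to try the pinning experiment, here is a minimal sketch of picking one logical CPU per physical core from the Linux sysfs topology files. The helper names are hypothetical, and it assumes each `thread_siblings_list` file holds a CPU list such as `0,4` or `0-1`.

```python
import glob


def parse_cpu_list(text):
    """Parse a CPU list such as '0,4' or '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def first_sibling_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU of each physical core.

    sibling_lists holds one sibling-list string per logical CPU, so each
    core appears once per hardware thread; the set removes the duplicates.
    """
    parsed = (parse_cpu_list(s) for s in sibling_lists)
    return sorted({cpus[0] for cpus in parsed if cpus})


def read_sibling_lists():
    """Read every logical CPU's thread_siblings_list from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists
```

    The result of `first_sibling_per_core(read_sibling_lists())` could then be passed to `taskset -c` as a comma-separated list, or to Python's `os.sched_setaffinity`, to confine a process to one hardware thread per core.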


    Footnote 1: For example, a mix of SIMD/FP and integer threads on AMD Zen CPUs should do well; they use separate scheduling queues for the different domains and have lots of execution ports. So a mix of integer and SIMD/FP work can make good use of both schedulers, although the total ROB size has to be shared between logical cores. (Usually statically partitioned.)

    By comparison, early P4 CPUs were so weak, especially with their small L1i trace cache, that it was usually a slowdown to use hyperthreading for computationally intensive code. And unlike modern CPUs, their hardware prefetching was poor enough that it sometimes helped to run a thread that prefetches an array the other logical core is looping over and doing computation on. Even in cases where software prefetch can be helpful, running a prefetch thread is not a thing anymore. (See "How much of ‘What Every Programmer Should Know About Memory’ is still valid?")