Tags: c++, performance, benchmarking, ram, timing

How can I efficiently execute multiple C++ benchmarking algorithms using Windows cmd


I'm currently benchmarking algorithms in C++ in a Windows environment and am seeking advice on best practices for conducting accurate performance evaluations.

My setup involves using Windows cmd to compile and run executable files for benchmarking, where each executable runs the same algorithm but with different parameters. However, I'm concerned about potential inaccuracies that may arise from running multiple cmd instances simultaneously to execute different benchmarks concurrently.

Here are specific points I'd like guidance on:

Running the benchmarks one after another and waiting for each to complete before launching the next is getting tedious. Since each benchmarking process uses only about 15% of the CPU, I'm considering running multiple benchmarks simultaneously across several cmd instances. This would speed up the benchmarking process and make better use of system resources.

I would appreciate insights and recommendations from experienced programmers who have expertise in benchmarking algorithms in C++ on Windows platforms. Thank you in advance for your assistance.


Solution

  • There are a lot of ways this can make your benchmark numbers noisier and less repeatable. I wouldn't recommend it unless you know exactly what you're doing and what kind of workloads you're benchmarking, and/or unless you want a rough approximation of per-thread performance on a busy system.


    If any of your code has significant L2 cache misses (i.e. a cache footprint greater than about 256 KiB to 1 MiB or so, depending on your CPU), the benchmarks will compete with each other for L3 cache space and DRAM bandwidth. (L3 bandwidth scales pretty well with multiple readers/writers, but DRAM bandwidth doesn't on typical desktop/laptop CPUs; a single core can nearly saturate the memory controllers.)

    Benchmarking on an otherwise-idle system is much easier to make consistent and repeatable, but may or may not be representative of conditions in a real multi-threaded program.

    If you do want to benchmark your code when there's competition for shared resources from other cores, it would probably be best to write a simple load-generator that e.g. loops over a large or medium-sized array at a certain speed to generate some amount of L3 and/or DRAM traffic. (Or use NT stores like _mm_stream_si128 to generate DRAM traffic without an L3 cache footprint.) This should be much more consistent than having each benchmark compete for resources against whatever other code your script happens to be benchmarking at the same time.
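    As a rough sketch of what I mean by a load generator (the buffer size and the use of _mm_stream_si128 here are illustrative assumptions; tune them to the kind of pressure you want to simulate):

    ```cpp
    // load_gen.cpp - background load generator (sketch).
    // NT stores (_mm_stream_si128) bypass the cache, so this generates
    // DRAM write traffic without also polluting L3. Loop over a smaller
    // array with plain loads/stores instead if you want L3 pressure.
    #include <immintrin.h>
    #include <cstddef>
    #include <vector>

    int main() {
        // ~64 MiB: comfortably larger than a typical L3, so it's all DRAM traffic.
        constexpr std::size_t N = 64 * 1024 * 1024 / sizeof(__m128i);
        std::vector<__m128i> buf(N);
        const __m128i v = _mm_set1_epi32(1);

        for (;;) {                                  // run until killed
            for (std::size_t i = 0; i < N; ++i)
                _mm_stream_si128(&buf[i], v);       // non-temporal store straight to DRAM
            _mm_sfence();                           // flush the NT store buffers
            // Insert a Sleep()/pause here if you want to throttle to a target bandwidth.
        }
    }
    ```

    Run one or more copies of that in the background while benchmarking, instead of letting your benchmarks interfere with each other in uncontrolled ways.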

    See also Idiomatic way of performance evaluation? for other microbenchmarking gotchas, like CPU frequency warm-up, and touching memory first to avoid page faults in the timed region. (Or make the repeat loop long enough to amortize any startup work if you want to just time or profile the whole executable, e.g. with Linux perf stat, which can also measure things other than time, such as branch mispredicts and cache misses.)
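    For the "touch memory first" and "amortize startup" points, the timing pattern I'd use looks roughly like this (work() here is just a stand-in for whatever you're actually measuring):

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Stand-in for the code under test.
    static int work(const std::vector<int>& data) {
        int sum = 0;
        for (int x : data) sum += x;
        return sum;
    }

    int main() {
        std::vector<int> data(1'000'000);

        // Touch the memory before the timed region so page faults and
        // zero-fill costs don't get charged to the first iteration.
        for (auto& x : data) x = 1;

        // A warm-up pass also gives the CPU time to ramp up to its boost clock.
        volatile int sink = work(data);

        constexpr int reps = 100;     // enough repeats to amortize startup noise
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i) {
            data[0] = i;              // defeat hoisting of the call out of the loop
            sink = work(data);
        }
        auto t1 = std::chrono::steady_clock::now();

        std::chrono::duration<double, std::milli> per_rep = (t1 - t0) / reps;
        std::printf("avg per rep: %.3f ms\n", per_rep.count());
        (void)sink;
    }
    ```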


    Speaking of CPU frequency, most CPUs can boost the CPU clock higher (turbo) when only one or two cores are active vs. when most of them are. Boost clocks can also depend on overall thermal and power limits, so laptops might be especially sensitive to this. Counting core clock cycles instead of time can be useful for code that doesn't sleep or wait on I/O, but Windows doesn't make that easy. (Unlike on Linux where perf is available, except in a VM.)
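    The closest built-in thing I'm aware of on Windows is QueryThreadCycleTime, but as far as I know it's typically derived from the TSC, which on modern CPUs ticks at a constant reference rate, so it doesn't really count core clock cycles either and turbo still skews it. A minimal sketch:

    ```cpp
    // Sketch: per-thread "cycle" accounting on Windows.
    // Caveat: the counter is typically TSC-based (constant reference rate),
    // not true core-clock cycles; see the text above.
    #include <windows.h>
    #include <cstdio>

    int main() {
        ULONG64 start = 0, stop = 0;
        QueryThreadCycleTime(GetCurrentThread(), &start);

        volatile unsigned long long acc = 0;
        for (unsigned i = 0; i < 100'000'000u; ++i)
            acc = acc + i;                          // dummy work to measure

        QueryThreadCycleTime(GetCurrentThread(), &stop);
        std::printf("thread cycle time: %llu\n", stop - start);
    }
    ```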

    Also, if your CPU has SMT (e.g. Hyperthreading) with multiple logical cores per physical core, Windows might end up scheduling multiple tasks on sibling logical cores so they're competing for execution resources inside the CPU. (Especially if you run more tasks than physical cores, but even without that, Windows' scheduler bounces threads around a lot, last I heard.)

    Depending on the task, sharing a physical core could make them run close to half speed. (Hopefully not that bad most of the time, if their bottlenecks include instruction latency, branch mispredicts, and maybe cache miss latency, rather than instruction throughput or memory bandwidth. Depending on the workload you're sharing a core with, there might be a lot more than half the front-end bandwidth available, or there might not, so it depends even more on the nature of the workload.) If some runs have two tasks sharing a physical core and some don't, the benchmark numbers will be wildly different. If you want to test SMT-friendliness between two tasks, pin them to sibling cores to benchmark them together intentionally.
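    To pin a benchmark to one specific logical core (either to keep it from bouncing around, or to deliberately pair it with a task on a sibling core), you can set the thread's affinity mask. Which logical-core numbers are SMT siblings depends on your CPU's topology, so the mask below is just an example:

    ```cpp
    // Sketch: pin the current thread to one logical core.
    // The mask is for illustration; check your topology (e.g. with
    // Sysinternals Coreinfo) to find which logical cores are SMT siblings.
    #include <windows.h>
    #include <cstdio>

    int main() {
        DWORD_PTR mask = DWORD_PTR(1) << 2;   // logical core 2 (example only)
        if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
            std::fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }
        // ... run the benchmark loop here ...
    }
    ```

    From cmd you can do the same thing without touching the code, e.g. start /affinity 4 bench.exe (bench.exe standing in for your executable; the argument is a hex affinity mask, here logical core 2). The same approach works for choosing a P-core or an E-core on hybrid CPUs, as discussed below.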

    On hybrid CPUs like Intel Alder Lake or Apple M1 with a mix of Performance and Efficiency cores, running more tasks than you have P cores will lead to some being scheduled to slower E cores. For consistent benchmarking you probably want to pin single-threaded tasks to either a P core or an E core to see how it performs separately on each of those. (Also, Intel E-cores share a big L2 between a cluster of 4 E-cores, with only the L1 caches being fully private. https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/)
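    If you want to find out programmatically which cores are P-cores vs. E-cores (and which logical cores share a physical core), GetLogicalProcessorInformationEx reports an EfficiencyClass per core. A sketch; the interpretation of the class values is my assumption based on how hybrid Intel parts report it (E-cores as class 0, P-cores a higher value):

    ```cpp
    // Sketch: enumerate physical cores with their efficiency class and affinity mask.
    // Assumption: E-cores report EfficiencyClass 0 and P-cores a higher value on
    // hybrid CPUs; on homogeneous CPUs every core is class 0.
    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main() {
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
        std::vector<char> buf(len);
        if (!GetLogicalProcessorInformationEx(
                RelationProcessorCore,
                reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data()),
                &len)) {
            std::fprintf(stderr, "GetLogicalProcessorInformationEx failed: %lu\n", GetLastError());
            return 1;
        }
        for (char* p = buf.data(); p < buf.data() + len; ) {
            auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(p);
            std::printf("core: EfficiencyClass=%u  SMT=%s  mask=0x%llx\n",
                        (unsigned)info->Processor.EfficiencyClass,
                        (info->Processor.Flags & LTP_PC_SMT) ? "yes" : "no",
                        (unsigned long long)info->Processor.GroupMask[0].Mask);
            p += info->Size;
        }
    }
    ```

    The masks it prints can then be fed to SetThreadAffinityMask or start /affinity to pin a benchmark to a P-core or an E-core specifically.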

    With just one task running, it should quickly end up on a P-core at max turbo even if you don't do anything special. (Assuming it doesn't sleep or wait for I/O.)