Tags: linux, performance, intel, intel-vtune, intel-pmu

VTune: Accuracy of Intel sampling drivers when a VTune measurement is run on a machine running other tasks


I have a recent Coffee Lake machine that is primarily used as a storage server. The average load on each of its 4 cores is around 5-10% when only the storage server is running.

I want to run VTune measurements of a workload on this machine using the Intel sampling drivers. However, I'm doubtful whether the measurements will be accurate, given that the storage server application is running concurrently.

Intel's documentation says the sampling drivers are installed into the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the Intel sampling drivers work? Are they able to distinguish between the workload process and other processes running on the system?


Solution

  • If VTune works like the Linux perf_events kernel subsystem that perf (and user-space libraries such as PAPI) is built on, it basically saves/restores the HW event counter registers on a context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected, and any effect on other events will reflect real impacts, such as extra cache misses.

    (The basic mechanism for HW performance events is that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt for the driver to collect the count. For perf record-style functionality, perf or VTune programs a counter to count down so it triggers an interrupt at a regular interval, and samples the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data rather than the cache-missing load itself. But the key point is that the inside-the-core events are strictly per-core. The uncore events count activity in shared resources like the L3 cache, so they are more easily disturbed by system load. A minimal perf_event_open sketch at the end of this answer illustrates the per-process counter attribution.)

    Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)

    So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might differ a bit from what would happen on a fully idle system, but probably not by much, especially if the workload is single-threaded, or if you can limit it to fewer than all of your cores so there's at least one left for the OS to schedule other tasks onto (see the core-pinning sketch below).
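
This is not what VTune's own sampling driver does internally, but a minimal sketch of the kernel perf_events interface shows the per-process attribution described above: a counter opened for pid 0 is saved and restored across context switches by the kernel, so it only accumulates events for this process, regardless of what else the machine is running. The event choice (retired instructions, user-space only) and the busy loop are just placeholders.

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* perf_event_open has no glibc wrapper, so call it via syscall(). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                   /* count user-space only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: count for *this* process on whatever core it runs on.
       The kernel virtualizes the counter across context switches, so events
       from other processes are not included in the total. */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure; a placeholder loop here ... */
    volatile uint64_t x = 0;
    for (int i = 0; i < 1000000; i++) x += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("instructions retired (user-space, this process only): %llu\n",
               (unsigned long long)count);
    close(fd);
    return 0;
}
```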
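
And if you want to make sure the profiled workload stays out of the way of the storage server's threads (and vice versa), you can pin it to a specific core. The core number 0 below is an arbitrary choice for illustration; the same effect can be had from the shell with taskset.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Pin this process to core 0, leaving the other cores free for the
       background load and for interrupt handling. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the workload to be profiled here ... */
    return 0;
}
```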