c++linuxoptimizationx86perf

Perf shows userspace cycles for cores which shouldn't be executing instructions


I have a test application like this:

int main()
{
    // calls sched_setaffinity() to set affinity to core 0

    while(true)
    {
    }

    return 0;
}

I have 4x logical cores across 2x physical cores.

I would like to see perf event counters but only for the CPU core my app uses.

I run this Perf command (killing it after a few seconds):

sudo perf stat -e cycles:u --cpu=0 --delay=1000 ./app

four times, each time changing the cpu id, it shows cycles:u > 0.

How can all four cpus be executing userspace cycles for an application which pins to core 0?

Surely three of the cores should not be executing userspace cycles for my application?


Solution

  • It appears the docs for perf stat --cpu= are wrong or misleading when they say The -a option is still necessary to activate system-wide monitoring.

    It seems to behave identically with or without -a, so the counts are from other processes, like GUI animations, cursor blink, etc. Or for cycles:k, from any interrupt handler regardless of which process is the kernel's current on that core.


    I tested this on Arch GNU/Linux system with kernel and perf 6.5. The smoking gun here is running another infinite-loop process pinned to a different core, and seeing cycles:u on that core the same as the core running the command from the perf command-line.

    Baseline without any funky options, with awk 'BEGIN{for(i=0;i<100000000;i++){}}' as the command being profiled. (It keeps a CPU core busy for a few seconds, making no system calls except at start and exit, and having a small cache footprint.)

    $ taskset -c 1 perf stat -e task-clock,cycles:u  --delay=1000   awk 'BEGIN{for(i=0;i<100000000;i++){}}'
    Events disabled
    Events enabled
    
     Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':
    
              2,625.29 msec task-clock                       #    1.000 CPUs utilized             
        10,226,974,370      cycles:u                                                              
    
           2.625290561 seconds time elapsed
    

    vs. with --cpu, two signs of not limiting itself to the awk process: task-clock is 8x longer than the elapsed time. And the header it prints: Performance counter stats for 'CPU(s) 0-7': rather than for our process.

    taskset -c 1 perf stat -e task-clock,cycles:u --cpu=0-7  --delay=1000   awk 'BEGIN{for(i=0;i<100000000;i++){}}'
    Events disabled
    Events enabled
    
     Performance counter stats for 'CPU(s) 0-7':
    
             23,621.12 msec task-clock                       #    8.000 CPUs utilized             
        12,747,549,694      cycles:u                                                              
    
           2.952544251 seconds time elapsed
    

    And with -A aka --no-aggr to not aggregate across CPUs, each one has task-clock running for the full interval even though our process definitely isn't on it (since it and perf itself started pinned to CPU 1):

    $ taskset -c 1 perf stat -e task-clock,cycles:u --cpu=0-7 -A  --delay=1000   awk 'BEGIN{for(i=0;i<100000000;i++){}}'
    Events disabled
    Events enabled
    
     Performance counter stats for 'CPU(s) 0-7':
    
    CPU0             2,605.76 msec task-clock                       #    1.000 CPUs utilized             
    CPU1             2,605.75 msec task-clock                       #    1.000 CPUs utilized             
    CPU2             2,605.76 msec task-clock                       #    1.000 CPUs utilized             
    CPU3             2,605.75 msec task-clock                       #    1.000 CPUs utilized             
    CPU4             2,605.78 msec task-clock                       #    1.000 CPUs utilized             
    CPU5             2,605.78 msec task-clock                       #    1.000 CPUs utilized             
    CPU6             2,605.78 msec task-clock                       #    1.000 CPUs utilized             
    CPU7             2,605.78 msec task-clock                       #    1.000 CPUs utilized             
    CPU0          162,311,044      cycles:u                                                              
    CPU1       10,153,813,410      cycles:u                                                              
    CPU2           12,114,460      cycles:u                                                              
    CPU3           49,619,566      cycles:u                                                              
    CPU4          180,412,777      cycles:u                                                              
    CPU5           51,045,799      cycles:u                                                              
    CPU6          132,275,644      cycles:u                                                              
    CPU7          344,942,273      cycles:u                                                              
    
           2.605703012 seconds time elapsed
    

    And with a dummy load (like another awk process, or in this case I ran program that just loops on pause instruction, which isn't special other than using less power), we see it uses as many cycles core CPU 2 as awk did on CPU 1.

    $ taskset -c 2 timeout 10m ~/src/SO/pauseloop &
        # actually on another terminal
    $ taskset -c 1 perf stat -e task-clock,cycles:u --cpu=0-7 -A  --delay=1000   awk 'BEGIN{for(i=0;i<100000000;i++){}}'
    Events disabled
    Events enabled
    
     Performance counter stats for 'CPU(s) 0-7':
    
    CPU0             2,721.71 msec task-clock                       #    1.000 CPUs utilized             
    CPU1             2,721.71 msec task-clock                       #    1.000 CPUs utilized             
    CPU2             2,721.71 msec task-clock                       #    1.000 CPUs utilized             
    CPU3             2,721.71 msec task-clock                       #    1.000 CPUs utilized             
    CPU4             2,721.70 msec task-clock                       #    1.000 CPUs utilized             
    CPU5             2,721.70 msec task-clock                       #    1.000 CPUs utilized             
    CPU6             2,721.69 msec task-clock                       #    1.000 CPUs utilized             
    CPU7             2,721.69 msec task-clock                       #    1.000 CPUs utilized             
    CPU0          861,492,255      cycles:u                                                              
    CPU1       10,485,755,009      cycles:u                                                              
    CPU2       10,485,655,861      cycles:u                                                              
    CPU3          559,708,162      cycles:u                                                              
    CPU4        1,207,134,025      cycles:u                                                              
    CPU5          556,142,162      cycles:u                                                              
    CPU6          623,020,558      cycles:u                                                              
    CPU7        3,721,131,901      cycles:u                                                              
    
           2.721620976 seconds time elapsed
    

    Note the 10,485,755,009 vs. 10,485,655,861 counts for cycles:u on cores 1 and 2. The same to within one part per 10k. (At 3.9GHz on my i7-6700k, and the absolute difference is about 1 million, or about 250 nanoseconds, which could easily be accounted for by differences in starting / stopping counting on different cores.)


    -a --cpu only changes the header message, not counts

    Adding -a to any of these (except the one without --cpu) changes nothing except the header message, to Performance counter stats for 'system wide': instead of 'CPU(s) 0-7'

    Everything I'm seeing is consistent with --cpu actually counting everything happening on that core wile perf stat is running, just like -a mode.