linuxperformanceoptimizationx86perf

Perf stat doesn't count events when using delay switch


I am trying to profile my application 20 seconds after starting because I'm not interested in the initialization stage.

So I used perf with the delay (microseconds) switch:

sudo perf stat --delay=20000 -e page-faults my_app

Initially I was getting the error:

Some events weren't counted. Try disabling the NMI watchdog:
    echo 0 > /proc/sys/kernel/nmi_watchdog
    perf stat ...
    echo 1 > /proc/sys/kernel/nmi_watchdog

So I tried this:

echo 0 | sudo tee  /proc/sys/kernel/nmi_watchdog

The error disappeared but Perf isn't counting the event:

 <not counted>      page-faults                                                                                     

I then tried again, removing the delay switch:

sudo perf stat -e page-faults my_app

and it works.

When using the delay switch, it does output events enabled.

Am I using this switch wrong/how can I begin profiling after N seconds?


Solution

  • Works for me with perf version 6.3-4 on x86-64 Arch GNU/Linux, with Linux kernel 6.4.9-arch1-1. In case the CPU matters for software events, it's an i7-6700k Skylake.

    Did your code exit before the 20 second delay was over? In that case, no events get counted for any counter. I can reproduce your symptoms with a busy loop that exits after about 3.6 seconds on my system:

    perf stat --delay=20000 -e task-clock,cycles,page-faults  \
        awk 'BEGIN{for(i=0;i<100000000;i++){}}'
    

    It prints Events disabled, then Events enabled right as it's exiting, but <not counted> for both the hardware (cycles) and software (task-clock and page-faults) events.

    Reducing the delay to 2000ms (2 seconds), it does count for the second ~half of the 3.6 second run time:

    $ perf stat --delay=2000 -e task-clock,cycles,page-faults  awk 'BEGIN{for(i=0;i<100000000;i++){}}'
    Events disabled
    Events enabled
    
     Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':
    
              1,591.87 msec task-clock                       #    1.000 CPUs utilized             
         6,208,241,402      cycles                           #    3.900 GHz                       
                     6      page-faults                      #    3.769 /sec                      
    
           1.591902071 seconds time elapsed
    
           3.590486000 seconds user
           0.000000000 seconds sys
    

    awk is single-threaded, but note the 1.59 sec elapsed vs. 3.59 sec "user" time: 2 seconds less, the time during which it wasn't counting. I guess the user/sys times come from OS per-process counters, rather than perf looking at the wall-clock.