profilinggperftools

How to use google perf tools


I just started using google performance tools (google-perftools and libgoogle-perftools4 packages in ubuntu), I swear that I'm googling for around a day and I didn't find an answer!! The problem is that I do not get the result for ALL of my functions with CPU profiling. This is my code:

#include "gperftools/profiler.h"
#include <iostream>
#include <math.h>
using namespace std;
void bar()
{
        int a,b,c,d,j,k;
        a=0;
        int z=0;
        b = 1000;
        while(z < b)
        {
                while (a < b)
                {
                        d = sin(a);
                        c = cos(a);
                        j = tan(a);
                        k = tan(a);
                        k = d * c + j *k;
                        a++;
                }
                a = 0;
                z++;
        }
}
void foo()
{
        cout << "hey " << endl;
}

int main()
{
        ProfilerStart("/home/mohammad/gperf/dump.txt");

        int a = 1000;
        while(a--){foo();}
        bar();
        ProfilerFlush();
        ProfilerStop();
}

Compiled as g++ test.cc -lprofiler -o a.out

this is how I run the code:

CPUPROFILE=dump.txt ./a.out

I've also tried this:

CPUPROFILE_FREQUENCY=10000 LD_PRELOAD=/usr/local/lib/libprofiler.so.0.3.0 CPUPROFILE=dump.txt ./a.out

And this is what I get from google-pprof --text a.out dump.txt:

Using local file ./a.out.
Using local file ./dump.txt.
Total: 22 samples
8  36.4%  36.4%        8  36.4% 00d8cb04
6  27.3%  63.6%        6  27.3% bar
3  13.6%  77.3%        3  13.6% __cos (inline)
2   9.1%  86.4%        2   9.1% 00d8cab4
1   4.5%  90.9%        1   4.5% 00d8cab6
1   4.5%  95.5%        1   4.5% 00d8cb06
1   4.5% 100.0%        1   4.5% __write_nocancel
0   0.0% 100.0%        3  13.6% __cos

But there is no information about the foo function!

my system information: ubuntu 12.04 g++ 4.6.3

Thats all!


Solution

  • TL;DR: foo is to fast and small to get profiling events, run it 100 more times. Frequency setting was with typo, and pprof will not sample more often than CONFIG_HZ (usually 250). It is better to switch to more modern Linux perf profiler (tutorial from its authors, wikipedia).

    Long version:

    Your foo function is just too short and simple - just call two functions. Compiled the test with g++ test.cc -lprofiler -o test.s -S -g, with filtering of test.s with c++filt program to make c++ names readable:

    foo():
    .LFB972:
            .loc 1 27 0
            pushq   %rbp
            movq    %rsp, %rbp
            .loc 1 28 0
            movl    $.LC0, %esi
            movl    std::cout, %edi
            call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)
            movl    std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&), %esi
            movq    %rax, %rdi
            call    std::basic_ostream<char, std::char_traits<char> >::operator<<(std::basic_ostream<char, std::char_traits<char> >& (*)(std::basic_ostream<char, std::char_traits<char> >&))
            .loc 1 29 0
            popq    %rbp
            ret
    .LFE972:
            .size   foo(), .-foo()
    

    So, to see it in the profile you should run foo for more times, changing int a = 1000; in main to something much greater, like 10000 or better 100000 (as did I for the test).

    Also you may fix the incorrect "CPUPROFILE_FREQUENC=10000" to correct CPUPROFILE_FREQUENCY (note the Y). I should say that 10000 is too high setting for CPUPROFILE_FREQUENCY, because it usually may only generate 1000 or 250 events per second depending on kernel configuration CONFIG_HZ (most 3.x kernels have 250, check grep CONFIG_HZ= /boot/config*). And default setting for CPUPROFILE_FREQUENCY in pprof is 100.

    I tested different values of CPUPROFILE_FREQUENCY: 100000, 10000, 1000, 250 with bash script on Ubuntu 14.04

    for a in 100000 100000 10000 10000 1000 1000 300 300 250 250; do 
       echo -n "$a "; 
       CPUPROFILE_FREQUENCY=$a CPUPROFILE=dump$a.txt ./test >/dev/null;
    done
    

    And the results were the same 120-140 events and runtime of every ./test around 0.5 seconds, so cpuprofiler from google-perftools can't do more events per second for single thread, than CONFIG_HZ set in kernel (my has 250).

    100000 PROFILE: interrupts/evictions/bytes = 124/1/6584
    100000 PROFILE: interrupts/evictions/bytes = 134/0/7864
    10000 PROFILE: interrupts/evictions/bytes = 125/0/7488
    10000 PROFILE: interrupts/evictions/bytes = 123/0/6960
    1000 PROFILE: interrupts/evictions/bytes = 134/0/6264
    1000 PROFILE: interrupts/evictions/bytes = 125/2/7272
    300 PROFILE: interrupts/evictions/bytes = 137/2/7984
    300 PROFILE: interrupts/evictions/bytes = 126/0/7216
    250 PROFILE: interrupts/evictions/bytes = 123/3/6680
    250 PROFILE: interrupts/evictions/bytes = 137/2/7352
    

    With original a=1000 foo and cout's functions runs too fast to get any profiling event (even on 250 events/s) on them in every run, so you have no foo, nor any Input/Output functions. In small amount of runs, the __write_nocancel may got sampling event, and then foo and I/O functions form libstdc++ will be reported (somewhere not in the very top, so use --text option of pprof or google-pprof) with zero self event count, and non-zero child event count:

     ....
       1   0.9%  99.1%        1   0.9% __write_nocancel
     ....
       0   0.0% 100.0%        1   0.9% _IO_new_file_overflow
       0   0.0% 100.0%        1   0.9% _IO_new_file_write
       0   0.0% 100.0%        1   0.9% __GI__IO_putc
       0   0.0% 100.0%        1   0.9% foo
       0   0.0% 100.0%        1   0.9% new_do_write
       0   0.0% 100.0%        1   0.9% std::endl
       0   0.0% 100.0%        1   0.9% std::ostream::put
    

    With a=100000, foo is still too short and fast to get own events, but I/O functions got several. This is list I grepped from long --text output:

      34  24.6%  24.6%       34  24.6% __write_nocancel
    
       1   0.7%  95.7%       35  25.4% __GI__IO_fflush
       1   0.7%  96.4%        1   0.7% __GI__IO_putc
       1   0.7%  97.8%        2   1.4% std::operator<< 
       1   0.7%  98.6%       36  26.1% std::ostream::flush
       1   0.7%  99.3%        2   1.4% std::ostream::put
       0   0.0% 100.0%       34  24.6% _IO_new_file_sync
       0   0.0% 100.0%       34  24.6% _IO_new_file_write
       0   0.0% 100.0%       40  29.0% foo
    
       0   0.0% 100.0%       34  24.6% new_do_write
    
       0   0.0% 100.0%        2   1.4% std::endl
    

    Functions with zero own counters seen only thanks to pprof capability of reading call chains (it knows who calls the functions which got sample, if frame info is not omitted).

    I can also recommend more modern, more capable (both software and hardware events, up to 5 kHz frequency or more; both user-space and kernel profiling) and better supported profiler, the Linux perf profiler (tutorial from its authors, wikipedia).

    There are results from perf with a=10000:

    $ perf record  ./test3  >/dev/null
     ... skip some perf's spam about inaccessibility of kernel symbols 
     ... note the 3 kHz frequency here VVVV
    Lowering default frequency rate to 3250. 
    Please consider tweaking /proc/sys/kernel/perf_event_max_sample_rate.
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.078 MB perf.data (~3386 samples) ]
    

    To see text report from perf.data output file I'll use less (because perf report by default starts interactive profile browser):

    $ perf report |less
    ... skip some extra info about the machine, kernel, and perf starting command
    # Samples: 1K of event 'cycles'
    # Event count (approx.): 1155264208
    # Overhead   Command   Shared Object          Symbol
        41.94%    test3  libm-2.19.so         [.] __tan_sse2                                                                                                                                                                    
        16.95%    test3  libm-2.19.so         [.] __sin_sse2    
        13.40%    test3  libm-2.19.so         [.] __cos_sse2                                                                                                                                                                    
         4.93%    test3  test3                [.] bar()                                                                                                                                                                         
         2.90%    test3  libc-2.19.so         [.] __GI___libc_write    
    ....
         0.20%    test3  test3                [.] foo()                                                                                                                                                                         
    

    Or perf report -n | less to see raw event (sample) counters:

    # Overhead       Samples  Command        Shared Object 
        41.94%           663    test3  libm-2.19.so         [.] __tan_sse2                                                                                                                                                                    
        16.95%           268    test3  libm-2.19.so         [.] __sin_sse2   
        13.40%           212    test3  libm-2.19.so         [.] __cos_sse2                                                                                                                                                                    
         4.93%            78    test3  test3                [.] bar()                                                                                                                                                                         
         2.90%            62    test3  libc-2.19.so         [.] __GI___libc_write     
     ....
         0.20%             4    test3  test3                [.] foo()