Tags: c, kernel, packet, perf, sendto

Perf Profiler Reporting Excess Use of "do_syscall_64"


I'm attempting to optimize a program I wrote which aims to replicate network flows by sending packets to a specified MAC address.

The main loop of my program that is responsible for the sending and removal of flows is as follows:

while (size != 0 || response) {
    for (i = 0; size != 0 && i < size; ++i) {
        curFlow = *pCurFlow;
        while (curFlow.cur_time < now) {
            // Sending packet
            sendto(sockfd, curFlow.buff, curFlow.length, 0,
                   memAddr, sAddrSize);

            // Adjusting packet attributes
            curFlow.packets_left -= 1;
            curFlow.cur_time += curFlow.d_time;

            // If the flow has no packets left, unlink it from the list
            if (!curFlow.packets_left) {
                pCurFlow->last->next = pCurFlow->next;
                pCurFlow->next->last = pCurFlow->last;
                size -= 1;
                break;
            }
        }
        *pCurFlow = curFlow;
        pCurFlow = pCurFlow->next;
    }
}

I've begun using the perf profiler to record what sort of function calls I'm making and how expensive each overhead is. However, every time I ask perf to give me a report, the outcome looks like:

Overhead  Command    Shared Object  Symbol
  15.34%  packetize  /proc/kcore    0x7fff9c805b73 k [k] do_syscall_64
   6.19%  packetize  /proc/kcore    0x7fff9d20214f k [k] syscall_return_via_sysret
   5.98%  packetize  /proc/kcore    0x7fff9d1a3de6 k [k] _raw_spin_lock
   5.29%  packetize  /proc/kcore    0x7fffc0512e9f k [k] mlx4_en_xmit
   5.26%  packetize  /proc/kcore    0x7fff9d16784d k [k] packet_sendmsg

(Note: "packetize" is the name of my program)

My question is: what the heck is "do_syscall_64"? After conducting some research, it seems this particular function is a kernel routine used to service an interrupt request.

Furthermore, I've found that the file /proc/kcore is involved in some aspects of memory management. However, when I deliberately loaded my program with extra memory references, the only overhead that increased in perf report was for the dynamic library my program uses.

Please let me know if you have any advice for me. Thank you!


Solution

  • It's not an interrupt request; it's the C function, called from the syscall entry point, that dispatches to the C function implementing whichever system call user-space selected via a register.

    Presumably sys_sendto in this case.
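
    To see the user-space half of that dispatch, you can invoke a system call by number yourself with glibc's syscall(2) wrapper (a minimal Linux-only sketch; sendto travels the same path under the number SYS_sendto):

    ```c
    #include <stdio.h>
    #include <sys/syscall.h>  /* SYS_getpid -- Linux-specific syscall numbers */
    #include <unistd.h>       /* syscall(), getpid() */

    int main(void)
    {
        /* syscall() puts SYS_getpid in the syscall-number register (rax on
           x86-64) and executes the syscall instruction; the kernel's entry
           asm then hands that number to do_syscall_64, which uses it to
           index the table of sys_* functions. */
        long raw = syscall(SYS_getpid);
        printf("getpid via raw syscall: %ld (libc getpid: %d)\n",
               raw, (int)getpid());
        return 0;
    }
    ```

    Every such entry into the kernel goes through the same entry/exit path, which is why do_syscall_64 and syscall_return_via_sysret show up in the profile of any syscall-heavy loop.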

    In older versions of Linux, the x86-64 syscall entry point used the system-call table of function pointers directly (e.g., as shown in this Q&A, where only the 32-bit entry points, like the one for int 0x80, used a C wrapper function).

    But with the changes for Spectre and Meltdown mitigation, the native 64-bit system call entry point (into a 64-bit kernel from 64-bit user-space) also uses a C wrapper around system-call dispatching. This allows using C macros and gcc hints to place speculation barriers before the indirect branch. The current Linux version of do_syscall_64 on GitHub is a pretty simple function; it's somewhat surprising that it's getting so many cycles itself, unless nr = array_index_nospec(nr, NR_syscalls); is much more expensive than I'd expect on your CPU.
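
    For reference, the dispatch logic amounts to something like the following paraphrase of the kernel source (simplified and not buildable outside the kernel tree; names as in Linux around v4.19):

    ```c
    /* Paraphrased sketch of arch/x86/entry/common.c:do_syscall_64() */
    __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
    {
        enter_from_user_mode();
        local_irq_enable();

        if (likely(nr < NR_syscalls)) {
            /* Clamp the index even under misspeculation (Spectre v1). */
            nr = array_index_nospec(nr, NR_syscalls);
            regs->ax = sys_call_table[nr](regs);
        }

        syscall_return_slowpath(regs);
    }
    ```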

    There's definitely expensive stuff that happens in the hand-written-asm syscall entry point, e.g. writing the MSR that flushes the branch-prediction cache. Oh, maybe lack of good branch prediction is costing extra cycles in the first C function called after that.

    System-call intensive workloads suffer a lot from Spectre / Meltdown mitigations. It might be interesting to try booting with some of them disabled, and/or with an older kernel that doesn't have that code at all.
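
    If you want to experiment with that, recent kernels accept a single switch on the kernel command line (the exact spellings vary by kernel version, so treat these as examples to check against your kernel's documentation):

    ```
    # On the kernel command line (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub):
    mitigations=off        # Linux 5.2+: disable the optional CPU mitigations

    # Older kernels use per-issue flags instead, e.g.:
    nopti nospectre_v2     # disable Meltdown page-table isolation / Spectre v2 mitigation

    # To see which mitigations are currently active:
    grep . /sys/devices/system/cpu/vulnerabilities/*
    ```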

    Meltdown / L1TF / etc. are completely fixed in the newest Intel CPUs with no performance cost, so disabling workarounds for that might give you some clue how much benefit you'd get from a brand new CPU.

    (Spectre is still a very hard problem and can't be easily fixed with a local change to the load ports. IDK what the details are of how efficient various mitigation microcode-assisted or not strategies for mitigating it are on various CPUs.)