I'm attempting to optimize a program I wrote which aims to replicate network flows by sending packets to a specified MAC address.
The main loop of my program, which is responsible for sending packets and removing completed flows, is as follows:
while (size != 0 || response) {
    for (i = 0; size != 0 && i < size; ++i) {
        curFlow = *pCurFlow;

        while (curFlow.cur_time < now) {
            // Send one packet for this flow
            sendto(sockfd, curFlow.buff, curFlow.length, 0,
                   memAddr, sAddrSize);

            // Adjust flow attributes
            curFlow.packets_left -= 1;
            curFlow.cur_time += curFlow.d_time;

            // If the flow has no packets left, unlink it from the list
            if (!curFlow.packets_left) {
                pCurFlow->last->next = pCurFlow->next;
                pCurFlow->next->last = pCurFlow->last;
                size -= 1;
                break;
            }
        }
        *pCurFlow = curFlow;
        pCurFlow = pCurFlow->next;
    }
}
I've begun using the perf profiler to record which functions my program calls and how much overhead each one contributes. However, every time I ask perf for a report, the output looks like:
Overhead Command Shared Object Symbol
15.34% packetize /proc/kcore 0x7fff9c805b73 k [k] do_syscall_64
6.19% packetize /proc/kcore 0x7fff9d20214f k [k] syscall_return_via_sysret
5.98% packetize /proc/kcore 0x7fff9d1a3de6 k [k] _raw_spin_lock
5.29% packetize /proc/kcore 0x7fffc0512e9f k [k] mlx4_en_xmit
5.26% packetize /proc/kcore 0x7fff9d16784d k [k] packet_sendmsg
(Note: "packetize" is the name of my program)
My question is: what the heck is "do_syscall_64"? After some research, it seems this function is a kernel routine used to service interrupt requests.
Furthermore, I've found that /proc/kcore is a virtual file exposing an image of the kernel's memory, although when I deliberately loaded my program with extra memory references, the only overhead that increased in the perf report was the dynamic library my program uses.
Please let me know if you have any advice for me. Thank you!
It's not an interrupt request; it's the C function called from the syscall entry point that dispatches to the appropriate C function implementing the system call selected by a register passed from user-space. Presumably sys_sendto in this case.
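To make that concrete, here is a minimal sketch of the dispatch pattern (simplified, not the kernel's actual source; the table type, NR_SYSCALLS value, and function name are illustrative):

/* Simplified sketch of syscall dispatch, NOT actual kernel code.
 * The asm entry point saves registers, then calls a C function like
 * this with the syscall number (from rax) and the saved user args. */
typedef long (*sys_call_ptr_t)(long, long, long, long, long, long);

#define NR_SYSCALLS 450                    /* illustrative size */
extern const sys_call_ptr_t sys_call_table[NR_SYSCALLS];

long do_syscall_sketch(unsigned long nr,
                       long a1, long a2, long a3,
                       long a4, long a5, long a6)
{
    if (nr >= NR_SYSCALLS)
        return -38;                        /* -ENOSYS */
    /* the real kernel clamps nr with array_index_nospec() here */
    return sys_call_table[nr](a1, a2, a3, a4, a5, a6);  /* e.g. sys_sendto */
}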
In older versions of Linux, the x86-64 syscall entry point used the table of system-call function pointers directly (e.g. as shown in this Q&A, where only the 32-bit entry points, like the one for int 0x80, used a C wrapper function).
But with the changes for Spectre and Meltdown mitigation, the native 64-bit system-call entry point (into a 64-bit kernel from 64-bit user-space) also uses a C wrapper around system-call dispatching. This allows using C macros and gcc hints to control speculation barriers before the indirect branch. The current Linux version of do_syscall_64 on GitHub is a pretty simple function; it's somewhat surprising that it's getting so many cycles itself, unless nr = array_index_nospec(nr, NR_syscalls); is a lot more expensive than I'd expect on your CPU.
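For reference, array_index_nospec() is built on branchless index masking, roughly like this self-contained sketch (the kernel's generic C fallback looks similar, while x86 actually uses a cmp/sbb asm sequence; this assumes idx and size fit in 63 bits, and gcc/clang arithmetic right-shift behavior on signed values):

#include <stdio.h>

/* Sketch of the mask trick behind array_index_nospec(), NOT the
 * kernel's exact code. Evaluates to ~0UL when idx < size and 0UL
 * otherwise, computed without a branch the CPU could mispredict,
 * so a speculatively out-of-bounds index is clamped to 0. */
static unsigned long index_mask_nospec(unsigned long idx, unsigned long size)
{
    return ~(long)(idx | (size - 1UL - idx)) >> 63;
}

int main(void)
{
    unsigned long nr = 44;            /* e.g. __NR_sendto on x86-64 */
    nr &= index_mask_nospec(nr, 450); /* clamp before the table load */
    printf("clamped nr = %lu\n", nr); /* prints 44; out-of-range idx gives 0 */
    return 0;
}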
There's definitely expensive stuff that happens in the hand-written-asm syscall entry point, e.g. writing the MSR that flushes the branch-prediction cache (an IBPB). Oh, and maybe the lack of good branch prediction after that flush is costing extra cycles in the first C function called.
System-call intensive workloads suffer a lot from Spectre / Meltdown mitigations. It might be interesting to try booting with some of them disabled, and/or with an older kernel that doesn't have that code at all.
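For example, a few kernel command-line options control these mitigations (exact names depend on kernel version; check Documentation/admin-guide/kernel-parameters.txt for yours, and obviously don't run untrusted code with them disabled):

mitigations=off    # 5.2 and later: turn off all optional CPU vulnerability mitigations
nopti              # disable Meltdown page-table isolation (KPTI) on x86
nospectre_v2       # disable Spectre variant 2 mitigations (retpolines / IBRS / IBPB)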
Meltdown / L1TF / etc. are completely fixed in the newest Intel CPUs with no performance cost, so disabling workarounds for that might give you some clue how much benefit you'd get from a brand new CPU.
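You can see which mitigations your current CPU/kernel combination actually uses with:

grep . /sys/devices/system/cpu/vulnerabilities/*

On hardware with the in-silicon fixes, entries like meltdown report "Not affected" instead of naming a software mitigation such as PTI.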
(Spectre is still a very hard problem and can't be easily fixed with a local change to the load ports. IDK how efficient the various mitigation strategies, microcode-assisted or not, are on different CPUs.)