Short version: I don't want to use third-party libraries or frameworks like Netmap or DPDK. Is there anything faster than poll() or select(), or can I make those calls more efficient?
Full version: I have a single-threaded application which uses a single socket to send data as fast as possible, and I'm using valgrind/cachegrind/callgrind to try to increase its efficiency (and thus the throughput of the application).
At present the receiving host sits in an infinite loop, checking for received data as fast as possible (the check needs to be non-blocking; when there is no packet to process, "other stuff" is done). I am using select() on the receiving host because select() takes a timeval, which offers microsecond polling resolution, whereas poll() takes a millisecond value. Callgrind is showing me that FD_SET() has executed nearly twice as many instructions as the select() call itself, and one downside to select() is that I must run FD_SET() on every iteration of the receive loop:
. . . . . . . . . // Poll for incoming frames
4,361,236 1 1 2,180,618 0 0 2,180,618 0 0 TEST_INTERFACE->TV_SELECT_DELAY.tv_sec = 0;
4,361,236 0 0 2,180,618 0 0 2,180,618 0 0 TEST_INTERFACE->TV_SELECT_DELAY.tv_usec = 000000;
56,696,068 2 2 15,264,326 0 0 2,180,618 0 0 FD_SET(TEST_INTERFACE->SOCKET_FD, &TEST_INTERFACE->FD_READS);
. . . . . . . . .
. . . . . . . . . TEST_INTERFACE->SELECT_RET_VAL = select(TEST_INTERFACE->SOCKET_FD_COUNT,
. . . . . . . . . &TEST_INTERFACE->FD_READS,
. . . . . . . . . NULL, NULL,
30,528,652 1 1 10,903,090 0 0 15,264,326 0 0 &TEST_INTERFACE->TV_SELECT_DELAY);
. . . . . . . . .
. . . . . . . . .
9,432,052 0 0 4,361,236 0 0 0 0 0 if (TEST_INTERFACE->SELECT_RET_VAL > 0 &&
7,450,590 1 1 2,128,740 0 0 0 0 0 FD_ISSET(TEST_INTERFACE->SOCKET_FD, &TEST_INTERFACE->FD_READS))
. . . . . . . . . {
. . . . . . . . .
. . . . . . . . . RX_LEN = recvfrom(TEST_INTERFACE->SOCKET_FD,
. . . . . . . . . FRAME_HEADERS->RX_BUFFER,
. . . . . . . . . TEST_PARAMS->F_SIZE_TOTAL,
5,321,850 1 1 2,128,740 0 0 2,838,320 0 0 0, NULL, NULL);
. . . . . . . . .
I am getting an average of 130-140Mbps of receive throughput using select(). Using poll() I am getting an average of 150-160Mbps.
. . . . . . . . . // Poll for incoming frames
7,347,032 2 2 1,836,758 0 0 4,591,895 0 0 TEST_INTERFACE->SELECT_RET_VAL = poll(TEST_INTERFACE->fds, 1, 0);
. . . . . . . . .
. . . . . . . . .
3,673,516 0 0 1,836,758 0 0 0 0 0 if (TEST_INTERFACE->SELECT_RET_VAL > 0)
. . . . . . . . . {
. . . . . . . . .
2,142,186 0 0 714,062 0 0 0 0 0 if ( TEST_INTERFACE->fds[0].revents & POLLIN )
714,062 0 0 357,031 0 0 357,031 0 0 TEST_INTERFACE->fds[0].revents = 0;
. . . . . . . . .
. . . . . . . . . RX_LEN = recvfrom(TEST_INTERFACE->SOCKET_FD,
. . . . . . . . . FRAME_HEADERS->RX_BUFFER,
. . . . . . . . . TEST_PARAMS->F_SIZE_TOTAL,
5,355,465 1 1 2,142,186 0 0 2,856,248 0 0 0, NULL, NULL);
With poll() I am passing a timeout value of 0, so hopefully it is not blocking for any length of time, but I'm not sure whether that is correct and whether it is actually faster than select().
In the lab I have a couple of servers back to back with 10Gbps NICs. Using select(), with the Tx host sending data at the 10Gbps line rate, the Rx host can process about 9.5Gbps of traffic. The above results are just between two virtual machines on my laptop (the lab is out of action at present), to show that poll() is giving me a slight increase on the Rx host. In the above results we can see that poll() was faster. However, 10Gbps is a packet every few tens of nanoseconds, so I'm not sure (when the lab is working again) whether I will see any improvement.
So I guess I have two questions:
In my poll() vs select() tests, since I am using a single socket, poll() is slightly faster. Will this hold true on faster connections (10/20/30/40Gbps), or at those speeds would select() be faster?
Since I am only using a single socket, is there any speed increase if I switch away from select()/poll() to epoll()? Or is there some other "built in" method I can use that is faster than poll() and select(), or something I can do to reduce their execution time?
If the fd(s) to select don't change you can (generally) cheat and just use a precomputed int.
int fds = 1 << fd;
int fdp;
do {
    fdp = fds;
    n = select(fd+1, &fdp, ...
If FD_SET is the bottleneck, this should help some.
One of the inefficiencies with select is copying big fd_sets in and out of kernel space; ensuring that the fd is low enough to fit in an int helps.
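A more portable variant of the same idea, as a rough sketch rather than a drop-in fix: build a real fd_set once, outside the loop, and restore it with a plain struct copy on each iteration instead of calling FD_SET(). The helper names fd_template_init and fd_ready are made up for illustration, and whether the copy is actually cheaper than FD_SET() in an optimized build is something you would need to measure.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: build the read set once, at setup time. */
static void fd_template_init(int fd, fd_set *tmpl)
{
    FD_ZERO(tmpl);
    FD_SET(fd, tmpl);
}

/* Hypothetical helper: non-blocking readiness check for a single fd.
 * select() overwrites the set it is given, so work on a copy of the
 * precomputed template instead of re-running FD_SET() every time. */
static int fd_ready(int fd, const fd_set *tmpl)
{
    fd_set rd = *tmpl;            /* struct copy replaces FD_SET() */
    struct timeval tv = {0, 0};   /* zero timeout: return immediately */
    int n = select(fd + 1, &rd, NULL, NULL, &tv);

    return n > 0 && FD_ISSET(fd, &rd);
}
```

In the receive loop you would call fd_template_init() once and then fd_ready() on each pass. The timeval is rebuilt each time as well, because Linux may modify it.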
I'm going to assume threads are off the table.
You may also get some help from using a non-blocking socket and only calling poll/select once the receive call fails with EAGAIN (and you have none of your "other work" to do).
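Roughly, the non-blocking approach could look like the sketch below. It assumes a datagram or raw socket (on a stream socket a return of 0 means end of file and would need separate handling), and the helper names set_nonblocking and drain_socket are invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: switch the socket to non-blocking mode once. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: read every frame currently queued on fd.
 * Returns the number of frames read, or -1 on a real error.
 * Assumes a datagram/raw socket, where a 0-byte read is a valid frame. */
static long drain_socket(int fd, void *buf, size_t len)
{
    long frames = 0;

    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0) {
            frames++;             /* process the frame in buf here */
            continue;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return frames;        /* queue is empty: go do other work */
        if (errno == EINTR)
            continue;             /* interrupted by a signal: retry */
        return -1;
    }
}
```

With this shape the hot path is a single recv() per frame, and poll/select is only needed if the loop has nothing else to do and wants to sleep until data arrives.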