Tags: c, sockets, scalability, epoll

Message Ordering with Asynchronous I/O (epoll)


Say that I've implemented an epoll-based TCP server where each thread is running something very similar to the code below (taken from the epoll man page, where kdpfd is the epoll file descriptor and listener is a socket that is listening on a port):

struct epoll_event ev, *events;
for(;;) {
    nfds = epoll_wait(kdpfd, events, maxevents, -1);
    for(n = 0; n < nfds; ++n) {
        if(events[n].data.fd == listener) {
            client = accept(listener, (struct sockaddr *) &local,
                            &addrlen);
            if(client < 0){
                perror("accept");
                continue;
            }
            setnonblocking(client);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client;
            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                fprintf(stderr, "epoll set insertion error: fd=%d\n",
                        client);
                return -1;
            }
        }
        else
            do_use_fd(events[n].data.fd);
    }
}

For the do_use_fd(events[n].data.fd) above, say we want to write everything we receive to stdout:

int do_use_fd(int fd) {
    ssize_t n;
    char buf[512];

    /* with EPOLLET we must drain the socket until read() would block */
    while ((n = read(fd, buf, 512)) > 0) {
        write(1, buf, n);
    }

    if (n == -1 && errno != EAGAIN && errno != EWOULDBLOCK) {
        perror("read");   /* do some error handling */
        return -1;
    }

    return 0;
}

Now, say I have 10k+ connections, all of which send me a lot of messages over a prolonged period of time. Assume that my clients send me the message "hello, my name is {client's name}" every few seconds. Assume that (somehow) this message is large enough that it has to be transferred as multiple packets.

As such, read(fd, buf, 512) may occasionally return -1 with an errno indicating it would block. In that case, I think the above solution could end up with something like the following output:

hello, my nam
hello, my name is Pau
e is John Le
hello, my name is Geo
nnon
l McCartney
rge
hello, my name is Ringo
Starr
 Harrison

because as soon as a read blocks on one connection, another read can start on a different connection. Instead, I'd like the following to be printed:

hello, my name is John Lennon
hello, my name is Paul McCartney
hello, my name is George Harrison
hello, my name is Ringo Starr

Is there a recommended way of dealing with this issue? One option would be to keep a buffer per connection, check whether the message is complete, and only print once it is. But with 10k+ connections, would this be a good idea? On one hand, something tells me this solution does not scale well. On the other hand, if the messages are only 500 bytes, then with 10k connections this solution only takes up about 5 MB.
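For concreteness, this is roughly what I picture the per-connection buffer looking like. Purely for this sketch, assume each message happens to end in a newline; conn_buf, MAX_FDS, and do_use_fd_buffered are names made up for the example, not part of the code above:

#include <errno.h>
#include <stdio.h>      /* perror */
#include <unistd.h>     /* read, write, ssize_t */

struct conn_buf {
    char   data[512];   /* roughly the 500-byte messages mentioned above */
    size_t len;         /* bytes of the current, still-incomplete message */
};

/* Indexed directly by fd to keep the sketch short; a real server would
 * allocate one lazily on accept() and release it on close(). */
#define MAX_FDS 65536
static struct conn_buf conn_bufs[MAX_FDS];

int do_use_fd_buffered(int fd) {
    struct conn_buf *cb = &conn_bufs[fd];
    char tmp[512];
    ssize_t n;

    while ((n = read(fd, tmp, sizeof(tmp))) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            if (tmp[i] == '\n') {
                /* message complete: print it in one piece and reset */
                write(1, cb->data, cb->len);
                write(1, "\n", 1);
                cb->len = 0;
            } else if (cb->len < sizeof(cb->data)) {
                cb->data[cb->len++] = tmp[i];
            }
            /* bytes past sizeof(cb->data) are dropped in this sketch */
        }
    }

    if (n == -1 && errno != EAGAIN && errno != EWOULDBLOCK) {
        perror("read");
        cb->len = 0;
        return -1;
    }
    if (n == 0)          /* peer closed: discard any partial message */
        cb->len = 0;
    return 0;
}

The array is indexed directly by fd only to keep the example short; presumably a real server would allocate a buffer in the accept path and release it when the connection closes.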

Thanks in advance.


Solution

  • I think using a buffer per connection would be OK in your case. It may, however, be more elegant to create a buffer per incomplete message. That would mean you somehow have to know when a message is done, so you would need a small protocol, such as a length field or a terminator (and possibly a timeout to kill incomplete messages after a certain time). This also guarantees that no unused memory stays allocated, because a buffer can be released as soon as its message is complete and passed up. You could, for example, access these buffers through a hashmap keyed on the connection's 5-tuple. If you decide to use a per-message identifier, which of course incurs extra overhead, you could even demultiplex messages from a single TCP connection that carries multiple messages at a time. (A sketch of the length-field framing appears at the end of this answer.)

    If you need to enforce ordering among these messages you will have to describe your situation in more detail, because ordering is a tough problem in many settings.

    Edit: Sorry, I have a lot to do at the moment, so I could not answer any sooner. You are correct that the connection-based approach is easier. The message-based approach becomes more advantageous the more sparsely the connections are used. If you can expect all connections to be receiving messages at all times, it is just overhead; if connections are sometimes idle for a while, it can reduce memory usage considerably. Also note that your application's memory usage then no longer scales with the number of clients but with the number of messages, which is usually nice, because message rates typically vary. You are also correct about ordering on a TCP stream: as long as you send only one complete message at a time over the connection, TCP will ensure ordering. Some applications, e.g. HTTP/2, reuse the same TCP connection to send multiple messages at the same time. In that case TCP will not help, because message fragments arrive interleaved and you need to demultiplex them (e.g. via stream IDs in HTTP/2).
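
    For illustration, the length-field framing mentioned above could look roughly like the sketch below. It assumes every message is prefixed with a 4-byte big-endian length; msg_state and read_framed are made-up names, and real error handling (a cap on the advertised length, checking malloc, handling EOF) is omitted:

    #include <arpa/inet.h>   /* ntohl */
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>       /* perror */
    #include <stdlib.h>      /* malloc, free */
    #include <string.h>      /* memcpy */
    #include <unistd.h>      /* read */

    struct msg_state {
        uint32_t expected;   /* payload length from the prefix; 0 = still reading prefix */
        uint32_t have;       /* prefix or payload bytes collected so far */
        uint8_t  prefix[4];
        char    *payload;    /* allocated only while a message is in flight */
    };

    /* Drain the socket, feeding bytes through a small state machine and
     * calling on_message for every complete message. Returns -1 on a real
     * read error. */
    int read_framed(int fd, struct msg_state *st,
                    void (*on_message)(const char *msg, uint32_t len)) {
        char buf[512];
        ssize_t n;

        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            for (ssize_t i = 0; i < n; i++) {
                if (st->expected == 0) {
                    /* still collecting the 4-byte length prefix */
                    st->prefix[st->have++] = (uint8_t)buf[i];
                    if (st->have == 4) {
                        memcpy(&st->expected, st->prefix, 4);
                        st->expected = ntohl(st->expected);
                        st->have = 0;
                        if (st->expected > 0)
                            st->payload = malloc(st->expected); /* no size cap or NULL check here */
                        /* zero-length messages are simply skipped in this sketch */
                    }
                } else {
                    st->payload[st->have++] = buf[i];
                    if (st->have == st->expected) {
                        on_message(st->payload, st->expected);  /* complete message */
                        free(st->payload);
                        st->payload = NULL;
                        st->expected = 0;
                        st->have = 0;
                    }
                }
            }
        }

        if (n == -1 && errno != EAGAIN && errno != EWOULDBLOCK) {
            perror("read");
            return -1;
        }
        return 0;
    }

    The caller keeps one zero-initialized msg_state per connection and calls read_framed from the epoll loop in place of do_use_fd, passing whatever handler should receive complete messages.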