c++ tcp circular-buffer lockless zero-copy

How to implement zero-copy TCP using a lock-free circular buffer in C++


I have multiple threads that need to consume data from a TCP stream. I wish to use a circular buffer/queue in shared memory to read from the TCP socket. The TCP receive will write directly to the circular queue. The consumers will read from the queue.

This design should enable zero-copy and zero-lock operation. However, there are two separate issues here.

  1. Is it possible/efficient to read just 1 logical message from the TCP socket? If not, and I read more than 1 message, I will have to copy the residuals from this to this->next.

  2. Is it really possible to implement a lock-less queue? I know there are atomic operations, but these can be costly too, because all CPU caches need to be invalidated. This will affect all operations on all of my 24 cores.

I'm a little rusty in low-level TCP, and not exactly clear how to tell when a message is complete. Do I look for \0 or is it implementation specific?

ty


Solution

  • Unfortunately, TCP cannot transfer messages, only byte streams. If you want to transfer messages, you will have to apply a protocol on top. The best protocols for high performance are those that use a sanity-checkable header specifying the message length: this allows you to read the correct amount of data directly into a suitable buffer object without iterating the data byte-by-byte looking for an end-of-message character. The buffer pointer can then be queued off to another thread and a new buffer object created/depooled for the next message. This avoids any copying of bulk data and, for large messages, is sufficiently efficient that using a non-blocking queue for the message object pointers is somewhat pointless.
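    As a rough sketch of the length-header idea, the code below assumes a hypothetical wire format (a 4-byte big-endian length followed by the payload) and an arbitrary 8 MiB sanity limit. The names `ReadFn`, `readExactly` and `readMessage` are made up for illustration; `ReadFn` stands in for whatever wraps `recv()` on the real socket.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical wire format (an assumption, not from the post above):
// a 4-byte big-endian payload length followed by that many payload bytes.
constexpr std::uint32_t kMaxPayload = 8u * 1024u * 1024u;  // sanity limit

// Reads up to n bytes into dst; returns bytes read, or <= 0 on EOF/error.
// For a real socket this would wrap recv(fd, dst, n, 0).
using ReadFn = std::function<long(char *dst, std::size_t n)>;

// Loops until exactly n bytes have arrived, since a stream read may return fewer.
inline bool readExactly(const ReadFn &readSome, char *dst, std::size_t n) {
  assert(readSome != nullptr);
  std::size_t got = 0;
  while (got < n) {
    long r = readSome(dst + got, n - got);
    if (r <= 0) return false;  // connection closed or failed
    got += static_cast<std::size_t>(r);
  }
  return true;
}

// Reads one complete message straight into a freshly allocated buffer.
// Returns nullptr on EOF, error, or a header that fails the sanity check.
inline std::unique_ptr<std::vector<char>> readMessage(const ReadFn &readSome) {
  unsigned char hdr[4];
  if (!readExactly(readSome, reinterpret_cast<char *>(hdr), sizeof hdr)) return nullptr;
  std::uint32_t len = (std::uint32_t(hdr[0]) << 24) | (std::uint32_t(hdr[1]) << 16) |
                      (std::uint32_t(hdr[2]) << 8) | std::uint32_t(hdr[3]);
  if (len > kMaxPayload) return nullptr;  // reject corrupt or hostile lengths
  auto msg = std::make_unique<std::vector<char>>(len);
  if (len != 0 && !readExactly(readSome, msg->data(), len)) return nullptr;
  return msg;  // only this pointer needs to be queued to a consumer
}
```

    The payload is written once, by the read itself, into the buffer that the consumer will eventually see; only the owning pointer moves between threads.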

    The next optimization available is to pool the object *buffers to avoid continual new/delete, recycling the *buffers in the consumer thread for re-use in the network receiving thread. This is fairly easy to do with a ConcurrentQueue, preferably a blocking one, which gives you flow-control instead of data corruption or segfaults/AV if the pool empties temporarily.
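    A minimal sketch of such a blocking pool, using a plain mutex and condition variable rather than any particular queue library. `Buffer`, `BufferPool` and the 4096-byte size are hypothetical names and values chosen for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical fixed-size buffer; the size is arbitrary for this sketch.
struct Buffer {
  int dataLen = 0;
  char data[4096];
};

// A pool of preallocated buffers. acquire() blocks while the pool is empty,
// which throttles the network thread instead of letting it outrun consumers.
class BufferPool {
public:
  explicit BufferPool(std::size_t count) {
    assert(count > 0);
    storage_.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
      storage_.push_back(std::make_unique<Buffer>());
      free_.push_back(storage_.back().get());
    }
  }

  // Called by the network thread: waits until a buffer is available.
  Buffer *acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    notEmpty_.wait(lock, [this] { return !free_.empty(); });
    Buffer *b = free_.back();
    free_.pop_back();
    b->dataLen = 0;  // reset state on depooling
    return b;
  }

  // Called by a consumer thread once it has finished with the buffer.
  void release(Buffer *b) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(b);
    }
    notEmpty_.notify_one();
  }

  // Number of buffers currently sitting unused in the pool.
  std::size_t available() {
    std::lock_guard<std::mutex> lock(mutex_);
    return free_.size();
  }

private:
  std::mutex mutex_;
  std::condition_variable notEmpty_;
  std::vector<std::unique_ptr<Buffer>> storage_;  // owns the buffers
  std::vector<Buffer *> free_;                    // currently unused buffers
};
```

    The lock is held only long enough to push or pop one pointer, so the pool size acts as the flow-control limit without the lock itself becoming a bottleneck.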

    Next, add a [cacheline size] 'dead-zone' at the start of each *buffer data member, so preventing any thread from false-sharing data with any other.
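    One way to sketch the dead-zone layout is shown below. The 64-byte line size is an assumption (common on x86-64, but hardware-specific), and `alignas` is an addition of this sketch rather than part of the description above; `PaddedBuffer` and `kCacheLine` are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; check the target hardware before relying on it.
constexpr std::size_t kCacheLine = 64;

// alignas starts every instance on a line boundary and rounds sizeof up to a
// whole number of lines, so adjacent instances never share a cache line.
struct alignas(kCacheLine) PaddedBuffer {
  char deadZone[kCacheLine];  // keeps whatever precedes the buffer off its hot line
  int dataLen;
  char data[4096];
};

static_assert(alignof(PaddedBuffer) == kCacheLine, "instances start on a line boundary");
static_assert(sizeof(PaddedBuffer) % kCacheLine == 0, "instances occupy whole lines");
static_assert(offsetof(PaddedBuffer, dataLen) >= kCacheLine, "hot fields sit after the dead zone");
```

    Note that a plain `new PaddedBuffer` is only guaranteed to honour a 64-byte alignment from C++17 onward; with older standards the dead zone alone still provides the separation.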

    The result should be a high-bandwidth flow of complete messages into the consumer thread with very little latency, CPU waste or cache-thrashing. All your 24 cores can run flat-out on different data.

    Copying bulk data in multithreaded apps is an admission of poor design and defeat.

    Follow up..

    Sounds like you're stuck with iterating the data because of the different protocols :(

    False-sharing-free PDU buffer object, example:

    class TbufferPool;     // pool type, defined elsewhere
    enum EpduState : int;  // protocol decode states, defined elsewhere

    struct SbufferData {
      char deadZone[256];   // anti-false-sharing
      int dataLen;          // number of valid bytes in data
      char data[8388608];   // 8 meg of data
    };

    class TdataBuffer {
    private:
      TbufferPool *myPool;  // reference to pool used, in case more than one
      EpduState PDUstate;   // enum state variable used to decode protocol
    protected:
      SbufferData netData;
    public:
      virtual ~TdataBuffer() {}
      virtual void reInit();  // zeros dataLen, resets PDUstate etc. - call when depooling a buffer
      virtual int loadPDU(char *fromHere, int len);  // loads protocol unit
      void release();  // pushes 'this' back onto 'myPool'
    };
    

    loadPDU is passed a pointer to, and the length of, the raw network data. It returns either 0, meaning that it has not yet completely assembled a PDU, or the number of bytes it ate from the raw network data to completely assemble a PDU. In the latter case, queue the buffer off, depool another one and call loadPDU() with the unused remainder of the raw data, then continue with the next raw data to come in.
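    The receive loop around loadPDU() might look roughly like this. The sketch uses a toy protocol in which a PDU is assumed to end at the first '\n'; `LinePdu` and `feedChunk` are hypothetical names, and a `std::vector` stands in for the queue to the consumer threads.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy PDU for illustration only: a PDU is assumed to end at the first '\n'.
class LinePdu {
public:
  // Consumes bytes from [fromHere, fromHere + len). Returns 0 if every byte was
  // absorbed and the PDU is still incomplete, otherwise the number of bytes
  // eaten to complete it (terminator included).
  int loadPDU(const char *fromHere, int len) {
    assert(len >= 0);
    for (int i = 0; i < len; ++i) {
      if (fromHere[i] == '\n') return i + 1;
      payload_.push_back(fromHere[i]);
    }
    return 0;
  }
  const std::string &payload() const { return payload_; }

private:
  std::string payload_;
};

// Feeds one chunk of raw network data through the current PDU, handing each
// completed PDU to `done` and starting a fresh one for any leftover bytes.
// `current` carries a partially assembled PDU between calls.
inline void feedChunk(std::unique_ptr<LinePdu> &current, const char *raw, int len,
                      std::vector<std::unique_ptr<LinePdu>> &done) {
  while (len > 0) {
    if (!current) current = std::make_unique<LinePdu>();  // "depool" in the real design
    int eaten = current->loadPDU(raw, len);
    if (eaten == 0) return;              // chunk exhausted, PDU still incomplete
    done.push_back(std::move(current));  // "queue it off"; current is now empty
    raw += eaten;
    len -= eaten;
  }
}
```

    Feeding "ab" and then "c\nde\nf" yields two completed PDUs ("abc" and "de") and leaves "f" in the partially assembled one, waiting for the next chunk.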

    You can use different pools of different derived buffer classes to serve different protocols, if needed - an array of TbufferPool[Eprotocols]. TbufferPool could just be a BlockingCollection queue. Management becomes almost trivial - the buffers can be sent on queues all round your system, to a GUI to display stats, then perhaps to a logger, as long as, at the end of the chain of queues, something calls release().

    Obviously, a 'real' PDU object would have loads more methods, data unions/structs, iterators maybe and a state-engine to operate the protocol, but that's the basic idea anyway. The main thing is easy management, encapsulation and, since no two threads can ever operate on the same buffer instance, no lock/synchro required to parse/access the data.

    Oh, yes, and since no queue has to remain locked for longer than required to push/pop one pointer, the chances of actual contention are very low - even conventional blocking queues would hardly ever need to use kernel locking.