Tags: message-broker, nats.io, nats-streaming-server

NATS Core + Nats C Client message loss


I am seeing message loss with NATS publish (Core, not JetStream yet).

Subscriber: the NATS CLI on Windows, subscribing with sub ">".

Server: NATS server on Ubuntu Linux on the local LAN.

Application: on Windows, using the NATS C client (latest GitHub version).

The following code reproduces the problem (possibly only on fast CPUs at the client side; I used AMD Threadripper 16, 32 and 64 cores and an Intel i7-10810U, and they all show it).

The problem occurs already with a SINGLE message, on an idle network with a NATS server dedicated to this test, so there is no other traffic or heavy load on the server.

You need to provide a logged-on connection to this method (code not shown, as it contains my keys). Set selectTest to 1, 2 or 3 to see the different scenarios and the workaround.

    #include <cstdio>
    #include <cstring>
    #include <string>
    #include "nats.h"

    natsStatus PublishIt(natsConnection* nc)
    {
        // Create subject
        std::string subject = "test";

        // Fill a buffer with data to send
        char* buf = new char[1024];
        int len = sprintf(buf, "This is a reliability test to see if NATS loses messages on fast systems and if possibly the provided buffer is cloned after the natsConnection_Publish() function already returned. If that is the case it would explain NATS' high performance while being unreliable depending on the underlying CPU speed and thread lottery.");

        // Publish
        natsStatus nstat = natsConnection_Publish(nc, subject.c_str(), (const void*) buf, len);
        if (nstat != NATS_OK) { printf("natsConnection_Publish() Failed"); delete[] buf; return nstat; }  // <<< Never failed

        // Select the test according to the remarks next to the 'case' statements.
        int selectTest = 3;
        switch (selectTest)
        {
            case 1: // This loses messages. NATS CLI doesn't display the text.
                delete[] buf;
                break;

            case 2: // This is a memory leak BUT NEVER loses any message, and the text above appears on the NATS CLI.
                    // Will eventually run out of memory, of course, and isn't an acceptable solution.
                // do nothing, just don't delete buf[]
                break;

            case 3: // This is a workaround that doesn't lose messages (NATS CLI shows the text) BUT it costs performance.
                nstat = natsConnection_Flush(nc);
                if (nstat != NATS_OK) printf("NATS Flush Failed: %i", nstat); // <<< Flush never failed.
                delete[] buf;
                break;
        }
        return nstat;
    }

Does anyone have a better solution than the flush() above? Something tells me that on an even faster CPU, or if core dedication became possible, this workaround is not going to hold. My reasoning is that the flush() just creates sufficient time for some underlying asynchronous action to consume the buffer before it is deleted.

I tried a single flush() with a 2-second timeout just before disconnecting, but that doesn't work. The flush must happen between the publish call and the deletion of the buffer, which means it must be called on EVERY SINGLE publish, and that is a performance problem.

The documentation at http://nats-io.github.io/nats.c/group__conn_pub_group.html#gac0b9f7759ecc39b8d77807b94254f9b4 doesn't say anything about whether the caller may release the buffer, hence I delete it. Maybe there is other documentation, but the page above claims to be the official one.

Thanks for any additional information.


Solution

  • After some more tests and some good information from the NATS C team, the following answers the question.

    1. A final flush(), just before disconnect, can work, but on fast CPUs it is best to add a short sleep (e.g. std::this_thread::sleep_for) between the flush and the disconnect. The flush is an async function and needs a moment to execute while the connection remains open.

    2. A flush after each publish is not needed as long as the publish statements keep filling the buffer. So a first, single message, or the last message in a set, might need flushing. In practice, knowing which message is the 'last' can be difficult in some situations.

    3. A flush driven by a timer that resets after every publish might also solve the problem of messages apparently lingering for a long time.
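
    Points 1 and 2 can be sketched as a shutdown sequence. natsConnection_FlushTimeout() and natsConnection_Close() are part of the C client API; the batch size, the 2-second timeout and the 100 ms sleep are illustrative assumptions that may need tuning:

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <thread>
    #include "nats.h"

    // Sketch only: publish a batch without per-message flushes, then do a
    // single flush with a timeout before disconnecting. The 100 ms sleep
    // gives the client's I/O machinery a moment while the connection is
    // still open (point 1 above); the value is a guess, not a guarantee.
    natsStatus PublishBatchAndShutdown(natsConnection* nc)
    {
        const char* msg = "reliability test message";
        natsStatus s = NATS_OK;

        for (int i = 0; i < 1000 && s == NATS_OK; i++)
            s = natsConnection_Publish(nc, "test", msg, (int) strlen(msg));

        if (s == NATS_OK)
        {
            s = natsConnection_FlushTimeout(nc, 2000); // one flush for the whole batch
            if (s != NATS_OK)
                printf("NATS Flush Failed: %d\n", s);
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }

        natsConnection_Close(nc);
        return s;
    }
    ```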

    With the above measures NATS Core remains very performant and no longer loses any messages.

    Possibly the one-liner to remember is: flush is async.

    Thanks to the NATS team for the help on GitHub.
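
    Point 3 above can be sketched as a small helper that remembers the time of the last publish and flushes only once the stream has gone idle. The class name, the threshold and the injected callback are made up for illustration; in a real program the callback would wrap natsConnection_Flush(nc):

    ```cpp
    #include <chrono>
    #include <functional>

    // Sketch of the timer-that-resets-on-publish idea: call onPublish()
    // after every publish and poll() periodically (e.g. from a background
    // thread). A flush fires only after the connection has been idle for
    // the configured duration, so a busy publisher is never slowed down.
    class IdleFlusher {
    public:
        using Clock = std::chrono::steady_clock;

        IdleFlusher(std::function<void()> flushFn, std::chrono::milliseconds idle)
            : flush_(std::move(flushFn)), idle_(idle), last_(Clock::now()) {}

        // Resets the idle timer; call right after each publish.
        void onPublish() { last_ = Clock::now(); dirty_ = true; }

        // Flushes at most once per burst, after the idle period elapses.
        void poll() {
            if (dirty_ && Clock::now() - last_ >= idle_) {
                flush_();
                dirty_ = false;
            }
        }

    private:
        std::function<void()> flush_;
        std::chrono::milliseconds idle_;
        Clock::time_point last_;
        bool dirty_ = false;
    };
    ```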