c++ protocol-buffers flatbuffers capnproto

Protobuf vs FlatBuffers vs Cap'n Proto: which is faster?


I decided to figure out which of Protobuf, FlatBuffers and Cap'n Proto would be the best/fastest serialization for my application. In my case, I'm sending some kind of byte/char array over a network (the reason I serialize to that format). So I made simple implementations for all three where I serialize and deserialize a string, a float and an int. This gave unexpected results: Protobuf being the fastest. I call them unexpected since both Cap'n Proto and FlatBuffers "claim" to be faster options. Before I accept this I would like to see if I unintentionally cheated in my code somehow. If I did not cheat, I would like to know why Protobuf is faster (knowing exactly why is probably impossible). Could the messages be too simple for Cap'n Proto and FlatBuffers to really shine?

My timings:

Time taken flatbuffers: 14162 microseconds
Time taken capnp: 60259 microseconds
Time taken protobuf: 12131 microseconds
(Timings from one machine; the relative comparison is what matters.)

UPDATE: The above numbers are not representative of CORRECT usage, at least not for capnp -- see answers & comments.

FlatBuffers code:

int main (int argc, char *argv[]){
    std::string s = "string";
    float f = 3.14;
    int i = 1337;

    std::string s_r;
    float f_r;
    int i_r;
    flatbuffers::FlatBufferBuilder message_sender;
    
    int steps = 10000;
    auto start = high_resolution_clock::now(); 
    for (int j = 0; j < steps; j++){
        auto autostring =  message_sender.CreateString(s);
        auto encoded_message = CreateTestmessage(message_sender, autostring, f, i);
        message_sender.Finish(encoded_message);
        uint8_t *buf = message_sender.GetBufferPointer();
        int size = message_sender.GetSize();
        //Send stuffs
        //Receive stuffs
        auto received_message = GetTestmessage(buf);

        s_r = received_message->string_()->str();
        f_r = received_message->float_();
        i_r = received_message->int_();
        message_sender.Clear();  // reset the builder only after we are done reading buf
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken flatbuffers: " << duration.count() << " microseconds" << endl;
    return 0;
}

Cap'n Proto code:

int main (int argc, char *argv[]){
    char s[] = "string";
    float f = 3.14;
    int i = 1337;

    const char * s_r;
    float f_r;
    int i_r;
    ::capnp::MallocMessageBuilder message_builder;
    Testmessage::Builder message = message_builder.initRoot<Testmessage>();

    int steps = 10000;
    auto start = high_resolution_clock::now(); 
    for (int j = 0; j < steps; j++){  
        //Encoding
        message.setString(s);
        message.setFloat(f);
        message.setInt(i);

        kj::Array<capnp::word> encoded_array = capnp::messageToFlatArray(message_builder);
        kj::ArrayPtr<char> encoded_array_ptr = encoded_array.asChars();
        char * encoded_char_array = encoded_array_ptr.begin();
        size_t size = encoded_array_ptr.size();
        //Send stuffs
        //Receive stuffs

        //Decoding
        kj::ArrayPtr<capnp::word> received_array = kj::ArrayPtr<capnp::word>(reinterpret_cast<capnp::word*>(encoded_char_array), size/sizeof(capnp::word));
        ::capnp::FlatArrayMessageReader message_receiver_builder(received_array);
        Testmessage::Reader message_receiver = message_receiver_builder.getRoot<Testmessage>();
        s_r = message_receiver.getString().cStr();
        f_r = message_receiver.getFloat();
        i_r = message_receiver.getInt();
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken capnp: " << duration.count() << " microseconds" << endl;
    return 0;

}

Protobuf code:

int main (int argc, char *argv[]){
    std::string s = "string";
    float f = 3.14;
    int i = 1337;

    std::string s_r;
    float f_r;
    int i_r;
    Testmessage message_sender;
    Testmessage message_receiver;
    int steps = 10000;
    auto start = high_resolution_clock::now(); 
    for (int j = 0; j < steps; j++){
        message_sender.set_string(s);
        message_sender.set_float_m(f);
        message_sender.set_int_m(i);
        int len = message_sender.ByteSize();
        std::vector<char> encoded_message(len);  // avoid a non-standard variable-length array
        message_sender.SerializeToArray(encoded_message.data(), len);
        message_sender.Clear();

        //Send stuffs
        //Receive stuffs
        message_receiver.ParseFromArray(encoded_message.data(), len);
        s_r = message_receiver.string();
        f_r = message_receiver.float_m();
        i_r = message_receiver.int_m();
        message_receiver.Clear();
       
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken protobuf: " << duration.count() << " microseconds" << endl;
    return 0;
}

I'm not including the message definition files since they are simple and most likely have nothing to do with it.


Solution

  • In Cap'n Proto, you should not reuse a MessageBuilder for multiple messages. The way you've written your code, every iteration of your loop will make the message bigger, because you're actually adding on to the existing message rather than starting a new one. To avoid memory allocation with each iteration, you should pass a scratch buffer to MallocMessageBuilder's constructor. The scratch buffer can be allocated once outside the loop, but you need to create a new MallocMessageBuilder each time around the loop. (Of course, most people don't bother with scratch buffers and just let MallocMessageBuilder do its own allocation, but if you choose that path in this benchmark, then you should also change the Protobuf benchmark to create a new message object for every iteration rather than reusing a single object.)
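    A minimal sketch of the scratch-buffer pattern described above, assuming the same generated Testmessage schema as in the question (the 1024-word buffer size is an arbitrary choice):

    ```cpp
    // Allocate the scratch space once, outside the loop. Cap'n Proto
    // requires scratch space to be zero-initialized; MallocMessageBuilder
    // re-zeroes the words it used when destroyed, so the next iteration
    // can reuse the buffer.
    capnp::word scratch[1024] = {};

    for (int j = 0; j < steps; j++) {
        // A fresh builder each iteration starts a genuinely new message,
        // but reuses the scratch space instead of calling malloc.
        capnp::MallocMessageBuilder message_builder(kj::arrayPtr(scratch, 1024));
        Testmessage::Builder message = message_builder.initRoot<Testmessage>();
        message.setString(s);
        message.setFloat(f);
        message.setInt(i);
        // ... serialize / send / receive / decode as before ...
    }
    ```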

    Additionally, your Cap'n Proto code is using capnp::messageToFlatArray(), which allocates a whole new buffer to put the message into and copies the entire message over. This is not the most efficient way to use Cap'n Proto. Normally, if you were writing the message to a file or socket, you would write directly from the message's original backing buffer(s) without making this copy. Try doing this instead:

    kj::ArrayPtr<const kj::ArrayPtr<const capnp::word>> segments =
        message_builder.getSegmentsForOutput();
    
    // Send segments
    // Receive segments
    
    capnp::SegmentArrayMessageReader message_receiver_builder(segments);
    

    Or, to make things more realistic, you could write the message out to a pipe and read it back in, using capnp::writeMessageToFd() and capnp::StreamFdMessageReader. (To be fair, you would need to make the protobuf benchmark write to / read from a pipe as well.)
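    A rough sketch of that pipe round-trip (error handling omitted; Testmessage is the generated type from the question's schema):

    ```cpp
    #include <unistd.h>           // pipe()
    #include <capnp/message.h>
    #include <capnp/serialize.h>  // writeMessageToFd, StreamFdMessageReader

    int fds[2];
    pipe(fds);

    capnp::MallocMessageBuilder message_builder;
    Testmessage::Builder message = message_builder.initRoot<Testmessage>();
    message.setString("string");
    message.setFloat(3.14f);
    message.setInt(1337);

    // Write the message to the pipe's write end...
    capnp::writeMessageToFd(fds[1], message_builder);

    // ...and parse it back from the read end.
    capnp::StreamFdMessageReader reader(fds[0]);
    Testmessage::Reader received = reader.getRoot<Testmessage>();
    ```

    For a small test message like this a single thread suffices; a message larger than the OS pipe buffer would need the reader running concurrently with the writer.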

    (I'm the author of Cap'n Proto and Protobuf v2. I'm not familiar with FlatBuffers so I can't comment on whether that code has any similar issues...)


    On benchmarks

    I've spent a lot of time benchmarking Protobuf and Cap'n Proto. One thing I've learned in the process is that most simple benchmarks you can create will not give you realistic results.

    First, any serialization format (even JSON) can "win" given the right benchmark case. Different formats will perform very, very differently depending on the content. Is it string-heavy, number-heavy, or object heavy (i.e. with deep message trees)? Different formats have different strengths here (Cap'n Proto is incredibly good at numbers, for example, because it doesn't transform them at all; JSON is incredibly bad at them). Is your message size incredibly short, medium-length, or very large? Short messages will mostly exercise the setup/teardown code rather than body processing (but setup/teardown is important -- sometimes real-world use cases involve lots of small messages!). Very large messages will bust the L1/L2/L3 cache and tell you more about memory bandwidth than parsing complexity (but again, this is important -- some implementations are more cache-friendly than others).

    Even after considering all that, you have another problem: Running code in a loop doesn't actually tell you how it performs in the real world. When run in a tight loop, the instruction cache stays hot and all the branches become highly predictable. So a branch-heavy serialization (like protobuf) will have its branching cost swept under the rug, and a code-footprint-heavy serialization (again... like protobuf) will also get an advantage. This is why micro-benchmarks are only really useful to compare code against other versions of itself (e.g. to test minor optimizations), NOT to compare completely different codebases against each other. To find out how any of this performs in the real world, you need to measure a real-world use case end-to-end. But... to be honest, that's pretty hard. Few people have the time to build two versions of their whole app, based on two different serializations, to see which one wins...