Protobuf vs FlatBuffers vs Cap'n Proto: which is faster?

I decided to figure out which of Protobuf, FlatBuffers and Cap'n Proto would be the best/fastest serialization for my application. In my case that means sending some kind of byte/char array over a network (the reason I serialize to that format). So I made simple implementations for all three, where I serialize and deserialize a string, a float and an int. This gave unexpected results: Protobuf was the fastest. I call them unexpected since both Cap'n Proto and FlatBuffers claim to be faster options. Before I accept this, I would like to see if I unintentionally cheated in my code somehow. If I did not cheat, I would like to know why Protobuf is faster (exactly why is probably impossible to say). Could the messages be too simple for Cap'n Proto and FlatBuffers to really shine?

My timings:

Time taken flatbuffers: 14162 microseconds
Time taken capnp: 60259 microseconds
Time taken protobuf: 12131 microseconds
(Times are from one machine; only the relative comparison is likely to be relevant.)

UPDATE: The above numbers are not representative of CORRECT usage, at least not for capnp -- see answers & comments.

FlatBuffers code:

int main (int argc, char *argv[]){
    std::string s = "string";
    float f = 3.14;
    int i = 1337;

    std::string s_r;
    float f_r;
    int i_r;
    flatbuffers::FlatBufferBuilder message_sender;
    
    int steps = 10000;
    auto start = high_resolution_clock::now(); 
    for (int j = 0; j < steps; j++){
        auto autostring =  message_sender.CreateString(s);
        auto encoded_message = CreateTestmessage(message_sender, autostring, f, i);
        message_sender.Finish(encoded_message);
        uint8_t *buf = message_sender.GetBufferPointer();
        int size = message_sender.GetSize();
        message_sender.Clear();
        //Send stuffs
        //Receive stuffs
        auto recieved_message = GetTestmessage(buf);

        s_r = recieved_message->string_()->str();
        f_r = recieved_message->float_();
        i_r = recieved_message->int_(); 
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken flatbuffer: " << duration.count() << " microseconds" << endl;
    return 0;
}

Cap'n Proto code:

int main (int argc, char *argv[]){
    char s[] = "string";
    float f = 3.14;
    int i = 1337;

    const char * s_r;
    float f_r;
    int i_r;
    ::capnp::MallocMessageBuilder message_builder;
    Testmessage::Builder message = message_builder.initRoot<Testmessage>();

    int steps = 10000;
    auto start = high_resolution_clock::now(); 
    for (int j = 0; j < steps; j++){  
        //Encoding
        message.setString(s);
        message.setFloat(f);
        message.setInt(i);

        kj::Array<capnp::word> encoded_array = capnp::messageToFlatArray(message_builder);
        kj::ArrayPtr<char> encoded_array_ptr = encoded_array.asChars();
        char * encoded_char_array = encoded_array_ptr.begin();
        size_t size = encoded_array_ptr.size();
        //Send stuffs
        //Receive stuffs

        //Decoding
        kj::ArrayPtr<capnp::word> received_array = kj::ArrayPtr<capnp::word>(reinterpret_cast<capnp::word*>(encoded_char_array), size/sizeof(capnp::word));
        ::capnp::FlatArrayMessageReader message_receiver_builder(received_array);
        Testmessage::Reader message_receiver = message_receiver_builder.getRoot<Testmessage>();
        s_r = message_receiver.getString().cStr();
        f_r = message_receiver.getFloat();
        i_r = message_receiver.getInt();
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken capnp: " << duration.count() << " microseconds" << endl;
    return 0;

}

Protobuf code:

int main (int argc, char *argv[]){
    std::string s = "string";
    float f = 3.14;
    int i = 1337;

    std::string s_r;
    float f_r;
    int i_r;
    Testmessage message_sender;
    Testmessage message_receiver;
    int steps = 10000;
    auto start = high_resolution_clock::now(); 
    for (int j = 0; j < steps; j++){
        message_sender.set_string(s);
        message_sender.set_float_m(f);
        message_sender.set_int_m(i);
        int len = message_sender.ByteSize();
        char encoded_message[len];
        message_sender.SerializeToArray(encoded_message, len);
        message_sender.Clear();

        //Send stuffs
        //Receive stuffs
        message_receiver.ParseFromArray(encoded_message, len);
        s_r = message_receiver.string();
        f_r = message_receiver.float_m();
        i_r = message_receiver.int_m();
        message_receiver.Clear();
       
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken protobuf: " << duration.count() << " microseconds" << endl;
    return 0;
}

I am not including the message definition files since they are simple and most likely have nothing to do with it.

kj::ArrayPtr<const kj::ArrayPtr<const capnp::word>> segments = message_builder.getSegmentsForOutput();
// Send segments
// Receive segments
capnp::SegmentArrayMessageReader message_receiver_builder(segments);

On benchmarks

I've spent a lot of time benchmarking Protobuf and Cap'n Proto. One thing I've learned in the process is that most simple benchmarks you can create will not give you realistic results.

First, any serialization format (even JSON) can "win" given the right benchmark case. Different formats will perform very, very differently depending on the content. Is it string-heavy, number-heavy, or object-heavy (i.e. with deep message trees)? Different formats have different strengths here (Cap'n Proto is incredibly good at numbers, for example, because it doesn't transform them at all; JSON is incredibly bad at them). Is your message size incredibly short, medium-length, or very large? Short messages will mostly exercise the setup/teardown code rather than body processing (but setup/teardown is important -- sometimes real-world use cases involve lots of small messages!). Very large messages will bust the L1/L2/L3 cache and tell you more about memory bandwidth than parsing complexity (but again, this is important -- some implementations are more cache-friendly than others).
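
To make the "doesn't transform them at all" point concrete, here is a minimal sketch (not taken from either library's source) contrasting a Protobuf-style base-128 varint, which is how Protobuf encodes int32 fields on the wire, with a fixed-width copy in the spirit of Cap'n Proto and FlatBuffers. The function names are made up for illustration, and the fixed-width helpers assume a little-endian host.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Protobuf-style base-128 varint: 7 payload bits per byte, with the high bit
// set on every byte except the last. The loop branches once per output byte.
inline void encode_varint(std::uint64_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decodes one varint. Returns the number of bytes consumed, or 0 if the
// input ends before the terminating byte (or exceeds 10 bytes).
inline std::size_t decode_varint(const std::uint8_t* data, std::size_t size,
                                 std::uint64_t& value) {
    value = 0;
    for (std::size_t i = 0; i < size && i < 10; ++i) {
        value |= static_cast<std::uint64_t>(data[i] & 0x7F) << (7 * i);
        if ((data[i] & 0x80) == 0) {
            return i + 1;
        }
    }
    return 0;
}

// Fixed-width encoding: the in-memory bytes are the wire bytes, so encoding
// and decoding are plain copies with no data-dependent branches.
// Assumption: the host is little-endian, matching the wire byte order.
inline void encode_fixed32(std::uint32_t value, std::uint8_t* out) {
    std::memcpy(out, &value, sizeof value);
}

inline std::uint32_t decode_fixed32(const std::uint8_t* in) {
    std::uint32_t value;
    std::memcpy(&value, in, sizeof value);
    return value;
}
```

With these helpers, 1337 becomes the two bytes 0xB9 0x0A as a varint, while the fixed-width form always occupies four bytes but needs no loop to read back. Which of the two is cheaper overall depends on the message content, which is exactly why the benchmark case matters.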

Even after considering all that, you have another problem: Running code in a loop doesn't actually tell you how it performs in the real world. When run in a tight loop, the instruction cache stays hot and all the branches become highly predictable. So a branch-heavy serialization (like protobuf) will have its branching cost swept under the rug, and a code-footprint-heavy serialization (again... like protobuf) will also get an advantage. This is why micro-benchmarks are only really useful to compare code against other versions of itself (e.g. to test minor optimizations), NOT to compare completely different codebases against each other. To find out how any of this performs in the real world, you need to measure a real-world use case end-to-end. But... to be honest, that's pretty hard. Few people have the time to build two versions of their whole app, based on two different serializations, to see which one wins...
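
If you do use a micro-benchmark for the narrow purpose described above (comparing a piece of code against another version of itself), a small harness can at least reduce run-to-run noise. The following is a rough sketch under my own assumptions: `time_runs` and `median_us` are hypothetical helper names, not part of any of the three libraries, and the harness does nothing about the hot instruction cache or predictable branches, so it does not make cross-library comparisons any more realistic.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls `body` `warmup` times untimed, then `runs` times
// timed, and returns each timed run's duration in microseconds.
template <class F>
std::vector<long long> time_runs(F&& body, int warmup, int runs) {
    using clock = std::chrono::steady_clock;  // monotonic, unlike system_clock
    for (int i = 0; i < warmup; ++i) {
        body();
    }
    std::vector<long long> samples;
    samples.reserve(static_cast<std::size_t>(runs > 0 ? runs : 0));
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        body();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count());
    }
    return samples;
}

// Median of the samples (the upper median for an even count); 0 if empty.
// The median is less sensitive than the mean to a few slow outlier runs.
inline long long median_us(std::vector<long long> samples) {
    if (samples.empty()) {
        return 0;
    }
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

A call such as `median_us(time_runs(one_encode_decode_loop, 3, 20))`, where `one_encode_decode_loop` stands for whatever callable you are measuring, would then report a typical run rather than a single one. The body should also do something observable with the decoded values, otherwise an optimizing compiler is free to drop work whose result is never used.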
