Performant Entity Serialization: BSON vs MessagePack (vs JSON)
Recently I came across MessagePack, a binary serialization format that positions itself as an alternative to Google's Protocol Buffers and JSON, and claims to outperform both.

Also there's the BSON serialization format that is used by MongoDB for storing data.

Can somebody elaborate on the differences and the advantages/disadvantages of BSON vs MessagePack?


Just to complete the list of performant binary serialization formats: there are also Gobs, which are positioned as a successor to Google's Protocol Buffers. However, in contrast to all the other formats mentioned, Gobs are not language-agnostic: they rely on Go's built-in reflection, although there are Gob libraries for at least one language other than Go.

Opposable answered 15/6, 2011 at 9:14 Comment(3)
Seems mostly like a load of marketing hype. The performance of a ["compiled"] serialization format is due to the implementation used. While some formats have inherently more overhead (e.g. JSON, as it's all dynamically processed), formats themselves do not "have a speed". The page then goes on to "pick and choose" how it compares itself ... in a very biased fashion. Not my cup of tea.Stramonium
Correction: Gobs aren't intended to replace Protocol Buffers, and probably never will. Also, Gobs are language agnostic (they can be read/written in any language, see code.google.com/p/libgob), but they are defined to closely match how Go deals with data, so they work best with Go.Blagoveshchensk
Link to msgpack performance benchmarks is broken (msgpack.org/index/speedtest.png).Squint

(Please note that I'm the author of MessagePack, so this answer may be biased.)

Format design

  1. Compatibility with JSON

    In spite of its name, BSON's compatibility with JSON is not as good as MessagePack's.

    BSON has special types like "ObjectId", "Min key", "UUID" or "MD5" (I think these types are required by MongoDB). These types are not compatible with JSON. That means some type information can be lost when you convert objects from BSON to JSON, but of course only when these special types are present in the BSON source. It can be a disadvantage to use both JSON and BSON in a single service.
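
    For instance, MongoDB's "extended JSON" convention has to wrap such values when rendering them as JSON: a hypothetical ObjectId comes out as {"$oid": "507f1f77bcf86cd799439011"}, so plain JSON tooling no longer sees an ObjectId, just a nested document.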

    MessagePack is designed to be transparently converted from/to JSON.

  2. MessagePack is smaller than BSON

    MessagePack's format is less verbose than BSON. As a result, MessagePack serializes the same objects into fewer bytes than BSON.

    For example, a simple map {"a":1, "b":2} is serialized in 7 bytes with MessagePack, while BSON uses 19 bytes.
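
    To see where those numbers come from, here is a byte-by-byte sketch of both encodings, following the published MessagePack and BSON format specs:

    // MessagePack encoding of {"a":1, "b":2} -- 7 bytes total
    82           // fixmap header: map with 2 key-value pairs
    a1 61        // fixstr of length 1: "a"
    01           // positive fixint: 1
    a1 62        // fixstr of length 1: "b"
    02           // positive fixint: 2

    // BSON encoding of {"a":1, "b":2} -- 19 bytes total
    13 00 00 00  // int32 document length (19, little-endian)
    10 61 00     // element: type 0x10 (int32), key "a" as cstring
    01 00 00 00  // int32 value: 1
    10 62 00     // element: type 0x10 (int32), key "b" as cstring
    02 00 00 00  // int32 value: 2
    00           // document terminator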

  3. BSON supports in-place updating

    With BSON, you can modify part of a stored object without re-serializing the whole object. Let's suppose a map {"a":1, "b":2} is stored in a file and you want to update the value of "a" from 1 to 2000.

    With MessagePack, 1 uses only 1 byte but 2000 uses 3 bytes. So the entry for "b" must be shifted back by 2 bytes, even though "b" itself is not modified.

    With BSON, both 1 and 2000 occupy 5 bytes (a type byte plus a fixed-width int32 value). Because of this fixed-width verbosity, you don't have to move "b".
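
    Concretely, per the two specs, MessagePack encodes 1 as a single positive-fixint byte but needs a uint16 marker plus two payload bytes for 2000, while BSON's int32 value field stays 4 bytes either way:

    // MessagePack: the value grows, so everything after it must shift
    01           // 1 as positive fixint (1 byte)
    cd 07 d0     // 2000 as uint16 (3 bytes, big-endian payload)

    // BSON: the value is simply overwritten in place
    01 00 00 00  // 1 as int32 (4 bytes)
    d0 07 00 00  // 2000 as int32 (4 bytes, little-endian)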

  4. MessagePack has RPC

    MessagePack, Protocol Buffers, Thrift, and Avro support RPC, but BSON doesn't.

These differences imply that MessagePack was originally designed for network communication while BSON was designed for storage.

Implementation and API design

  1. MessagePack has type-checking APIs (Java, C++ and D)

    MessagePack supports static typing.

    Dynamic typing, as used with JSON or BSON, is useful for dynamic languages like Ruby, Python, or JavaScript, but troublesome for static languages, where you must write boring type-checking code.

    MessagePack provides a type-checking API. It converts dynamically-typed objects into statically-typed objects. Here is a simple example (C++):

    #include <msgpack.hpp>
    #include <string>
    #include <vector>

    class myclass {
    private:
        std::string str;
        std::vector<int> vec;
    public:
        // This macro enables this class to be serialized/deserialized
        MSGPACK_DEFINE(str, vec);
    };

    int main(void) {
        // serialize (how m1 gets populated is elided here)
        myclass m1;

        msgpack::sbuffer buffer;
        msgpack::pack(&buffer, m1);

        // deserialize
        msgpack::unpacked result;
        msgpack::unpack(&result, buffer.data(), buffer.size());

        // you get a dynamically-typed object
        msgpack::object obj = result.get();

        // convert it to a statically-typed object
        myclass m2 = obj.as<myclass>();
    }
  2. MessagePack has an IDL

    Related to the type-checking API, MessagePack supports an IDL. (The specification is available from: http://wiki.msgpack.org/display/MSGPACK/Design+of+IDL)

    Protocol Buffers and Thrift require an IDL (they don't support dynamic typing) and provide more mature IDL implementations.

  3. MessagePack has a streaming API (Ruby, Python, Java, C++, ...)

    MessagePack supports streaming deserializers. This feature is useful for network communication. Here is an example (Ruby):

    require 'msgpack'

    # write objects to stdout
    $stdout.write [1,2,3].to_msgpack
    $stdout.write [1,2,3].to_msgpack

    # read objects from stdin using streaming deserializer
    unpacker = MessagePack::Unpacker.new($stdin)
    # use iterator
    unpacker.each {|obj|
      p obj
    }
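
    To try this, a hypothetical split of the snippet into writer.rb (the two $stdout.write lines) and reader.rb (the unpacker loop) lets you pipe one program into the other:

    $ ruby writer.rb | ruby reader.rb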
Absorbing answered 15/6, 2011 at 11:31 Comment(10)
How does MessagePack compare with Google Protobufs in terms of data size, and consequently, over-the-air performance?Si
The first point glosses over the fact that MessagePack has a raw-bytes capability which cannot be represented in JSON. So it's just the same as BSON in that regard...Impignorate
@lttlrck Generally, the raw bytes are assumed to be a string (usually utf-8), unless otherwise expected and agreed to on both sides of the channel. msgpack is used as a stream/serialization format... and less verbose than json... though also less human readable.Lacking
"MessagePack has type-checking APIs. BSON Doesn't." Not entirely accurate. This is actually true for BSON implementations in statically typed languages as well.Enharmonic
MessagePack now has a BINARY data type so the argument of 1-1 de-serialization compatibility to JSON is not entirely true anymore.Unspoken
On the first point, BSON can be bidirectionally converted to JSON without issue. (It's covered in the manual.) Two and three are directly related, i.e. it stores full-size (unpacked) integers to avoid needing to move data later, strings are stored both as Pascal (length-prefixed) and C (null-suffixed), etc. Additional savings are at the discretion of the application: MongoDB uses configurable page compression (snappy, zlib, etc.) for example. On the RPC front, BSON is used for MongoDB's wire protocol and does streaming and RPC quite well.Andie
There's one last feature that people often overlook: properly utilized, you don't need to "deserialize" your BSON data at all. You can instead map it directly to a C structure (with appropriate padding) and directly access the contents.Andie
Java question re: your MessagePack: What if I have an existing serialization setup -- that I don't want to change -- that already gives me a JSON string and I just want to use MessagePack to crunch that string down so it's smaller on the wire and then use JavaScript MessagePack to uncrunch it back to a JavaScript JSON string and pass that to my existing deserialization setup? Can MessagePack do that? I realize this means not using lots of potentially cool features, but we only need the compression aspect (we have tight length limits).Climate
Can someone please tell me what an 'RPC' is?Confiding
RPC stands for "Remote Procedure Call"Tinnitus

I think it's very important to mention that it depends on what your client/server environment looks like.

If you are passing bytes around multiple times without inspecting them, such as with a message-queue system or when streaming log entries to disk, then you may well prefer a binary encoding to emphasize compact size. Otherwise it's a case-by-case issue with different environments.

Some environments can serialize and deserialize to/from msgpack/protobufs very quickly; others not so much. In general, the lower-level the language/environment, the better binary serialization will work. In higher-level languages (node.js, .Net, JVM) you will often see that JSON serialization is actually faster. The question then becomes: is your network overhead more or less constrained than your memory/CPU?

With regards to msgpack vs bson vs protocol buffers... msgpack uses the fewest bytes of the group, with protocol buffers being about the same. BSON defines broader native types than the other two, and may be a better match to your object model, but this makes it more verbose. Protocol buffers have the advantage of being designed to stream... which makes it a more natural format for binary transfer/storage.

Personally, I would lean towards the transparency that JSON offers directly, unless there is a clear need for lighter traffic. Over HTTP with gzipped data, the difference in network overhead between the formats is even less of an issue.

Lacking answered 30/7, 2013 at 21:34 Comment(2)
Native MsgPack is only size-competitive with ProtocolBuffers when the keys (which are always-present text) are short, such as "a" or "b" - or otherwise an insignificant part of the entire payload. They are always short in ProtocolBuffers, which uses an IDL/compiler to map field descriptors to ids. This is also what makes MsgPack "dynamic", which ProtocolBuffers most certainly is not.Alixaliza
The end point is good though: gzip/deflate are really good at handling redundancy of keys in cases where such keys are "longer but repeated a lot" (MsgPack, JSON/BSON, and XML, etc. over many records) but won't help ProtocolBuffers at all here. Avro does key-redundancy elimination manually by transmitting the schema separately.Alixaliza

Well, as the author said, MessagePack was originally designed for network communication while BSON was designed for storage.

MessagePack is compact while BSON is verbose. MessagePack is meant to be space-efficient while BSON is designed for CRUD operations (time-efficient).

Most importantly, MessagePack's type prefixes follow Huffman encoding; here I drew a Huffman tree of MessagePack (click the link to see the image):

Huffman Tree of MessagePack
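
In case the image doesn't load: the idea the tree captures is visible in the first-byte allocation of the MessagePack spec, where the most common values get the shortest encodings and fit entirely in a single byte:

    0x00 - 0x7f   positive fixint (the value lives in the prefix byte itself)
    0x80 - 0x8f   fixmap (up to 15 entries)
    0x90 - 0x9f   fixarray (up to 15 elements)
    0xa0 - 0xbf   fixstr (up to 31 bytes)
    0xc0 - 0xdf   nil, booleans, and bin/ext/float/int/str/array/map with explicit lengths
    0xe0 - 0xff   negative fixint (-32 to -1)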

Clayborne answered 19/8, 2019 at 3:36 Comment(0)

A key difference not yet mentioned is that BSON contains size information, in bytes, for the entire document and for every nested sub-document.

document    ::=     int32 e_list "\x00"

This has two major benefits for restricted environments (e.g. embedded) where size and performance are important.

  1. You can immediately check whether the data you're about to parse represents a complete document or whether you'll need to request more at some point (be it from some connection or storage). Since this is most likely an asynchronous operation, you might already send a new request before parsing.
  2. Your data might contain entire sub-documents that hold no relevant information for you. BSON allows you to easily skip to the next object past the sub-document by using the sub-document's size information. msgpack, on the other hand, stores the number of elements inside what's called a map (similar to BSON's sub-documents). While this is undoubtedly useful information, it doesn't help the parser: you'd still have to parse every single object inside the map and can't just skip it (see the sketch after this list). Depending on the structure of your data, this might have a huge impact on performance.
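
Here is a minimal sketch of both checks in C++, assuming a little-endian host; is_complete_document and skip_bson_subdocument are hypothetical helpers for illustration, not part of any BSON library:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // BSON documents and sub-documents start with a little-endian int32
    // giving their total size, including the 4 length bytes themselves
    // and the trailing 0x00 terminator.

    // Check (1): do we already have a complete document in the buffer?
    bool is_complete_document(const uint8_t* p, size_t available) {
        if (available < 4) return false;       // length prefix not fully received yet
        int32_t len;
        std::memcpy(&len, p, sizeof len);      // memcpy avoids unaligned reads
        return available >= static_cast<size_t>(len);
    }

    // Check (2): skip an irrelevant sub-document in O(1).
    const uint8_t* skip_bson_subdocument(const uint8_t* p) {
        int32_t len;
        std::memcpy(&len, p, sizeof len);
        return p + len;                        // first byte after the sub-document
    }

    // A msgpack map header, by contrast, only stores the element count,
    // so "skipping" a map means recursively parsing everything inside it.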
Misbecome answered 3/4, 2019 at 7:36 Comment(0)

I made a quick benchmark to compare the encoding and decoding speed of MessagePack vs BSON. BSON is faster, at least if you have large binary arrays:

BSON writer: 2296 ms (243487 bytes)
BSON reader: 435 ms
MESSAGEPACK writer: 5472 ms (243510 bytes)
MESSAGEPACK reader: 1364 ms
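
(Both serialized payloads are dominated by the 243,432-byte buffer field; the overhead beyond the raw bytes is only 55 bytes for BSON versus 78 for MessagePack, so the size difference is negligible and the timings mostly reflect how each library handles a large byte[].)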

Using C# Newtonsoft.Json and MessagePack by neuecc:

    using System;
    using System.Diagnostics;
    using System.IO;
    using MessagePack;

    public class TestData
    {
        public byte[] buffer;
        public bool foobar;
        public int x, y, w, h;
    }

    static void Main(string[] args)
    {
        try
        {
            int loop = 10000;

            var buffer = new TestData();
            TestData data2;
            byte[] data = null;
            int val = 0, val2 = 0, val3 = 0;

            buffer.buffer = new byte[243432];

            var sw = new Stopwatch();

            sw.Start();
            for (int i = 0; i < loop; i++)
            {
                data = SerializeBson(buffer);
                val2 = data.Length;
            }

            var rc1 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < loop; i++)
            {
                data2 = DeserializeBson(data);
                val += data2.buffer[0];
            }
            var rc2 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < loop; i++)
            {
                data = SerializeMP(buffer);
                val3 = data.Length;
                val += data[0];
            }

            var rc3 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < loop; i++)
            {
                data2 = DeserializeMP(data);
                val += data2.buffer[0];
            }
            var rc4 = sw.ElapsedMilliseconds;

            Console.WriteLine("Results:", val);
            Console.WriteLine("BSON writer: {0} ms ({1} bytes)", rc1, val2);
            Console.WriteLine("BSON reader: {0} ms", rc2);
            Console.WriteLine("MESSAGEPACK writer: {0} ms ({1} bytes)", rc3, val3);
            Console.WriteLine("MESSAGEPACK reader: {0} ms", rc4);
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
        }

        Console.ReadLine();
    }

    static private byte[] SerializeBson(TestData data)
    {
        var ms = new MemoryStream();

        using (var writer = new Newtonsoft.Json.Bson.BsonWriter(ms))
        {
            var s = new Newtonsoft.Json.JsonSerializer();
            s.Serialize(writer, data);
            return ms.ToArray();
        }
    }

    static private TestData DeserializeBson(byte[] data)
    {
        var ms = new MemoryStream(data);

        using (var reader = new Newtonsoft.Json.Bson.BsonReader(ms))
        {
            var s = new Newtonsoft.Json.JsonSerializer();
            return s.Deserialize<TestData>(reader);
        }
    }

    static private byte[] SerializeMP(TestData data)
    {
        return MessagePackSerializer.Typeless.Serialize(data);
    }

    static private TestData DeserializeMP(byte[] data)
    {
        return (TestData)MessagePackSerializer.Typeless.Deserialize(data);
    }
Servomechanism answered 29/4, 2019 at 12:55 Comment(0)

A quick test shows that minified JSON is deserialized faster than binary MessagePack. In the tests, Article.json is 550 kB of minified JSON and Article.mpack is the 420 kB MessagePack version of it. This may be an implementation issue, of course.

MessagePack:

//test_mp.js
var msg = require('msgpack');
var fs = require('fs');

var article = fs.readFileSync('Article.mpack');

for (var i = 0; i < 10000; i++) {
    msg.unpack(article);    
}

JSON:

// test_json.js
var fs = require('fs');

var article = fs.readFileSync('Article.json', 'utf-8');

for (var i = 0; i < 10000; i++) {
    JSON.parse(article);
}

So times are:

Anarki:Downloads oleksii$ time node test_mp.js 

real    2m45.042s
user    2m44.662s
sys     0m2.034s

Anarki:Downloads oleksii$ time node test_json.js 

real    2m15.497s
user    2m15.458s
sys     0m0.824s

So space is saved, but faster? No.
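
(Per iteration, that works out to roughly 16.5 ms per msg.unpack versus 13.5 ms per JSON.parse over the 10,000 runs.)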

Tested versions:

Anarki:Downloads oleksii$ node --version
v0.8.12
Anarki:Downloads oleksii$ npm list msgpack
/Users/oleksii
└── [email protected]  
Reposit answered 5/11, 2012 at 22:34 Comment(8)
Definitely depends on the implementations. My tests with Python 2.7.3 unpacking a 489K test.json (409K equivalent test.msgpack) show that for 10,000 iterations simplejson 2.6.2 takes 66.7 seconds and msgpack 0.2.2 takes just 28.8.Zelazny
Where did this Article.json come from?Yam
Folks, the test code is in my answer above, what else did you expect? Article.json is a json-serialized object from our project. And by now those results might be irrelevant anyway.Reposit
Also worth noting is if you gzip the data, how close are the sizes then... And it very much does depend on the environment and efficiency of the JSON handling. JS has a particularly good JSON serializer.Lacking
This is not a fair performance comparison, as JS has JSON implemented natively in C++, while msgpack is implemented in JS.Ul
The msgpack organization put this github.com/kawanet/msgpack-lite as the recommended JavaScript implementation on their home page. It looks fast. Also don't forget that (de)serialization performance is not the only use case for msgpack; its other main advantage is the message size, of course.Chickenlivered
You are trying to make MessagePack talk Latin better than the Romans. JSON is native (C++) to JavaScript while MessagePack is written in JavaScript, which is interpreted. This is basically comparing two code snippets, one written in JavaScript and the other written in C++.Deserving
Further, regardless of implementation, it's nearly impossible for the most performant (possible) implementation of JSON to deserialize faster than the most performant (possible) implementation of MsgPack.Byelorussian
