Protobuf, JSON, XML, ASN.1, etc (there's a loooong list) are all ways of representing information. The idea is that if your program is given a file, or network stream, or memory buffer, that is in "Protobuf wire format", you'd use the schema for that file and the tools/libraries/code that Google publish to interpret it into an object (or structure) that your program can use.
The reason to do so is to overcome the problem that different computers / OSes / languages store data in different ways. Suppose you've got an object in a program you've written in C++ on 64-bit Linux on x86-64. The way that object is stored in memory is not compatible with how Java would store the equivalent object. So if the C++ program wrote the content of memory to a file (or network stream, or whatever), the Java program couldn't easily read it.
Protobuf (and the rest) solves this problem by abstracting how information is stored and providing a set of tools for a variety of different programming languages and operating systems that understand that abstraction. This makes it a whole lot easier to get information from, say, a program written in C++ into a program written in Java.
"Schema First" and "Code First"
An important concept to understand is the difference between "schema first" and "code first" approaches. With Protobuf, ASN.1, or XML/XSD, one starts off writing a schema, which is compiled to source code (C, Java, whatever), which you combine with your own program source code, then compile and link. The result is your program, which can read and write messages (whose structure you originally defined in the schema file) in the relevant wire format - protobuf "wire format" in protobuf's case.
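For a flavour of what that looks like with protobuf's Python support, here's a minimal sketch; the schema, message name, field names and generated module name are all invented for illustration (the module name follows from whatever you call the schema file):

```python
# A hypothetical schema, say position.proto, compiled with:
#   protoc --python_out=. position.proto
#
#   syntax = "proto3";
#   message Position {
#     int32 bearing = 1;
#     int32 speed   = 2;
#   }

import position_pb2   # the module protoc generates from position.proto

# Build a message and serialise it to protobuf wire format
pos = position_pb2.Position(bearing=270, speed=12)
wire_bytes = pos.SerializeToString()    # bytes in protobuf wire format

# Any program, in any language, built from the same schema can now
# parse wire_bytes back into an equivalent object
decoded = position_pb2.Position()
decoded.ParseFromString(wire_bytes)
print(decoded.bearing)                  # 270
```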
Quite a lot of programming languages provide a "code first" option, e.g. C#, Java, C++ (with Boost), etc. Here you annotate your own class definitions to describe how the class should be serialised. These can also produce output encoded as JSON, or XML, or something else, but you cannot so easily exchange the messages with another program written in a different language (because that language's compiler cannot understand the original class definitions).
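For a rough flavour of "code first", here's a sketch in Python using only the standard library (the class is invented; the same idea appears in C# attributes, Java annotations, Boost serialisation code, etc.):

```python
import json
from dataclasses import dataclass, asdict

# "Code first": the class definition itself is the description of the message;
# there is no separate schema file for another language's tools to consume.
@dataclass
class Position:
    bearing: int
    speed: int

pos = Position(bearing=270, speed=12)
text = json.dumps(asdict(pos))          # '{"bearing": 270, "speed": 12}'

# Another Python program can rebuild the object, but a C++ or Java program
# never sees the Python class definition - only the JSON text.
restored = Position(**json.loads(text))
```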
Generally speaking, the "wire format" of different serialisations are not compatible; a program emitting protobuf wire format will not be understood by another program expecting to receive ASN.1's BER wire format.
Summary:
Serialisation is a way of abstracting the information stored in your program's objects so that another program, possibly written in a different language, can also understand it. On the whole, the exact detail of how the information is represented in the wire format is something that neither the developer nor the user actually cares about, so long as there is agreement on which wire format should be used.
Storing Data
All of these are simply ways of representing information as a set of bytes that conform to a "wire format" standard. What you do with those bytes is up to you! You can store them in a file, send them down a network stream, or share them between processes in memory.
Here's where it can get interesting though. Different wire formats may, or may not, be self-demarcating. Consider JSON, which contains a lot of { and } characters. Correctly formatted JSON will have a balanced number, and ultimately there is the outer { and }. What this means is that a program reading the JSON bytes can tell when it's read enough to have got a complete message - it gets the closing }. So it's perfectly feasible to have a number of JSON messages in a file, and the reading program can split them apart and make sense of them.
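For instance, Python's standard json module can pull concatenated JSON messages apart with raw_decode, which reports where each complete message ends (the messages themselves are just made up):

```python
import json

# Several JSON messages back-to-back in one buffer / file / stream
buffer = '{"bearing": 10}{"bearing": 20}{"bearing": 30}'

decoder = json.JSONDecoder()
messages = []
pos = 0
while pos < len(buffer):
    # raw_decode stops at the closing brace of one complete message and
    # reports where it stopped, so the next message can be read from there
    obj, end = decoder.raw_decode(buffer, pos)
    messages.append(obj)
    pos = end

print(messages)   # [{'bearing': 10}, {'bearing': 20}, {'bearing': 30}]
```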
However, protobuf's wire format does not do this. You have to have some other means of separating the messages (e.g. one file per message). If you have to send protobuf messages down a network connection, using ZeroMQ might be a good idea (because it transports and demarcates messages, rather than a raw byte stream).
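If you don't have something like ZeroMQ doing the demarcation for you, a common do-it-yourself approach is to put a length prefix in front of each message. A sketch (wire_bytes here stands for any serialised protobuf message):

```python
import io
import struct

def write_framed(stream, wire_bytes):
    # 4-byte big-endian length, then the message bytes themselves
    stream.write(struct.pack(">I", len(wire_bytes)) + wire_bytes)

def read_framed(stream):
    # Read the length prefix, then exactly that many message bytes
    header = stream.read(4)
    if len(header) < 4:
        return None                     # end of stream
    (length,) = struct.unpack(">I", header)
    return stream.read(length)

buf = io.BytesIO()
write_framed(buf, b"\x08\x0e")          # pretend these are protobuf bytes
write_framed(buf, b"\x08\x0f")
buf.seek(0)
print(read_framed(buf), read_framed(buf))
```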
These aspects can influence which serialisation technology you'd want to use.
Binary vs Text
This refers to whether or not the "wire format" is readable text. JSON, XML, and a whole heap of others are readable text (though not necessarily trivial for a human being to fully absorb, especially XML!). Protobuf and ASN.1 BER/PER are binary.
The difference tends to be efficiency. Binary representations of information take up less storage. Text representations take up more. For example, the number 1,435,345,456 takes 10 bytes of storage as text, but only 4 as binary. Floating point numbers are even worse in text.
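A couple of lines of Python show the difference (this is just the arithmetic, not any particular wire format):

```python
import struct

n = 1_435_345_456

print(len(str(n)))                  # 10 - ten ASCII characters as text
print(len(struct.pack(">I", n)))    # 4  - four bytes as a 32-bit binary integer
```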
In fact, the reason Google created Protobuf was to get away from using XML. The irony is that, having identified that getting away from XML would save them a fortune in storage costs, they decided to invent Protobuf from scratch. Had they done just a little bit of googling, they'd have come across ASN.1 uPER, which is even more efficient, already existed, and was already well standardised. The Google presenter at the conference where GPB was announced to the world confessed to never having heard of ASN.1 - a technology that has existed since the 1980s, is a huge component of telephony and the internet, and is still very current.
Binary is faster, too. It takes less time to encode an integer field in an object as 4 bytes than as 10 bytes.
So, What's Missing?
What's missing is constraints.
Suppose you have a message that contains a field which represents bearing. Now, you might want that to be limited to between 0 and 359. With protobuf, there's nothing you can do in the .proto schema file to say that it should be 0 to 359, except write a comment and hope that all developers read that comment. This is very unreliable.
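So the range check ends up as hand-written code that every program handling the message has to remember to call, something like this (a sketch, reusing the hypothetical Position message from the earlier example):

```python
def check_position(msg):
    # Nothing in the .proto schema enforces this; the rule lives only in
    # hand-written code, duplicated in every language that handles the message.
    if not 0 <= msg.bearing <= 359:
        raise ValueError(f"bearing out of range: {msg.bearing}")

decoded = position_pb2.Position()
decoded.ParseFromString(wire_bytes)
check_position(decoded)                 # validation is entirely up to you
```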
What ASN.1, XML and JSON schema allow is to say that the bearing field is constrained in value, making its validation simple and/or automatic. In practice, the benefit is variable.
The ASN.1 tools generally do a good job with constraints (which can get very elaborate in ASN.1).
There aren't many tools that will consume XML/XSD schema and pay any attention to the constraints one can define in the XSD file (e.g. Microsoft's xsd.exe is awful in this respect).
JSON validators do (so far as I know) pay attention to the constraints one can put in a JSON schema. BTW, JSON schemas seem mostly to be used for validating JSON data, rather than as input to a source code generator that builds the constraint checks into the generated code.
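As an example of the validator style, here's the bearing constraint expressed in a JSON schema and checked with the third-party jsonschema package for Python (the schema itself is a sketch):

```python
import jsonschema   # third-party: pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "bearing": {"type": "integer", "minimum": 0, "maximum": 359}
    },
    "required": ["bearing"],
}

jsonschema.validate({"bearing": 270}, schema)   # passes silently
jsonschema.validate({"bearing": 720}, schema)   # raises ValidationError
```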
Quick Note about ASN.1
ASN.1 is the grandfather of all serialisation technologies and, in my opinion, the only one that is "complete". It does both binary and text wire formats. It has both very compact and more verbose binary formats. The text formats encompass XML and JSON (yep, the ancient 1980s ASN.1 standard has been updated to incorporate XML and JSON). It does constraints on both value ranges and array sizes, and the uPER binary wire format uses those constraints to further reduce data sizes. For example, if an integer field is constrained to between 0 and 15, it will use only 4 bits to store it. Its schema allows the definition of values as well as message types, meaning that more "system" information can be defined in a single place.
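The bit-width arithmetic behind that is easy to show (this is not an ASN.1 encoder, and it ignores PER's alignment and extensibility details - it's just the size calculation that constrained encodings are built on):

```python
def bits_needed(lo, hi):
    # A field constrained to lo..hi has (hi - lo + 1) possible values, so it
    # needs ceil(log2(hi - lo + 1)) bits - no tag, no length, just the value.
    return (hi - lo).bit_length()

print(bits_needed(0, 15))    # 4 bits, instead of a whole 32-bit integer
print(bits_needed(0, 359))   # 9 bits for the bearing example above
```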
The only pity is that the good tools cost money, and Google preferred to spend far more money building something new (and worse) than it would have cost to fund a good OSS ASN.1 compiler.
JSON
AFAIK, JSON schemas are predominantly used to validate JSON data. Whilst valuable, that is not as useful as having the schema drive a source code generator, because when the schema is only a validator you still have to write the code that generates the JSON data in the first place. I know this wasn't a problem in JSON's origins - a JSON file is JavaScript - but it would be nice if someone did a proper code generator for other languages.
WebASM
This is making life interesting. Where once upon a time JSON ruled the web, because everything was JavaScript, the same is no longer true. It is perfectly possible these days to take code generated by an ASN.1 compiler for C and build it into a program running as WebAssembly. Thus you can have JSON emitted by a WebAssembly module that was compiled from C, using a serialisation standard that first originated in the 1980s.