Serialization with Protocol Buffers in a Schemaless Database

We're using MySQL to store schemaless data (see Using a Relational Database for Schemaless Data, a solution inspired by how FriendFeed uses MySQL for the same purpose).

One big table holds all entities for our application:

CREATE TABLE entities (
  added_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, id BINARY(16) NOT NULL
, body MEDIUMBLOB
, UNIQUE KEY (id)
) ENGINE=InnoDB ;

A few details:

  • The only required property of stored entities is id, a 16-byte UUID. The rest of the entity is opaque to the database. We can change the "schema" simply by storing new properties in the body.

  • The added_id column is present because InnoDB stores data rows physically in primary key order. The AUTO_INCREMENT primary key ensures new entities are written sequentially on disk after old entities, which helps for read/write locality (new entities are read more frequently than old entities).

  • Our database stores our schemaless data in the body column. This is the topic of this question; a minimal read/write sketch follows this list.

  • There are plenty of other interesting details, like "reaching into" the body data to build asynchronous materialized views (indexes are just tables that are built offline), but they're not relevant to the current discussion.
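
For concreteness, here's a minimal sketch of the read/write path against this table. It assumes a Python DB-API connection (e.g. from MySQLdb) and some serializer for the body; the function names are just illustrative:

import uuid

def put_entity(conn, entity_id, body_bytes):
    # entity_id is a uuid.UUID; body_bytes is the opaque serialized body.
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO entities (id, body) VALUES (%s, %s)"
        " ON DUPLICATE KEY UPDATE body = VALUES(body)",
        (entity_id.bytes, body_bytes))
    conn.commit()

def get_entity(conn, entity_id):
    # Returns the serialized body blob, or None if the entity doesn't exist.
    cur = conn.cursor()
    cur.execute("SELECT body FROM entities WHERE id = %s", (entity_id.bytes,))
    row = cur.fetchone()
    return row[0] if row else None

# Usage: put_entity(conn, uuid.uuid4(), serialize(entity_dict))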

How should we be serializing the structured data (key-value pairs) in the body?

JSON or BSON would be simple, but the field names are repeated for each row. This gives them an advantage in flexibility but also a big disadvantage in space efficiency (a per-row overhead for field names in the serialized data). We're trying to keep things in memory, and minimizing both memory and network footprint is important here. The more records we can fit in the same space, the faster our queries will be. We prefer relatively long, descriptive field names, and shortening them to make our database faster is wrong!
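
To make the overhead concrete, here's a quick back-of-the-envelope comparison (the field names and values are made up) showing how much of each JSON row is spent on key names alone:

import json

# A hypothetical entity with the long, descriptive field names we prefer.
entity = {
    "user_id": "af8c2d6e",
    "created_timestamp": 1290650000,
    "notification_preference": "daily_digest",
}

verbose = json.dumps(entity, separators=(",", ":"))
# The same values with one-letter keys, mimicking a tag-based encoding.
terse = json.dumps({"a": entity["user_id"],
                    "b": entity["created_timestamp"],
                    "c": entity["notification_preference"]},
                   separators=(",", ":"))

print(len(verbose), len(terse))  # the difference is paid on *every* row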

In the end, JSON/BSON is unworkable for our purposes, unless we get more complex and map small keys to more descriptive keys in the application driver that talks to the database. Which got us thinking...

Although our database is schemaless, in reality: 1) there aren't too many different kinds of entities, 2) versions of the same kind of entity don't change often, and 3) when they do change, it's usually just to add another field. JSON/BSON have no native support for versioning.

Protocol Buffers and Thrift are much more sophisticated when it comes to versioning and data definition changes. Both Thrift and Protocol Buffers are great candidates for serializing data into databases, and Thrift is designed so that the encoding format is extensible.
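
To sketch what that versioning buys us (the Entity message below is hypothetical, and entity_pb2 is what protoc would generate from it):

# entity.proto (hypothetical), version 1:
#   message Entity {
#     optional string user_id = 1;
#     optional int64 created = 2;
#   }
# Version 2 of the same entity just appends a field with a fresh tag number:
#     optional string notification_preference = 3;
# The wire format carries tag numbers, not names, so v1 and v2 rows coexist:
# an old reader preserves the unknown field, and a new reader sees the
# default value for a field that's missing from an old row.

import entity_pb2  # generated with: protoc --python_out=. entity.proto

e = entity_pb2.Entity()
e.user_id = "af8c2d6e"
e.created = 1290650000
body = e.SerializeToString()  # compact bytes for the body column

parsed = entity_pb2.Entity()
parsed.ParseFromString(body)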

Protocol Buffers look like a great choice for serializing data in a schemaless database.

CouchDB and MongoDB (the two most popular schemaless databases?) use JSON and BSON respectively, but we can't find anything about using something more advanced, like Protocol Buffers, as a serialization format for storing schemaless data. There are products that store a specific language's version of objects (i.e. storing Java's Externalizable objects in a datagrid, or doing NoSQL with MySQL in Ruby), but these are a pain (try accessing them from other platforms, or even from MySQL itself, and forget about versioning).

Is anyone storing the more interoperable Protocol Buffers in their database, or some other advanced serialization format in their schemaless database? This is a question of whether there are other options besides straightforward per-row serialization of JSON/BSON/XML, or serializing a specific language's objects. Is it even feasible? Are we missing something? Sorry for the stream-of-consciousness narrative!

Theta answered 25/11, 2010 at 1:11 Comment(2)
Thanks for the Friendfeed reference and the details in this question. One thing I noticed is that Friendfeed did not use protocol buffers in their schemaless MySQL implementation even though they came from Google... I wonder why? It's been a while since your post -- just wondering what you decided to do and how it turned out for you (especially if you decided to use protocol buffers).Murk
Thanks. I had a very similar question, although it was a bit more generic: #17441928. I hope that we see something that allows system developers to lay out a stricter yet scalable schema on our NoSQL implementations. Protocol Buffers seems like a very good start for designing the schema definition and version-control mechanisms.Alienee

As you found out, MongoDB and CouchDB have strong opinions about how you store your data. If you're looking for a storage-agnostic approach, you'll want to do something like @Joshua suggests and look at Cassandra or HBase. Even these two datastores have opinions about how data should be stored (both draw on Google's Bigtable design) and keep data in column families.

Riak supports Protocol Buffers as one method of moving data between your application and the datastore. It might be worth checking out to see if it fits your needs. It looks like you're largely planning to do single-key lookups, so Riak may be a strong contender for your solution.

Elvyn answered 30/11, 2010 at 13:1 Comment(5)
Please read the question. He is asking how to do it with MySQL.Elna
Yes, and then I read the rest of the question, including "Is anyone storing the more interoperable Protocol Buffers in their database, or some other advanced serialization format in their schemaless database?"Elvyn
And so you realized he is asking for some other advanced serialization format in a schemaless MySQL database. When the context of the question has been established, the noun modifier for subjects can be omitted. People who skip to the last paragraph without reading through the whole question will dish out an out-of-context answer like yours.Elna
You are correct. I have failed humanity. I must go now, my people need me to receive additional training in attempting to contribute to pedantry.Elvyn
The training you require is merely to read through the whole question before posting an answer. That's the part you have failed.Elna

You may want to look into something like Cassandra or HBase for storing your data. The issue with the opaque data blob is that you can't query against it with your MySQL schema here. If you're looking for something, you'll have to read in every blob and check it. If that's really unimportant to how you're doing lookups (i.e. you always know the key), then I would suggest using protocol buffers to serialize the data, possibly compressing it with zlib or LZO.

Protocol buffers allow you to create a simple data structure that can accept additional fields as your data evolves. Field names are stored as numbers, and code to work with the structures is generated automatically from your .proto file. Performance is good and data sizes are kept quite small. You could optionally compress the data, either using MySQL's COMPRESS() or one of the real-time compression libraries summarized here (not just Java):

Fast compression in Java?
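
A sketch of that pipeline, assuming a protoc-generated module (call it entity_pb2, as it's not shown here) and zlib from Python's standard library (LZO would need a third-party binding):

import zlib
import entity_pb2  # hypothetical protoc-generated module

def encode_body(entity):
    # Serialize the protobuf message, then compress the result.
    return zlib.compress(entity.SerializeToString())

def decode_body(blob):
    # Inverse of encode_body: decompress, then parse.
    entity = entity_pb2.Entity()
    entity.ParseFromString(zlib.decompress(blob))
    return entity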

Hope this helps.

Listen answered 25/11, 2010 at 2:36 Comment(3)
He has stated that Cassandra / HBase are not valid options.Elna
He only stated that CouchDB and MongoDB were out because of their use of JSON/BSON. His investigation of NoSQL technologies implies a willingness to switch if the benefits are appropriate.Listen
Actually no. He wants to build a schemaless solution on top of a robust RDBMS, aka achieving NoSQL with MySQL. Read the whole question and go through the links linked.Elna

PostgreSQL now has a JSON type: http://www.postgresql.org/docs/9.3/static/datatype-json.html

You can make queries where you "reach into" these values.

It should be pretty easy to convert Protobuf to JSON.
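
For instance, the official Python protobuf package ships a json_format module that does exactly this (entity_pb2 here stands in for a hypothetical protoc-generated module):

from google.protobuf import json_format
import entity_pb2  # hypothetical protoc-generated module

e = entity_pb2.Entity(user_id="af8c2d6e")
print(json_format.MessageToJson(e))  # -> {"userId": "af8c2d6e"} (modulo whitespace)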

Crib answered 6/3, 2014 at 10:22 Comment(1)
This is a MySQL question.Elna

I'll refer you to an answer I put forward a few months back on a sort-of similar topic. We use MySQL and a custom text format which proved faster than XML or JSON formats:

What scalability problems have you encountered using a NoSQL data store?

It's working well for us. We didn't try Protocol Buffers, though.

Receptacle answered 30/11, 2010 at 13:7 Comment(5)
You are probably using a bad JSON or XML parser -- the overhead with a good implementation should be low enough to make custom formats impractical.Escalade
No I don't think we did, I think in all honesty our specific problem was most performantly tackled using a custom format. That has other downsides, but performance alone was not one of them.Receptacle
Out of curiosity, which lang/platform is that on? My experience is with Java. And I am not saying custom format can't be fast (certainly can), just that it's hard to be significantly faster.Escalade
Yep, it's Java too. A custom text format - for our very specific range of deserializations, i.e. objects only contained a small number of child possibilities - was faster than XML, JSON, and even binary deserialization. Partially due to simple size: whereas binary and XML rows in the DB might be, say, 10000 x 1.5 KB in size, with corresponding disk IO, our custom format might only be 10000 x 0.6 KB.Receptacle
Ok. Thanks for details -- I have just seen many devs use suboptimal tools to begin with, which can give sort of skewed baseline (like, use DOM for xml processing, or org.json package)Escalade
