Looking for a fast, compact, streamable, multi-language, strongly typed serialization format

I'm currently using JSON (compressed via gzip) in my Java project, in which I need to store a large number of objects (hundreds of millions) on disk. I have one JSON object per line and disallow line breaks within a JSON object. This way I can stream the data off disk line by line without having to read the entire file at once.
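
Concretely, the current read path looks roughly like the sketch below (a minimal illustration of the gzipped, line-delimited approach; the file name is a placeholder and the per-line JSON parsing is elided):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class LineStream {
        public static void main(String[] args) throws IOException {
            // Decompress on the fly and read one JSON object (one line) at a time,
            // so the whole file never needs to fit in RAM.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(
                            new GZIPInputStream(new FileInputStream("objects.json.gz")),
                            StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // each line is one complete JSON object; parse it here
                }
            }
        }
    }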

It turns out that parsing the JSON code (using http://www.json.org/java/) is a bigger overhead than either pulling the raw data off disk, or decompressing it (which I do on the fly).

Ideally what I'd like is a strongly typed serialization format, where I can specify "this object field is a list of strings" (for example), and because the system knows what to expect, it can deserialize quickly. I could also describe the format to someone else just by giving them its "type".

It would also need to be cross-platform. I use Java, but work with people using PHP, Python, and other languages.

So, to recap, it should be:

  • Strongly typed
  • Streamable (i.e. able to read a file bit by bit without having to load it all into RAM at once)
  • Cross platform (including Java and PHP)
  • Fast
  • Free (as in speech)

Any pointers?

Gladis answered 28/7, 2009 at 2:17 Comment(3)
If pulling raw data off the disk is faster, why not do that? Why mess with JSON if it's slower? – Ministerial
Okay, so parsing JSON is slower than decompressing or reading the data off the disk. So what? Is it too slow for what you need to do? Or are you optimising just for the sake of it? – Pulverable
Breton: it is too slow for what I need to do; it's not a premature optimization. – Gladis

Have you looked at Google Protocol Buffers?

http://code.google.com/apis/protocolbuffers/

They're cross-platform (C++, Java, Python), with third-party bindings for PHP as well. The format is fast, fairly compact, and strongly typed.
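
For a line-per-object workload like the one in the question, a rough sketch of what reading and writing could look like (this assumes a Record message compiled with protoc; the message name and its fields are made up for illustration, not taken from the question):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class ProtobufStreamSketch {
        // Assumes a schema along these lines, compiled with protoc:
        //
        //   message Record {
        //     required string id   = 1;
        //     repeated string tags = 2;   // "this field is a list of strings"
        //   }
        //
        // The generated Record class provides writeDelimitedTo()/parseDelimitedFrom(),
        // which length-prefix each message so a large file can be read back one
        // record at a time instead of all at once.

        static void writeAll(Iterable<Record> records, OutputStream raw) throws IOException {
            try (OutputStream out = new GZIPOutputStream(raw)) {
                for (Record r : records) {
                    r.writeDelimitedTo(out);
                }
            }
        }

        static void readAll(InputStream raw) throws IOException {
            try (InputStream in = new GZIPInputStream(raw)) {
                Record r;
                while ((r = Record.parseDelimitedFrom(in)) != null) {  // null at end of stream
                    // process one record here without loading the rest of the file
                }
            }
        }
    }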

There's also a useful comparison between various formats here:

http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

You might want to consider Thrift or one of the other formats compared there as well.

Mikvah answered 28/7, 2009 at 2:23 Comment(1)
...and there's Google backing it. – Alli

I've had very good results parsing JSON with Jackson.

Jackson is a:

  • Streaming (reading, writing)
  • FAST (measured to be faster than any other Java json parser and data binder)
  • Powerful (full data binding for common JDK classes as well as any Java bean class, Collection, Map or Enum)
  • Zero-dependency (does not rely on other packages beyond JDK)
  • Open Source (LGPL or AL)
  • Fully conformant

JSON processor (JSON parser + JSON generator) written in Java. Beyond basic JSON reading/writing (parsing, generating), it also offers full node-based Tree Model, as well as full OJM (Object/Json Mapper) data binding functionality.

Its performance is very good when compared to many other serialisation options.
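
For the gzipped, one-object-per-line layout from the question, each line becomes a single readValue() call. The sketch below uses current Jackson 2.x package names (the 2009-era classes lived under org.codehaus.jackson), and the Entry POJO is a made-up example:

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.zip.GZIPInputStream;

    public class JacksonLineDemo {
        // Made-up POJO standing in for whatever one JSON line looks like.
        public static class Entry {
            public String id;
            public List<String> tags;
        }

        public static void main(String[] args) throws IOException {
            ObjectMapper mapper = new ObjectMapper();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(
                            new GZIPInputStream(new FileInputStream("objects.json.gz")),
                            StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    Entry e = mapper.readValue(line, Entry.class);  // bind one line to a typed object
                    // process e ...
                }
            }
        }
    }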

Samos answered 28/7, 2009 at 7:15 Comment(1)
Use Jackson before trying anything else. The code on json.org isn't suitable for production use. – Krystenkrystin

You could take a look at YAML: http://www.yaml.org/

It's a superset of JSON, so the data file structure will be familiar to you. It supports some additional data types, as well as references (anchors and aliases) that let you include part of one data structure in another.

I have no idea whether it will be "fast enough", but the libyaml parser (written in C) seems pretty snappy.
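
If you want to try this from Java, SnakeYAML (a separate pure-Java library, not the C libyaml mentioned above) can iterate over a file of "---"-separated YAML documents without loading everything at once; a rough sketch:

    import org.yaml.snakeyaml.Yaml;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    public class YamlStreamDemo {
        public static void main(String[] args) throws IOException {
            Yaml yaml = new Yaml();
            // loadAll() iterates over the YAML documents in the stream one at a time,
            // so the whole file does not have to be parsed up front.
            try (InputStream in = new GZIPInputStream(new FileInputStream("objects.yaml.gz"))) {
                for (Object doc : yaml.loadAll(in)) {
                    // each doc is a Map/List/scalar graph; process it here
                }
            }
        }
    }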

Appalachia answered 28/7, 2009 at 2:36 Comment(3)
While YAML is in no way a superset of JSON, I agree that it is one of the most readable/compact/typed formats I know. – Bertha
YAML is way more complex than JSON. I think most implementations are slower. – Demagoguery
AFAIK, yes, implementations are not very performant. YAML is geared towards somewhat different goals (maximum expressiveness and so on), not speed or simplicity. – Chiropractic
