Bigtable / HBase: Rich column family vs a single JSON Object

Disclosure: I lead product management for Cloud Bigtable.

If you don't plan to retrieve or update data at per-column granularity, your plan of storing each JSON document as a single value is fine. If you store per-column data instead, the column family name and the column qualifier must also be stored with every value in each row. That storage overhead is proportional to the number of values, so it may be meaningful at your scale. In your model, you'll simply be using Bigtable as a key-value store.
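To make the two layouts concrete, here is a minimal sketch in plain Python that models a row as a list of (family, qualifier, value) cells. The family name `d`, the qualifier `json`, and the sample document are all hypothetical, and the byte count covers only the repeated names, not Bigtable's actual on-disk format.

```python
import json

def as_single_cell(doc, family="d", qualifier="json"):
    """Whole document in one cell: Bigtable used as a plain key-value store."""
    value = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    return [(family, qualifier, value)]

def as_many_cells(doc, family="d"):
    """One cell per top-level field; each value stays JSON-encoded."""
    return [
        (family, key, json.dumps(val, separators=(",", ":")).encode("utf-8"))
        for key, val in doc.items()
    ]

def name_bytes(cells):
    """Bytes spent on family and qualifier names, which repeat in every row."""
    return sum(len(f.encode("utf-8")) + len(q.encode("utf-8")) for f, q, _ in cells)

# Hypothetical sample document.
doc = {"user_id": 42, "name": "Ada", "tags": ["a", "b"]}
single = as_single_cell(doc)
many = as_many_cells(doc)

print(len(single), name_bytes(single))
print(len(many), name_bytes(many))
```

Keep in mind that the single JSON value still carries the field names inside it, so the real difference depends on per-cell metadata and on compression. Treat this as a way to reason about the layouts, not as a size prediction.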

If you do decide to break your JSON apart into many columns in the future, you can add additional column families to an existing Bigtable table (or just use additional column qualifiers within your existing column family) and rewrite your data via a parallel process such as Hadoop MapReduce or Google Cloud Dataflow.
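The per-row step of such a rewrite could look roughly like the sketch below. It is only the pure transformation that a MapReduce or Dataflow job would apply to each row; reading from and writing to Bigtable is deliberately left out. The field-to-family mapping, the family names, and the leftover cell are hypothetical.

```python
import json

# Hypothetical mapping from JSON field to the column family it would move to.
FIELD_TO_FAMILY = {"name": "profile", "email": "profile", "last_login": "activity"}
# Hypothetical family/qualifier that keeps any fields not mapped above.
LEFTOVER = ("d", "json")

def split_row(row_key, json_value):
    """Turn one (row key, JSON blob) pair into per-column cells.

    Returns a list of (row_key, family, qualifier, value_bytes) tuples.
    """
    doc = json.loads(json_value)
    cells = []
    leftover = {}
    for field, value in doc.items():
        family = FIELD_TO_FAMILY.get(field)
        if family is None:
            leftover[field] = value  # not migrated yet, stays in the JSON cell
        else:
            cells.append((row_key, family, field, json.dumps(value).encode("utf-8")))
    if leftover:
        cells.append((row_key, *LEFTOVER, json.dumps(leftover).encode("utf-8")))
    return cells

cells = split_row(b"user#1", b'{"name": "Ada", "email": "ada@example.org", "extra": 1}')
for cell in cells:
    print(cell)
```

Because the function has no side effects, it can be tested locally on sample rows before it is plugged into whichever parallel framework you choose.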

Side note: JSON is verbose and takes up a fair amount of space. You can precompress it yourself, but Cloud Bigtable already compresses data natively (and transparently), which helps mitigate this. That said, one alternative to consider is protocol buffers or another binary encoding, which would be more space-efficient.
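As a rough illustration of the space difference, the sketch below encodes the same record as JSON and as a fixed binary layout. It uses Python's standard `struct` module as a stand-in for protocol buffers so that it runs without extra dependencies; the record and its field layout are hypothetical.

```python
import json
import struct
import zlib

# Hypothetical record with a fixed, known schema.
record = {"user_id": 42, "score": 3.5, "active": True}

raw_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Fixed binary layout (little-endian, no padding):
# unsigned 64-bit id, 64-bit float, 1-byte bool.
LAYOUT = struct.Struct("<Qd?")
packed = LAYOUT.pack(record["user_id"], record["score"], record["active"])

def unpack(blob):
    """Decode the binary layout back into a dict."""
    user_id, score, active = LAYOUT.unpack(blob)
    return {"user_id": user_id, "score": score, "active": active}

# Client-side precompression, the "do it yourself" option.
precompressed = zlib.compress(raw_json)

print("json bytes:  ", len(raw_json))
print("binary bytes:", len(packed))
```

The binary form is smaller mainly because it drops the field names, but you give up self-describing data and have to manage the schema yourself. On a single small record, `zlib` may not shrink the JSON at all, so measure on realistic values before deciding.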

Given that you plan to store multiple petabytes of data, you will likely need more than the default quota of 30 Bigtable nodes. If so, please request additional quota for your use case.

Please see the Bigtable performance page for a ballpark measure of performance you should expect per Bigtable server node, but you should benchmark your specific read/write patterns to establish the baseline norms, and scale accordingly.
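Once you have your own benchmark numbers, a back-of-the-envelope node estimate could be sketched as below: take the larger of the node count implied by storage and the one implied by throughput. Every number in the example is a placeholder, not a Bigtable figure; substitute the per-node storage limit from the documentation and the per-node throughput you measure.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)

def nodes_needed(total_bytes, bytes_per_node, peak_ops_per_sec, ops_per_node_per_sec):
    """Estimate nodes as the max of the storage-bound and throughput-bound counts."""
    by_storage = ceil_div(total_bytes, bytes_per_node)
    by_throughput = ceil_div(peak_ops_per_sec, ops_per_node_per_sec)
    return max(by_storage, by_throughput)

# Placeholder inputs only: 2 PB of data, 5 TB usable per node,
# 1,000,000 peak ops/s, 10,000 ops/s per node from your own benchmark.
estimate = nodes_needed(
    total_bytes=2 * 10**15,
    bytes_per_node=5 * 10**12,
    peak_ops_per_sec=1_000_000,
    ops_per_node_per_sec=10_000,
)
print(estimate)
```

Comparing the result with the default quota of 30 nodes tells you early whether a quota request is needed.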

Best of luck with your project!
