Efficient way to store a JSON string in a Cassandra column?
Cassandra newbie question. I'm collecting some data from a social networking site using REST calls. So I end up with the data coming back in JSON format.

The JSON is only one of the columns in my table. I'm trying to figure out what the "best practice" is for storing the JSON string.

First I thought of using the map type, but the JSON contains a mix of strings, numerical types, etc. It doesn't seem like I can declare wildcard types for the map key/value. The JSON string can be quite large, probably over 10KB in size. I could potentially store it as a string, but it seems like that would be inefficient. I would assume this is a common task, so I'm sure there are some general guidelines for how to do this.

I know Cassandra has native support for JSON, but from what I understand, that's mostly used when the entire JSON map matches 1-1 with the database schema. That's not the case for me. The schema has a bunch of columns and the JSON string is just a sort of "payload". Is it better to store the JSON string as a blob or as text? BTW, the Cassandra version is 2.1.5.

Any hints appreciated. Thanks in advance.

Haveman answered 31/3, 2016 at 19:54 Comment(0)
In the Cassandra storage engine there's really not a big difference between blob and text, since Cassandra essentially stores text as blobs. And yes, the "native" JSON support you speak of is only for when your data model matches your JSON model, and it's only in Cassandra 2.2+.

I would store it as a text type, and you shouldn't have to implement anything to compress your JSON data when sending it (or to decompress it on the way back), since Cassandra's binary protocol supports transport compression. Also make sure your table stores the data compressed with the same compression algorithm (I suggest using LZ4 since it's the fastest algorithm implemented) to save on doing compression for each read request. Thus if you configure table compression and use transport compression, you don't have to implement either yourself.
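As a sketch, the table-side half of this is a CQL option at table-creation time (keyspace, table, and column names here are placeholders; note that in Cassandra 2.1 LZ4 is already the default compressor):

```sql
-- Hypothetical table: a few schema columns plus the JSON "payload" as text.
CREATE TABLE social.posts (
    id uuid PRIMARY KEY,
    fetched_at timestamp,
    payload text                -- the raw JSON string
) WITH compression = {'sstable_compression': 'LZ4Compressor'};

-- Or change an existing table:
ALTER TABLE social.posts
    WITH compression = {'sstable_compression': 'LZ4Compressor'};
```

(The option key is `sstable_compression` in the 2.1 line; later Cassandra versions renamed it to `class`.)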

You didn't say which client driver you're using, but here's the documentation on how to set up transport compression for the Datastax Java driver.
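For the Java driver it's a one-liner on the `Cluster` builder; a sketch (the contact point is a placeholder, and LZ4 transport compression additionally requires the lz4 jar on the classpath):

```java
Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")                       // placeholder address
    .withCompression(ProtocolOptions.Compression.LZ4)   // or SNAPPY
    .build();
```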

Aneroidograph answered 31/3, 2016 at 22:13 Comment(4)
Thanks for the response. I'm using Spring Data Cassandra, 1.3.4.RELEASE, which forces me to stay with a Cassandra Driver version 2.X. I'm using 2.1.9 as the driver version. Spring provides a factory bean to create the Cluster instance, and it looks like they only support no compression, or Snappy. The method for specifying this takes an Enum as the only argument, and the Enum only has those two options. Not sure why. I guess I'll try with Snappy for now since it's supported. Or I may drop Spring Data Cassandra and just instantiate the Cluster manually.Haveman
I am not a fan of spring-data for Cassandra, since its API was designed for relational databases, which has driven some poor implementation decisions. Examples include: CassandraOperations.insert(List<Object> objects) will issue a BATCH statement for all of the inserts, which is an anti-pattern; if you implement Pageable data, it will do a count(*); and by default it doesn't use autopaging of the data (you have to opt into it via Pageable slices). As such I highly recommend using the Datastax driver instead; you will get better control and features for developing against Cassandra.Aneroidograph
Good to know @fromanator. Another limitation I ran into recently is that Spring Data Cassandra does not support the 3.X drivers from DataStax, so for the time being, I'm stuck on 2.X.Haveman
You should be able to use both Spring-Data and the Datastax driver concurrently without issues (behind the scenes each will have its own Cluster connection). That way you could start migrating off Spring-Data. I would also check out the mapping API in the Datastax driver if you chose Spring-Data for something ORM-like.Aneroidograph
It depends on how you want to query your JSON. There are 3 possible strategies:

  1. Store as a string
  2. Store as a compressed blob
  3. Store as a blob

Option 1 has the advantage of being human readable when you query your data on the command line with cqlsh, or if you want to debug live data directly. The drawback is the size of this JSON column (10KB+).

Option 2 has the advantage of keeping the JSON payload small, because text elements compress at a pretty decent ratio. The drawbacks are: a. you need to take care of compression/decompression client side, and b. it's not directly human readable.

Option 3 has the drawbacks of option 1 (size) and option 2 (not human readable).
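For option 2, a minimal client-side sketch using only the JDK's built-in GZIP (the class name and the sample JSON are illustrative; a real application might prefer an LZ4 library instead):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class JsonBlobCompression {

    // Compress a JSON string into bytes suitable for a blob column.
    static byte[] compress(String json) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Reverse: read the blob back into the original JSON string.
    static String decompress(byte[] blob) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Repetitive JSON (like API payloads) compresses well; ratios vary with real data.
        String json = "{\"user\":\"alice\",\"posts\":["
                + "{\"id\":1,\"text\":\"hello\"},".repeat(200)
                + "{\"id\":2,\"text\":\"bye\"}]}";
        byte[] blob = compress(json);
        System.out.println(blob.length < json.length());   // compressed is smaller
        System.out.println(decompress(blob).equals(json)); // round-trip is lossless
    }
}
```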

Anemochore answered 31/3, 2016 at 21:48 Comment(2)
You should be able to utilize table-level compression along with binary transport compression so you don't have to handle the compression yourself. This way you can store it as a text data type, have it compressed when saved and when sent over the wire to your application, while keeping it easily human readable (since your client driver or even cqlsh will present it to you in uncompressed form).Aneroidograph
Yes, there are table compression and transport compression options also, +1Anemochore
