Regarding the answer that @BryceAtNetwork23 provided, it will be the same in Cassandra 2.1 or 2.2 (but Cassandra 3.0 will probably be a different story, as the team is currently rewriting the storage engine, see CASSANDRA-8099). Data is still stored in binary.
However, there's more to say here. You may want to consider the actual data being stored, the performance your project needs to achieve, queries per second, etc.
Depending on these goals or constraints, an interesting approach is to look at the size of the serialized data for a given type in Cassandra.
If the data is a number, for example a `long` in Java, its size is 8 bytes, which matches the Cassandra `bigint` type in size. That means there's no cost associated with serializing it; a plain copy will do. This also has the benefit that the key is small enough not to stress the Cassandra key cache.
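As a quick illustration (a minimal standalone sketch, not project code), a Java `long` serializes to exactly the 8 bytes a `bigint` occupies:

```java
import java.nio.ByteBuffer;

public class BigintSize {
    public static void main(String[] args) {
        // A Java long serializes to exactly 8 bytes, the same size as a
        // Cassandra bigint, so no re-encoding work is needed.
        long key = 158786464563L;
        byte[] serialized = ByteBuffer.allocate(Long.BYTES).putLong(key).array();
        System.out.println(serialized.length); // 8
    }
}
```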
If the data is a piece of text, for example a `String` in Java, it is encoded in UTF-16 at runtime, but when serialized to the Cassandra `text` type, UTF-8 is used. UTF-16 always uses 2 bytes per character (sometimes 4), while UTF-8 is space efficient: depending on the character it can be 1, 2, 3 or 4 bytes long. That means there's CPU work to serialize such data for encoding/decoding purposes. Also, depending on the text, for example `158786464563`, the data will be stored as 12 bytes (versus 8 as a `bigint`). That means more space is used, and more IO as well.
Note that Cassandra offers the `ascii` type, which follows the US-ASCII character set and always uses 1 byte per character.
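To make the size difference concrete, here is a small sketch comparing the UTF-8 serialized size with the in-memory UTF-16 size (using `UTF_16LE` to count without a byte-order mark):

```java
import java.nio.charset.StandardCharsets;

public class TextSize {
    public static void main(String[] args) {
        String digits = "158786464563";
        // Stored as Cassandra text (UTF-8): one byte per ASCII character.
        System.out.println(digits.getBytes(StandardCharsets.UTF_8).length);    // 12
        // The in-memory Java representation is UTF-16: two bytes per char here.
        System.out.println(digits.getBytes(StandardCharsets.UTF_16LE).length); // 24
        // A non-ASCII character costs more in UTF-8, e.g. 'é' takes 2 bytes.
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);       // 2
    }
}
```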
If the data is a UUID (a 128-bit value), the Java `UUID` type uses 2 `long`s, so it is 16 bytes long, and Cassandra stores it as 16 bytes as well (the Java `UUID` type is used).
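A quick sketch of that equivalence: writing the two `long`s of a `java.util.UUID` yields exactly the 16 bytes that Cassandra's `uuid` type stores.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidSize {
    public static void main(String[] args) {
        // A java.util.UUID is two longs (most/least significant bits),
        // which is exactly the 16 bytes Cassandra uses for its uuid type.
        UUID id = UUID.randomUUID();
        byte[] bytes = ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
        System.out.println(bytes.length); // 16
    }
}
```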
Again, it always depends on your project's mileage: its goals, its existing constraints. But here are my uneducated opinions:
- If the data to be inserted is always a number inside the `long` range [−9,223,372,036,854,775,808; +9,223,372,036,854,775,807], I'd go for the `bigint` type.
- UUID is fine.
- If the cluster is not under heavy load (like 100k queries per second) and space is not an issue, then `text` is not a problem; but if it is, or if usage may grow, I'd avoid `text` for keys if possible.
Another option is to use the `blob` type, i.e. a binary type, where it is possible to use the data any way you want according to the business needs of the software. This could allow space-efficient, IO-efficient, and CPU-efficient storage. But depending on the needs, it may be necessary to manage a lot of things in the client code, like ordering, serialization, comparison, mapping, etc.