Conceptual difference concerning column families in Cassandras data model compared to Bigtable?
Asked Answered
L

1

8

I am currently trying to dig into Cassandra's data model and its relation to Bigtable, but ended up with a strong headache concerning the Column Family concept.

Mainly my question was asked and already answered. However, I'm not satisfied with the answers :)

Firstly I've read the Bigtable paper especially concerning its data model, i.e. how data is stored. As far as I understood each table in Bigtable basically relies on a multi-dimensional sparse map with the dimensions row, column and time. The map is sorted by rows. Columns can be grouped with the name convention family:qualifier to a column family. Therefore, a single row can contain multiple column families (see the example figure in the paper).

Although it is stated that Cassandra relies on Bigtable data model, I read multiple times that in Cassandra a column family contains multiple rows and is to some extent comparable to a table in relational data stores. Isn't this contrary to Bigtable's approach, where a row could contain multiple column families? What comes first, the column family or row :)? Are these concepts even comparable?

Loyola answered 5/11, 2017 at 11:15 Comment(0)
T
28

The answer you linked to was from 6 years ago, and a lot has changed in Cassandra since. When Cassandra started out, its data model was indeed based on BigTable's. A row of data could include any number of columns, each of these columns has a name and a value. A row could have a thousand different columns, and a different row could have a thousand other columns - rows do not have to have the same columns. Such a database is called "schema-less", because there is no schema that each row needs to adhere to.

But Toto, we're not in Kansas any more - and Cassandra's model changed in focus (though not in essense) since, and I'll try to explain how and why:

As Cassandra matured, its developers started to realize that schema-less isn't as great as they once thought it was. Schemas are valuable in ensuring application correctness. Moreover, one doesn't normally get to 1000 columns in a single row just because there are 1000 individually-named fields in one record. Rather, the more common case is that the record actually contains 200 entries, each with 5 fields. The schema should fix these 5 fields that every one of these entries should have, and what defines each of these separate entries is called a "clustering key". So around the time of Cassandra 0.8, six years ago, these ideas where introduced to Cassandra as the "CQL" (Cassandra Query Language).

For example, in CQL one declares that a column-family (which was dutifully renamed "table") has a schema, with a known list of fields:

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, username)
)

This schema says that each wide row in the table (now, in modern Cassandra, this was renamed a "partition") with the key "groupname" is a a possibly long list of users, each with username, email and age fields. The first name in the "PRIMARY KEY" specifier is the partition key (it determines the key of the wide rows), and the second is called the clustering key (it determines the key of the small rows that together make up the wide rows).

Despite the new CQL dressup, Cassandra continued to implement these new concepts using the good-old-BigTable-wide-row-without-schema implementation. For example, consider that our data has a group "mygroup" with two people, (john, [email protected], 27) and (joe, [email protected], 38). Cassandra adds the following four column names->values to the wide row:

john:email -> [email protected]
john:age -> 27
joe:email -> [email protected]
joe:age -> 27

Note how we ended up with a wide row with 4 columns - 2 non-key fields per row (email and age), multiplied by the number of rows in the partition (2). The clustering key field "username" no longer appears anywhere as the value, but rather as part of the column's name! So If we have two username values "john" and "joe", We have some columns prefixed "john" and some columns prefixed "joe", and when we read the column "joe:email" we know this is the value of the email field of the row which has username=joe.

Cassandra still has this internal duality - converting the user-facing CQL rows and clustering keys into old-style wide rows. Until recently, Cassandra's on-disk format known as "SSTables" was still schema-less and used composite names as shown above for column names. I wrote a detailed description of the SSTable format on Scylla's site https://github.com/scylladb/scylla/wiki/SSTables-Data-File (Scylla is a more efficient C++ re-implementation of Cassandra to which I contribute). However, column names are very inefficient in this format so Cassandra recently (in version 3.0) switched to a different file format, which for the first time, accepts clustering keys and schema-full rows as first class citizens. This was the last nail in the coffin of the schema-less Cassandra from 7 years ago. Cassandra is now schema-full, all the way.

Theurich answered 5/11, 2017 at 23:25 Comment(2)
Wow, thanks for your time and the really helpful explanation :)!Loyola
This is a fantastic answer. I've been confused for hours by the Cassandra docs trying to figure out how to use it Bigtable-style, knowing after reading the paper that it was apparently inspired by Bigtable. Now, I see that they want devs to think in SQL rows instead of the rows as they're stored. I'm already familiar with Bigtable, so this abstraction is annoying to me. But, I can see how devs unfamiliar with Bigtable might appreciate it. It would also point them towards a stronger schema. They'd need a column named "property_name" etc to get a truly schema-less table.Precocity

© 2022 - 2024 — McMap. All rights reserved.