Disclaimer:
This is quite a long post. I first explain the data I am dealing with, and what I want to do with it.
Then I detail three possible solutions I have considered, because I've tried to do my homework (I swear :]). I end up with a "best guess" which is a variation of the first solution.
My ultimate question is: what's the most sensible way to solve my problem using Cassandra? Is it one of my attempts, or is it something else?
I am looking for advice/feedback from experienced Cassandra users...
My data:
I have many SuperDocuments that own Documents in a tree structure (headings, subheadings, sections, …).
Each SuperDocument structure can change (renaming of headings mostly) over time, thus giving me multiple versions of the structure as shown below.
What I'm looking for:
For each SuperDocument I need to timestamp those structures by date as above and I'd like, for a given date, to find the closest earlier version of the SuperDocument structure. (ie. the most recent version for which version_date < given_date
)
These considerations might help solving the problem more easily:
- Versions are immutable: changes are rare enough, I can create a new representation of the whole structure each time it changes.
- I do not need to access a subtree of the structure.
- I'd say it is OK to say that I do not need to find all the ancestors of a given leaf, nor do I need to access a specific node/leaf inside the tree. I can work all of this out in my client code once I have the whole tree.
OK let's do it
Please keep in mind I am really just starting using Cassandra. I've read/watched a lot of resources about data modeling, but haven't got much (any!) experience in the field!
Which also means everything will be written in CQL3... sorry Thrift lovers!
My first attempt at solving this was to create the following table:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY ((doc_id, version_date), pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (pre_pos ASC);
That would give me the following structure:
I'm using a Nested Sets model for my trees here; I figured it would work well to keep the structure ordered, but I am open to other suggestions.
I like this solution: each version has its own row, in which each column represents a level of the hierarchy.
The problem though is that I (candidly) intended to query my data as follows:
SELECT * FROM superdoc_structures
WHERE doc_id="3399c35...14e1" AND version_date < '2014-03-11' LIMIT 1
Cassandra quickly reminded me I was not allowed to do that! (because the partitioner does not preserve row order on the cluster nodes, so it is not possible to scan through partition keys)
What then...?
Well, because Cassandra won't let me use inequalities on partition keys, so be it!
I'll make version_date
a clustering key and all my problems will be gone. Yeah, not really...
First try:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
pre_pos int,
post_pos int,
title text,
PRIMARY KEY (doc_id, version_date, pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (version_date DESC, pre_pos ASC);
I find this one less elegant: all versions and structure levels are made into columns of a now very wide row (compared to my previous solution):
Problem: with the same request, using LIMIT 1
will only return the first heading. And using no LIMIT
would return all versions structure levels, which I would have to filter to only keep the most recent ones.
Second try:
there's no second try yet... I have an idea though, but I feel it's not using Cassandra wisely.
The idea would be to cluster by version_date
only, and somehow store whole hierarchies in each column values. Sounds bad doesn't it?
I would do something like this:
CREATE TABLE IF NOT EXISTS superdoc_structures (
doc_id varchar,
version_date timestamp,
nested_sets map<int, int>,
titles list<text>,
PRIMARY KEY (doc_id, version_date)
) WITH CLUSTERING ORDER BY (version_date DESC);
The resulting row structure would then be:
It looks kind of all right to me in fact, but I will probably have more data than the level title to de-normalize into my columns. If it's only two attributes, I could go with another map (associating titles with ids for instance), but more data would lead to more lists, and I have the feeling it would quickly become an anti-pattern.
Plus, I'd have to merge all lists together in my client app when the data comes in!
ALTERNATIVE & BEST GUESS
After giving it some more thought, there's an "hybrid" solution that might work and may be efficient and elegant:
I could use another table that would list only the version dates of a SuperDocument & cache these dates into a Memcache instance (or Redis or whatever) for real quick access.
That would allow me to quickly find the version I need to fetch, and then request it using the composite key of my first solution.
That's two queries, plus a memory cache store to manage. But I may end up with one anyway, so maybe that'd be the best compromise?
Maybe I don't even need a cache store?
All in all, I really feel the first solution is the most elegant one to model my data. What about you?!