I'm struggling with the data modelling for the use cases I'm trying to satisfy. I looked at this and this and even this but they're not exactly what I needed.
I have this basic table:
CREATE TABLE documents (
itemid_version text,
xml_payload text,
insert_time timestamp,
PRIMARY KEY (itemid_version)
);
itemid
is actually a UUID (and unique for all documents), and version
is an int (version 0 is the "first" version). xml_payload
is the full XML doc, and can get quite big. Yes, I'm essentially creating a versioned document store.
As you can see, I concatenated the two to create a primary key and I'll get to why I did this later as I explain the requirements and/or use cases:
- Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)
- Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version
- Use-Case 3: user needs the version history of a single (1) doc
- Use-Case 4: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)
I will be writing the client code that will perform the use cases, please excuse the syntax as I'm trying to be language-agnostic.
First one's straightforward:
$itemid_version = concat($itemid, $version)
$doc = csql("select * from documents where itemid_version = {0};"
-f $itemid_version)
Now to satisfy the 2nd and 3rd use cases, I am adding the following table:
CREATE TABLE document_versions (
itemid uuid,
version int,
PRIMARY KEY (itemid, version)
) WITH clustering order by (version DESC);
The new records will be added as new docs and new versions of existing docs are created.
Now we have this (use case #2):
$latest_itemid, $latest_version = csql("select itemid,
version from document_versions where item_id = {0}
order by version DESC limit 1;" -f $itemid)
$itemid_version = concat($latest_itemid, $latest_version)
$doc = csql("select * from documents where itemid_version = {0};"
-f $itemid_version)
and this (use case #3):
$versions = csql("select version from document_versions where item_id = {0}"
-f $itemid)
For the 3rd requirement, I am adding yet another table:
CREATE TABLE latest_documents (
itemid uuid,
version int,
PRIMARY KEY (itemid, version)
)
The records are inserted for new docs, records are updated for existing docs.
And now we have this:
$latest_itemids, $latest_versions = csql("select itemid, version
from latest_documents where item_id in ({0})" -f $itemid_list.toCSV())
foreach ($one_itemid in $latest_itemids, $one_version in $latest_versions)
$itemid_version = concat($latest_itemid, $latest_version)
$latest_docs.append(
cql("select * from documents where itemid_version = {0};"
-f $itemid_version))
Now I hope it's clear why I concatenated itemid
and version
to create an index for documents
as opposed to creating a compound key: I cannot have OR
in the WHERE
clause in SELECT
.
You can assume that only one process will do the inserts/updates so you don't need to worry about consistency or isolation issues.
Am I on the right track here? There are quite a number of things that doesn't sit well with me...but mainly because I don't understand Cassandra yet:
- I feel that the primary key for
documents
should be a composite of (itemid, version) but I can't satisfy use case #4 (return a list from a query)...I can't possibly use a separate SELECT statement for each document due to the performance hit (network overhead)...or can (should) I? - It requires 2 trips to get a document if the version is not known beforehand. This is probably a compromise I have to live with, unless maybe there's a better way.