Data Versioning in Cassandra with CQL3
Asked Answered
T

2

6

I'm struggling with the data modelling for the use cases I'm trying to satisfy. I looked at this and this and even this but they're not exactly what I needed.

I have this basic table:

CREATE TABLE documents (
    itemid_version text,       
    xml_payload text,
    insert_time timestamp,
    PRIMARY KEY (itemid_version)
); 

itemid is actually a UUID (and unique for all documents), and version is an int (version 0 is the "first" version). xml_payload is the full XML doc, and can get quite big. Yes, I'm essentially creating a versioned document store.

As you can see, I concatenated the two to create a primary key and I'll get to why I did this later as I explain the requirements and/or use cases:

  • Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)
  • Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version
  • Use-Case 3: user needs the version history of a single (1) doc
  • Use-Case 4: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)

I will be writing the client code that will perform the use cases, please excuse the syntax as I'm trying to be language-agnostic.

First one's straightforward:

$itemid_version = concat($itemid, $version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

Now to satisfy the 2nd and 3rd use cases, I am adding the following table:

CREATE TABLE document_versions (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
) WITH clustering order by (version DESC);

The new records will be added as new docs and new versions of existing docs are created.

Now we have this (use case #2):

$latest_itemid, $latest_version = csql("select itemid, 
    version from document_versions where item_id = {0} 
    order by version DESC limit 1;" -f $itemid)
$itemid_version = concat($latest_itemid, $latest_version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

and this (use case #3):

$versions = csql("select version from document_versions where item_id = {0}" 
    -f $itemid)

For the 3rd requirement, I am adding yet another table:

CREATE TABLE latest_documents (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
)

The records are inserted for new docs, records are updated for existing docs.

And now we have this:

$latest_itemids, $latest_versions = csql("select itemid, version 
    from latest_documents where item_id in ({0})" -f $itemid_list.toCSV())

foreach ($one_itemid in $latest_itemids, $one_version in $latest_versions)
    $itemid_version = concat($latest_itemid, $latest_version)
    $latest_docs.append(
        cql("select * from documents where itemid_version = {0};" 
        -f $itemid_version))        

Now I hope it's clear why I concatenated itemid and version to create an index for documents as opposed to creating a compound key: I cannot have OR in the WHERE clause in SELECT.

You can assume that only one process will do the inserts/updates so you don't need to worry about consistency or isolation issues.

Am I on the right track here? There are quite a number of things that doesn't sit well with me...but mainly because I don't understand Cassandra yet:

  • I feel that the primary key for documents should be a composite of (itemid, version) but I can't satisfy use case #4 (return a list from a query)...I can't possibly use a separate SELECT statement for each document due to the performance hit (network overhead)...or can (should) I?
  • It requires 2 trips to get a document if the version is not known beforehand. This is probably a compromise I have to live with, unless maybe there's a better way.
Trochilus answered 21/5, 2014 at 14:12 Comment(0)
L
4

This is actually very similar to your solution except that you can store all the versions and be able to fetch the 'latest' version just from one table (document_versions).

In most cases, I think you can get what you want in a single SELECT except use case #2 which requires fetching the most recent version of a document where a SELECT is first needed on document_versions.

CREATE TABLE documents (
        itemid_version text,
        xml_payload text,
        insert_time timestamp,
        PRIMARY KEY (itemid_version)
);

CREATE TABLE document_versions (
        itemid text,
        version int,
        PRIMARY KEY (itemid, version)
) WITH CLUSTERING ORDER BY (version DESC);


INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');

INSERT INTO document_versions (itemid, version) VALUES ('doc1', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc1', 2);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 2);
  • Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)

    SELECT * FROM documents WHERE itemid_version = 'doc1-2';
    
  • Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version (you need to feed the concatenated itemid + version from the result of the first query into the second query)

    SELECT * FROM document_versions WHERE itemid = 'doc2' LIMIT 1;
    
    SELECT * FROM documents WHERE itemid_version = 'doc2-2'; 
    
  • Use-Case 3: user needs the version history of a single (1) doc

    SELECT * FROM document_versions WHERE itemid = 'doc2';
    
  • Use-Case 3: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)

    SELECT * FROM documents WHERE itemid_version IN ('doc1-2', 'doc2-1');
    
Loading answered 21/5, 2014 at 16:4 Comment(5)
thanks @reggoodwin...as for use case #4, i know it's the use case that throws a monkey wrench here as your solution would have worked for me if i didn't have that requirement. basically, for use case #4 it boils down to "entitlements"--i.e., some users are only entitled up to a certain version of a doc (again, specific users may not be entitled to the latest version).Trochilus
actually, for this approach SELECT * FROM documents where itemid IN ('doc1', 'doc2'); returns ALL versions of the documents, not just the latest.Trochilus
Dexter, thanks for flagging up problem with my use case #4 example. I had not tested that last query before adding some final INSERTs. I've now re-submitted a different code example that I hope works better. It is quite similar to your solution actually.Loading
Actually this approach is a Cassandra antipattern since you are trying to minimize multiple reads. The first rule of thumb is that each table should be constructcted to serve a specific query.Genny
Any chance I can use the key-value system in tabular data to version it(data) using Cassandra? My aim is to provide user the insert, delete and update patch each time the data comes in and the version is to be formed. Note: these patches are the candidate version compared to current latest version.Andalusite
G
5

Let's see if we can come up with a model in a top-down fashion starting from your queries:

CREATE TABLE document_versions (
  itemid uuid,
  name text STATIC,
  version int,   
  xml_payload text,
  insert_time timestamp,
  PRIMARY KEY ((itemid), version)
) WITH CLUSTERING ORDER BY (version DESC);
  • Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)

    SELECT * FROM document_versions 
      WHERE itemid = ? and version = ?;
    
  • Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version

    SELECT * FROM document_versions
      WHERE itemid = ? limit 1;
    
  • Use-Case 3: user needs the version history of a single (1) doc

    SELECT * FROM document_versions 
      WHERE itemid = ?
    
  • Use-Case 4: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)

    SELECT * FROM documents 
      WHERE itemid = 'doc1' and version IN ('1', '2');
     ```
    
    

One table for all these queries is the correct approach.

I would suggest taking the Datastax free online course: DS220 Data Modeling

Genny answered 14/11, 2017 at 17:3 Comment(1)
indeed, this is probably the better answer for Cassandra v2.2 and above since it added support for IN clause for any partition field. note that i asked the question well over a year before v2.2 release.Trochilus
L
4

This is actually very similar to your solution except that you can store all the versions and be able to fetch the 'latest' version just from one table (document_versions).

In most cases, I think you can get what you want in a single SELECT except use case #2 which requires fetching the most recent version of a document where a SELECT is first needed on document_versions.

CREATE TABLE documents (
        itemid_version text,
        xml_payload text,
        insert_time timestamp,
        PRIMARY KEY (itemid_version)
);

CREATE TABLE document_versions (
        itemid text,
        version int,
        PRIMARY KEY (itemid, version)
) WITH CLUSTERING ORDER BY (version DESC);


INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');

INSERT INTO document_versions (itemid, version) VALUES ('doc1', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc1', 2);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 2);
  • Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)

    SELECT * FROM documents WHERE itemid_version = 'doc1-2';
    
  • Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version (you need to feed the concatenated itemid + version from the result of the first query into the second query)

    SELECT * FROM document_versions WHERE itemid = 'doc2' LIMIT 1;
    
    SELECT * FROM documents WHERE itemid_version = 'doc2-2'; 
    
  • Use-Case 3: user needs the version history of a single (1) doc

    SELECT * FROM document_versions WHERE itemid = 'doc2';
    
  • Use-Case 3: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)

    SELECT * FROM documents WHERE itemid_version IN ('doc1-2', 'doc2-1');
    
Loading answered 21/5, 2014 at 16:4 Comment(5)
thanks @reggoodwin...as for use case #4, i know it's the use case that throws a monkey wrench here as your solution would have worked for me if i didn't have that requirement. basically, for use case #4 it boils down to "entitlements"--i.e., some users are only entitled up to a certain version of a doc (again, specific users may not be entitled to the latest version).Trochilus
actually, for this approach SELECT * FROM documents where itemid IN ('doc1', 'doc2'); returns ALL versions of the documents, not just the latest.Trochilus
Dexter, thanks for flagging up problem with my use case #4 example. I had not tested that last query before adding some final INSERTs. I've now re-submitted a different code example that I hope works better. It is quite similar to your solution actually.Loading
Actually this approach is a Cassandra antipattern since you are trying to minimize multiple reads. The first rule of thumb is that each table should be constructcted to serve a specific query.Genny
Any chance I can use the key-value system in tabular data to version it(data) using Cassandra? My aim is to provide user the insert, delete and update patch each time the data comes in and the version is to be formed. Note: these patches are the candidate version compared to current latest version.Andalusite

© 2022 - 2024 — McMap. All rights reserved.