Great question. You identify a few very important reasons that using a "normalized" data model (different document types with links) is an optimal model:
- You have a many-to-many relationship between users <==> lists <==> games.
- One-to-many relationships are easy to represent in a single document that uses a container for the 'many' portion, but they get large and you can have concurrency conflicts.
- Extending the single-doc model to store a many-to-many relationship is untenable.
- In general, document immutability is a great fit for concurrent systems. In CouchDB you do that exactly as you noted, by storing 'write-once' documents that represent an edge in your graph, then using secondary indexes to rebuild the portions of the links that you want and to get the information you want in a single API query call.
You're also right that the solution here is a 'map-side-join' (to borrow from the hadoop community). Basically you want to use different rows in the map output to represent different pieces of information. You can then use a range query (startkey/endkey) to query just the portion of the map result that you need, and, voila, your materialized view of the 'join' table. However, the one piece of the puzzle you didn't find in the documentation is this:
3.2.3.1. Linked Documents
If your map function emits an object value which has {'_id': XXX}
and you query view with include_docs=true
parameter, then CouchDB will fetch the document with id XXX
rather than the document which was processed to emit the key/value pair.
Says it all. That's how you dereference the pointer to the linked document that you have stored by a foreign key. You then combine that with the use of compound keys (keys that are JS arrays) and the view collation rules.
So that your view rows we be sorted like:
["list_1"], null
["list_1", "game"], {"_id":"game_1234"}
["list_1", "game"], {"_id":"game_5678"}
["list_2"], null
["list_2","game"], {"_id":"game1234"}
["list_3"], null
...
Putting this together with your existing data model, here is some (untested) pseudocode that should do the trick:
function(doc) {
if (doc.type=="list") {
//this is the one in the one-to-many
emit( [doc._id]),);
}
else if (doc.type=="relationship") {
//this is the many in the one-to-many
//doc.list_id is our foreign key to the list. We use that as the key
//doc.game_id is the foreign key to the game. We use that as the value
emit( [doc.list_id,'game'], {'_id': doc.game_id});
}
}
Finally, you would query that with a startkey/endkey so that you get all rows that begin with the list_id that you're interested in. It would look something like:
curl -g 'https://usr:[email protected]/db/_design/design_doc_name/_view/view_name?startkey=["123"]&endkey=["123",{}]&include_docs=true'
The -g
option tells curl not to glob, which means you don't have to dereference your square brackets, etc, and the include_docs=true
option will follow the pointer to the foreign key that you specified with game_id
in the relationship
document.
Analysis:
- You are using essentially immutable documents to store state changes and you let the database compute the aggregate state for you. That is a lovely model at scale, and one of our most successful patterns.
- Very efficient for additions or deletions to lists.
- Excellent scaling properties under high-concurrency
- At Cloudant (and CouchDB v2.0) we don't yet have 'read-your-writes consistency for secondary indexes. It is high in the priority list, but there are potential corner cases where, in failure scenarios or high-load, you may not see immediate consistency between the primary and secondary indexes. Long story short, quorum is used for primary indexes, but quorum isn't a viable model for secondary indexes, so another consistency strategy is being developed.
if (doc.type=="list") {
beemit(doc._id,doc)
? – Jab