Use MongoDB aggregation to find set intersection of two sets within the same document

Asked 23/6, 2013 at 18:38 Answered 24/9, 2013 at 0:24

Solved mongodb aggregation-framework set-intersection

I'm trying to use the Mongo aggregation framework to find where there are records that have different unique sets within the same document. An example will best explain this:

Here is a document that is not my real data, but conceptually the same:

db.house.insert(
{
    houseId : 123,
    rooms: [{ name : 'bedroom',
          owns : [
            {name : 'bed'},
            {name : 'cabinet'}
        ]},
        { name : 'kitchen',
          owns : [
            {name : 'sink'},
            {name : 'cabinet'}
        ]}],
       uses : [{name : 'sink'},
               {name : 'cabinet'},
               {name : 'bed'},
               {name : 'sofa'}]
}
)

Notice that there are two hierarchies with similar items. It is also possible to use items that are not owned. I want to find documents like this one: where there is a house that uses something that it doesn't own.

So far I've built up the structure using the aggregate framework like below. This gets me to 2 sets of distinct items. However I haven't been able to find anything that could give me the result of a set intersection. Note that a simple count of set size will not work due to something like this: ['couch', 'cabinet'] compare to ['sofa', 'cabinet'].

{'$unwind':'$uses'}
{'$unwind':'$rooms'}
{'$unwind':'$rooms.owns'}
{'$group' : {_id:'$houseId', 
             use:{'$addToSet':'$uses.name'}, 
             own:{'$addToSet':'$rooms.owns.name'}}}

produces:

{ _id : 123,
  use : ['sink', 'cabinet', 'bed', 'sofa'],
  own : ['bed', 'cabinet', 'sink']
}

How do I then find the set intersection of use and own in the next stage of the pipeline?

Amphi answered 23/6, 2013 at 18:38 Comment(7)

I tried using a $match with a $where, but it appears that this is not allowed. That means I can't even compare the size of the two. – Amphi 23/6, 2013 at 19:19

Have you considered using a MapReduce instead of aggregation (I ask because I can't see an effective way of doing an intersection between the two arrays). – Corvine 23/6, 2013 at 19:35

Ya, I did consider it (and it would certainly work) but wanted to use aggregation if possible. Map reduce doesn't provide a result back, but only dumps data to another collection. This doesn't work so well for more on-demand type needs. – Amphi 23/6, 2013 at 19:48

you can do this with aggregation framework but it's a bit ugly. I describe a possible solution here: #13521759 – Yoshikoyoshio 23/6, 2013 at 21:40

btw mapreduce does provide result back if you specify {inline:1} as your "out" parameter - but aggregation framework will be faster (and by a lot). – Yoshikoyoshio 23/6, 2013 at 21:40

you're not kidding @AsyaKamsky. I think for this use case, it may actually be more apparent to someone reading the code to use the $where solution. The difference between this one and the one you linked is that this case isn't really an aggregate, just an elaborate filter. I'm not really combining any records in the end, just filtering out. Given that, it may not make as much sense to use aggregation or map reduce. – Amphi 23/6, 2013 at 22:15

how about this - I will post the aggregation framework solution as an answer and you can compare it to using $where both for readability and for performance ($where is not performant due to the fact that it needs to run Javascript, same as MapReduce). – Yoshikoyoshio 23/6, 2013 at 23:15

For MongoDB 2.6+ Only

As of MongoDB 2.6, there are set operations available in the project pipeline stage. The way to answer this problem with the new operations is:

db.house.aggregate([
    {'$unwind':'$uses'},
    {'$unwind':'$rooms'},
    {'$unwind':'$rooms.owns'},
    {'$group' : {_id:'$houseId', 
             use:{'$addToSet':'$uses.name'}, 
             own:{'$addToSet':'$rooms.owns.name'}}},
    {'$project': {int:{$setIntersection:["$use","$own"]}}}
]);

Amphi answered 24/9, 2013 at 0:24 Comment(0)

You were not very far from the full solution with aggregation framework - you needed one more thing before the $group step and that is something that would allow you to see if all the things that are being used match up with something that is owned.

Here is the full pipeline

> db.house.aggregate(
       {'$unwind':'$uses'}, 
       {'$unwind':'$rooms'}, 
       {'$unwind':'$rooms.owns'}, 
       {$project:  { _id:0, 
                     houseId:1, 
                     uses:"$uses.name", 
                     isOkay:{$cond:[{$eq:["$uses.name","$rooms.owns.name"]}, 1, 0]}
                   }
       }, 
       {$group: { _id:{house:"$houseId",item:"$uses"}, 
                  hasWhatHeUses:{$sum:"$isOkay"}
                }
       },
       {$match:{hasWhatHeUses:0}})

and its output on your document

{
    "result" : [
        {
            "_id" : {
                "house" : 123,
                "item" : "sofa"
            },
            "hasWhatHeUses" : 0
        }
    ],
    "ok" : 1
}

Explanation - once you unwrap both arrays you now want to flag the elements where used item is equal to owned item and give them a non-0 "score". Now when you regroup things back by houseId you can check if any used items didn't get a match. Using 1 and 0 for score allows you to do a sum and now a match for item which has sum 0 means it was used but didn't match anything in "owned". Hope you enjoyed this!

Yoshikoyoshio answered 23/6, 2013 at 23:41 Comment(2)

btw, I could have used $max just as easily as $sum, the key is that for unowned item it will only have zeros as "isOkay". – Yoshikoyoshio 23/6, 2013 at 23:42

very clever, I hadn't noticed $cond in the documentation. I'll adapt to my data structure and post the difference in timing between this and the $where solution. – Amphi 24/6, 2013 at 0:43

So here is a solution not using the aggregation framework. This uses the $where operator and javascript. This feels much more clunky to me, but it seems to work so I wanted to put it out there if anyone else comes across this question.

db.houses.find({'$where': 
    function() { 
        var ownSet = {};
        var useSet = {};
        for (var i=0;i<obj.uses.length;i++){ 
            useSet[obj.uses[i].name] = true;
        }
        for (var i=0;i<obj.rooms.length;i++){ 
            var room = obj.rooms[i];
            for (var j=0;j<room.owns.length;j++){
                ownSet[room.owns[j].name] = true;
            }
        }
        for (var prop in ownSet) {
            if (ownSet.hasOwnProperty(prop)) {
                if (!useSet[prop]){
                    return true;
                }
            }
        }
        for (var prop in useSet) {
            if (useSet.hasOwnProperty(prop)) {
                if (!ownSet[prop]){
                    return true;
                }
            }
        }
        return false
    }
})

Amphi answered 23/6, 2013 at 19:59 Comment(0)

For MongoDB 2.6+ Only

As of MongoDB 2.6, there are set operations available in the project pipeline stage. The way to answer this problem with the new operations is:

db.house.aggregate([
    {'$unwind':'$uses'},
    {'$unwind':'$rooms'},
    {'$unwind':'$rooms.owns'},
    {'$group' : {_id:'$houseId', 
             use:{'$addToSet':'$uses.name'}, 
             own:{'$addToSet':'$rooms.owns.name'}}},
    {'$project': {int:{$setIntersection:["$use","$own"]}}}
]);

Amphi answered 24/9, 2013 at 0:24 Comment(0)

Recommended topics

Hot tags