How to use MongoDB aggregation for general purpose set operations (union, intersection, difference)
Asked Answered
L

2

11

I have come across some special purpose implementation of set operations, but nothing for the general case. What is the general case for performing set operations (specifically intersection, union, symmetric difference). This is easier to figure out using javascript in a $where or map reduce, but I want to know how to do this in aggregation in order to get native performance.

The better way to illustrate this question is with an example. Say I have a record with 2 arrays/sets:

db.colors.insert({
    _id: 1,
    left : ['red', 'green'],
    right : ['green', 'blue']
});

I want to find the union, intersection and difference of the 'left' and 'right' arrays. Even better, pictorially I want to find:

Union --> ['red', 'green', 'blue']

union

Intersection --> ['green']

enter image description here

Symmetric Difference --> ['red', 'blue']

enter image description here

Lh answered 24/6, 2013 at 5:52 Comment(7)
This question was inspired from this answer. I found several specific cases, but not the general set cases I was looking for.Lh
possible duplicate of Use MongoDB aggregation to find set intersection of two sets within the same documentStruve
@WiredPrairie, given that the one you linked to is mine, I can assure you they are not duplicate questions. The one you linked to is a very specific case where a subset is found. This is a broader more general case question around intersection, unions and differences.Lh
I'm still leaving it as a close as this is better as a wiki article or a blog post than a really long Stack Overflow question and answer, especially as there really isn't a good answer as was already mentioned (and this really is 3 questions).Struve
@WiredPrairie, that seems fair. What I'd ask is, what is the criteria for making a community wiki? I agree with you that it's better as a wiki, which luckily SO supports. I'd actually think that it would be better to close the other question as, personally, I think this could be of more use.Lh
Not sure why someone down voted. You can vote to close if you think it's a dup, I do not, but it's still a good question.Lh
Voting on StackOverflow is one of the many mysteries in life that may never be fully understood. :)Struve
L
4

Version 2.6+ Only:

As of version 2.6 of MongoDB, this has become much much easier. You can now do the following to solve this problem:

Union

db.colors.aggregate([
    {'$project': {  
                    union:{$setUnion:["$left","$right"]}
                 }
    }
]);

Intersection

db.colors.aggregate([
    {'$project': {  
                  int:{$setIntersection:["$left","$right"]}
                 }
    }
]);

Relative Complement

db.colors.aggregate([
    {'$project': {  
                    diff:{$setDifference:["$left","$right"]}
                 }
    }
]);

Symmetric Difference

db.colors.aggregate([
    {'$project': {  
                    diff:{$setUnion:[{$setDifference:["$left","$right"]}, {$setDifference:["$right","$left"]}]}
                 }
    }
]);

Note: There is a ticket requesting symmetric difference be added as a core feature rather than having to do the union of two set differences.

Lh answered 23/9, 2013 at 23:41 Comment(0)
L
3

The easiest of these three using aggregation is the intersection**. The general case for that can be done using aggregation like so:

Intersection:

db.colors.aggregate([
    {'$unwind' : "$left"},
    {'$unwind' : "$right"},
    {'$project': {  
                    value:"$left", 
                    same:{$cond:[{$eq:["$left","$right"]}, 1, 0]}
                 }
    },
    {'$group'  : { 
                    _id: {id:'$_id', val:'$value'}, 
                    doesMatch:{$max:"$same"}
                 }
    },
    {'$match'   :{doesMatch:1}},
]);

The other two become a bit more tricky. To my knowledge there isn't a way of combining two separate fields in the same document together. It would be nice to have an $add, $combine, or $addToSet in the $project pipeline phase, but this doesn't exist. So the best we can do is say if something has intersected or not. We can start both aggregations with the following:

db.colors.aggregate([
    {'$unwind' : "$left"},
    {'$unwind' : "$right"},
    {'$project': {  
                    left:"$left",
                    right:'$right',
                    same:{$cond:[{$eq:["$left","$right"]}, 1, 0]}
                 }
    },
    {'$group'  : {
                    _id:{id:'$_id', left:'$left'},
                    right:{'$addToSet':'$right'},
                    sum: {'$sum':'$same'},
                 }
    },
    {'$project': {  
                    left:{val:"$_id.left",inter:"$sum"},
                    right:'$right',
                 }
    },
    {'$unwind' : "$right"},
    {'$project': {  
                    left:"$left",
                    right:'$right',
                    same:{$cond:[{$eq:["$left.val","$right"]}, 1, 0]}
                 }
    },
    {'$group'  : {
                    _id:{id:'$_id.id', right:'$right'},
                    left:{'$addToSet':'$left'},
                    sum: {'$sum':'$same'},
                 }
    },
    {'$project': {  
                    right:{val:"$_id.right",inter:"$sum"},
                    left:'$left',
                 }
    },
    {'$unwind' : "$left"},
    {'$group'  : {
                    _id:'$_id.id',
                    left:{'$addToSet':'$left'},
                    right: {'$addToSet':'$right'},
                 }
    },
]);

This aggregation on the sample provided in the question will give a result like this:

{
        "_id" : 1,
        "left" : [
                {
                        "val" : "green",
                        "inter" : 1
                },
                {
                        "val" : "red",
                        "inter" : 0
                }
        ],
        "right" : [
                {
                        "val" : "blue",
                        "inter" : 0
                },
                {
                        "val" : "green",
                        "inter" : 1
                }
        ]
}

From here we can get the intersection by adding the following to the aggregation:

{'$project': {  
                    left:"$left"
                 }
    },
    {'$unwind' : "$left"},
    {'$match'  : {'left.inter': 1}},
    {'$group'  : {
                    _id:'$_id',
                    left:{'$addToSet':'$left'},
                 }
    },

We can find the difference as well as the relative complement by adding the following to the end of the base aggregation:

enter image description here

{'$unwind' : "$left"},
    {'$match'  : {'left.inter': 0}},
    {'$unwind' : "$right"},
    {'$match'  : {'right.inter': 0}},
    {'$group'  : {
                    _id:'$_id',
                    left:{'$addToSet':'$left'},
                    right:{'$addToSet':'$right'},
                 }
    },

Unfortunately there does not appear to be a good way to combine dissimilar items from different fields together. In order to get the union, it seems best to do that from the client. Or if you want filtering, do it on each set individually.

Lh answered 24/6, 2013 at 5:52 Comment(4)
seriously? I pointed you to an answer to "union", gave you the answer for relative complement and intersection. is there really a need for another question on this?Woodie
@Asya, this is no knock on you. I gave you credit in my comments above. The reason I added this was that those others were more specific cases. The first being where you want to parse out where there is an intersection. The second case you helped me find the left complement. I think there is a use for another question because the others were very specific to a use case. I'm trying to help the community, not take away any credit from the answers you posted. This question servers better as community wiki type. You should be happier, I'm helping your company and paying it.Lh
When possible, this is the type of thing I'd precompute and store. I wouldn't want to encounter an aggregation pipeline like this in a production system.Struve
I agree, the use case where I was looking to use it was in place of a $where. For my case at least, I had a schema that is tuned for day to day tasks. The schema isn't as good for answering some one off questions. I seemed reasonable to try this. I guess the counter could be if you're doing ad hoc anyway and real time performance doesn't matter, just use $where or map reduce and be done.Lh

© 2022 - 2024 — McMap. All rights reserved.