Efficiently finding the intersection of a variable number of sets of strings
Asked Answered
O

8

47

I have a variable number of ArrayList's that I need to find the intersection of. A realistic cap on the number of sets of strings is probably around 35 but could be more. I don't want any code, just ideas on what could be efficient. I have an implementation that I'm about to start coding but want to hear some other ideas.

Currently, just thinking about my solution, it looks like I should have an asymptotic run-time of Θ(n2).

Thanks for any help!

tshred

Edit: To clarify, I really just want to know is there a faster way to do it. Faster than Θ(n2).

Ovine answered 17/5, 2010 at 19:3 Comment(2)
Thanks for the help everyone! The strings are actually inside objects in an already existing array list, this is why I was leaving them in the arrays. I've never had to use the Java collections classes being mentioned but will definitely use them. I appreciate the recommendations. Problem solved.Ovine
See also: How to calculate the intersection of two sets?Tripper
D
64

Set.retainAll() is how you find the intersection of two sets. If you use HashSet, then converting your ArrayLists to Sets and using retainAll() in a loop over all of them is actually O(n).

Dunstan answered 17/5, 2010 at 19:11 Comment(2)
You only have to wrap one of the lists in a set.Trevethick
It is only expected to be in O(n). It is not the worst case!Roup
D
39

The accepted answer is just fine; as an update : since Java 8 there is a slightly more efficient way to find the intersection of two Sets.

Set<String> intersection = set1.stream()
    .filter(set2::contains)
    .collect(Collectors.toSet());

The reason it is slightly more efficient is because the original approach had to add elements of set1 it then had to remove again if they weren't in set2. This approach only adds to the result set what needs to be in there.

Strictly speaking you could do this pre Java 8 as well, but without Streams the code would have been quite a bit more laborious.

If both sets differ considerably in size, you would prefer streaming over the smaller one.

Dalrymple answered 6/10, 2016 at 17:56 Comment(7)
Good note no streaming over the smaller one. It is because the streamed one is iterated, while the other (larger) set is searched (by hash for a HashSet, which is O(1)).Harriott
@Dalrymple Correct me if I'm wrong but I believe changing the .collect(Collectors.toSet()) to .forEach(e -> ...) would be the ideal way to find the intersection of the two sets WITHOUT creating a copy of either set, or a reference to the intersection. In other words if a = { 1, 2, 3 } and b = { 4, 5, 6 } and forEach(e -> ...) was called, there would only ever be seven element references existing at one time. Three from set a, three from set b, and one from the callback variable e.Headwind
@Dalrymple Also to add onto what @Ondra Žižka mentioned, surely you would first check which of set1 or set2 has the smallest number of elements, and that set would be the one which is streamed. The remaining set would be the one referenced with contains.Headwind
@Headwind if contextually you know the sets will be relatively small, or always comparable in size, such a check would not be needed. And I say as much in my answer.Dalrymple
@Dalrymple Despite something like this rarely being the cause of a bottleneck, mistakes like this all over a codebase absolutely will be. It's unfair to dismiss the criticism, especially to those who may be viewing this who are new to programming and are looking to build good coding habits. This takes a single if statement to fix if (set1.size() > set2.size()) ... else ....Headwind
@Headwind well, I think increasing cyclomatic complexity for something that won't gain you any performance 99% of the time, isn't a good coding habit.Dalrymple
@Headwind a small example : we have a set with all items on sale, and a set with the items in the customers basket. I'd not bother with an if. Contextually I know that most of the time the basket set will be the smaller. I will always stream over the basket, even though sometimes it may be larger than the on-sale set. If I have a set of tweets mentioning one subject, and another set mentioning another subject both possibly counting in the millions, but both possibly being very small, yes, then I'd check.Dalrymple
F
19

There is also the static method Sets.intersection(set1, set2) in Google Guava that returns an unmodifiable view of the intersection of two sets.

Fabriane answered 21/4, 2015 at 11:30 Comment(1)
Because it's a view I remember this having the unfortunate side effect of repeating intersections, so need to be careful about use case. if (intersetionSet.size() < 1000) { doSomethingWith(intersetionSet); }Mort
L
7

One more idea - if your arrays/sets are different sizes, it makes sense to begin with the smallest.

Legged answered 17/5, 2010 at 19:20 Comment(0)
R
3

The best option would be to use HashSet to store the contents of these lists instead of ArrayList. If you can do that, you can create a temporary HashSet to which you add the elements to be intersected (use the putAll(..) method). Do tempSet.retainAll(storedSet) and tempSet will contain the intersection.

Rattling answered 17/5, 2010 at 19:12 Comment(0)
G
0

Sort them (n lg n) and then do binary searches (lg n).

Get answered 17/5, 2010 at 19:10 Comment(0)
T
0

You can use single HashSet. It's add() method returns false when the object is alredy in set. adding objects from the lists and marking counts of false return values will give you union in the set + data for histogram (and the objects that have count+1 equal to list count are your intersection). If you throw the counts to TreeSet, you can detect empty intersection early.

Threecornered answered 17/5, 2010 at 20:4 Comment(0)
P
0

In case that is required the state if 2 set has intersection, I use the next snippet on Java 8+ versions code:

set1.stream().anyMatch(set2::contains)
Pyridoxine answered 23/4, 2021 at 12:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.