Python list intersection efficiency: generator or filter()?
Asked Answered
M

4

16

I would like to intersect two lists in Python (2.7). I need the result to be iterable:

list1 = [1,2,3,4]
list2 = [3,4,5,6]
result = (3,4) # any kind of iterable

Providing a full iteration will be performed first thing after the intersection, which of the following is more efficient?

Using a generator:

result = (x for x in list1 if x in list2)

Using filter():

result = filter(lambda x: x in list2, list1)

Other suggestions?

Thanks in advance,
Amnon

Mailand answered 16/6, 2011 at 9:13 Comment(0)
A
32

Neither of these. The best way is to use sets.

list1 = [1,2,3,4]
list2 = [3,4,5,6]
result = set(list1).intersection(list2)

Sets are iterable, so no need to convert the result into anything.

Aylward answered 16/6, 2011 at 9:16 Comment(5)
interesting, set(list1).intersection(list2) is faster than set(list1) & set(list2), i guess it's because the creation of two sets is more expensive than loading and calling .intersection() hmm ..Involution
@Involution On my machine, set(list1) & set(list2) is faster than using .intersection(). But the difference is not very significant.Scriptural
does this require the lists to be sorted?Whichsoever
@Whichsoever The lists need not be sorted. Sets are implemented by hashing, so lookup occurs in amortized O(1) time regardless of location in the original list.Savannahsavant
Fast forward 11 years, in my case [x for x in a if x in b] seems for be faster when being executed in multiple loops. Would this be because a conversion from a set to a list would be consuming when being done in a loop?Pyrochemical
P
8

Your solution has a complexity of O(m*n), where m and n are the respective lengths of the two lists. You can improve the complexity to O(m+n) using a set for one of the lists:

s = set(list1)
result = [x for x in list2 if x in s]

In cases where speed matters more than readability (that is, almost never), you can also use

result = filter(set(a).__contains__, b)

which is about 20 percent faster than the other solutions on my machine.

Pibgorn answered 16/6, 2011 at 9:16 Comment(0)
A
7

I tried to compare the speed of 3 methods of list intersection:

import random

a = [random.randint(0, 1000) for _ in range(1000)]
b = [random.randint(0, 1000) for _ in range(1000)]

Solution 1: list comprehension

Time elapse: 8.95265507698059

import time
start = time.time()
for _ in range(1000):
    result = [x for x in a if x in b]
elapse = time.time() - start
print(elapse) 

Solution 2: set

Time elapse: 0.09089064598083496

start = time.time()
for _ in range(1000):
    result = set.intersection(set(a), set(b))
elapse = time.time() - start
print(elapse) 

Solution 3: numpy.intersect1d

Time elapse: 0.323300838470459

start = time.time()
for _ in range(1000):
    result = np.intersect1d(a, b)
elapse = time.time() - start
print(elapse) 

Conclusion

I think use set.intersection is the fastest way.

Anatol answered 17/11, 2019 at 4:22 Comment(0)
D
2

for the case of lists, the most efficient way is to use:

result = set(list1).intersection(list2)

as mentioned, but for numpy arrays, intersection1d function is more efficient:

import numpy as np
result = np.intersection1d(list1, list2)

Especially, when you know that the lists don't have duplicate values, you can use it as:

result = np.intersection1d(list1, list2, assume_unique=True)
Dugger answered 14/7, 2017 at 18:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.