Python Sets vs Lists

11

265

In Python, which data structure is more efficient/speedy? Assuming that order is not important to me and I would be checking for duplicates anyway, is a Python set slower than a Python list?

Astto answered 14/5, 2010 at 0:55 Comment(1)
I have an immediate follow-up that might be worth answering in this question: if sets are indeed optimized for search (x in s), what about construction + search? It's relevant, for instance, in this case: which is faster, x in [a, b, c] or x in {a, b, c}? I.e., which should be our default go-to when constructing a set just to do a one-time search operation on it (as in an if clause)?Cyclopedia
308

It depends on what you are intending to do with it.

Sets are significantly faster when it comes to determining whether an object is present in the set (as in x in s), but their elements are not ordered, so you cannot access items by index as you would in a list. Sets are also somewhat slower to iterate over in practice.

You can use the timeit module to see which is faster for your situation.
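For example, a minimal timeit sketch along these lines (the collection size and search value below are purely illustrative, and absolute timings will vary by machine) compares a membership test on a list against the same test on a set:

from timeit import timeit

# A list membership test scans elements one by one (O(n) on average),
# while a set uses a hash lookup (O(1) on average).
list_time = timeit("9999 in data", setup="data = list(range(10000))", number=1000)
set_time = timeit("9999 in data", setup="data = set(range(10000))", number=1000)
print("list:", list_time)
print("set: ", set_time)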

Abutment answered 14/5, 2010 at 1:4 Comment(7)
For your point "Sets are significantly faster", what is the underlying implementation that makes it faster?Conventioner
Scripting languages like to hide the underlying implementations, but this apparent simplicity is not always a good thing; you do need some data-structure awareness when you design a piece of software.Gyrate
A set is not significantly slower than a list while iterating.Particulate
Sets and lists both have linear time iteration. To say that one is "slower" than the other is misguided and has confused new programmers who read this answer.Sized
@Sized if you are saying that they both have linear-time iteration, does that mean they have the same iteration time? What is the difference then?Shower
They both have a running-time complexity of O(n) when iterated, but in practice iterating a set is ~28% slower than iterating a list (a constant-factor difference, not a difference in complexity).Celt
To answer 'For your point: "Sets are significantly faster", what is the underlying implementation that makes it faster?': sets use hash functions to determine whether an element is present (O(1) complexity if the hash function is good, i.e. collisions are not common), while determining whether an element is in a list requires iterating through the list (O(n) complexity). Check this out for the time complexity of the methods on Python's default data structures.Glossographer
233

Lists are slightly faster than sets when you just want to iterate over the values.

Sets, however, are significantly faster than lists if you want to check whether an item is contained in them. They can only contain unique items, though.

It turns out tuples perform in almost exactly the same way as lists, except for their immutability.

Iterating

>>> def iter_test(iterable):
...     for i in iterable:
...         pass
...
>>> from timeit import timeit
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = set(range(10000))",
...     number=100000)
12.666952133178711
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = list(range(10000))",
...     number=100000)
9.917098999023438
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = tuple(range(10000))",
...     number=100000)
9.865639209747314

Determine if an object is present

>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> from timeit import timeit
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = set(range(1000))",
...     number=10000)
0.5591847896575928
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = list(range(1000))",
...     number=10000)
50.18339991569519
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = tuple(range(1000))",
...     number=10000)
51.597304821014404
Deceitful answered 30/7, 2013 at 10:51 Comment(2)
I have found that (Initializing set -> 5.5300979614257812) (Initializing list -> 1.8846848011016846) (Initializing tuple -> 1.8730108737945557) for items of size 10,000 on my Intel Core i5 quad core with 12 GB RAM. This should be taken into consideration as well.Scleritis
I've updated the code to remove the object creation now. The setup phase of the timeit loops is only called once (docs.python.org/2/library/timeit.html#timeit.Timer.timeit).Deceitful
25

Set wins due to near-instant 'contains' checks: https://en.wikipedia.org/wiki/Hash_table

List implementation: usually a dynamic array; low-level and close to the metal, good for iteration and for random access by element index.

Set implementation: https://en.wikipedia.org/wiki/Hash_table. It does not iterate over a list but finds the element by computing a hash from the key, so performance depends on the nature of the key elements and on the hash function. This is similar to what is used for dict. I suspect a list could be faster if you have very few elements (< 5); the larger the element count, the better the set will perform for a contains check. It is also fast for element addition and removal. Also, always keep in mind that building a set has a cost!
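As a quick illustration of the point about the key elements: set members must be hashable, so mutable objects such as lists cannot be added to a set at all (a small sketch, not part of the original answer):

s = {1, 2, 3}
print(hash(2))        # sets locate members via their hash values

try:
    s.add([4, 5])     # a list is mutable and therefore unhashable
except TypeError as err:
    print(err)        # unhashable type: 'list'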

NOTE: If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks.
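If you do want to exploit a pre-sorted list, the standard bisect module gives an O(log n) membership check, as one of the comments below also points out; a minimal sketch with illustrative values:

from bisect import bisect_left

def sorted_contains(sorted_list, value):
    # O(log n) membership test on an already-sorted list.
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

data = list(range(0, 1000000, 2))      # a sorted list of even numbers
print(sorted_contains(data, 123456))   # True
print(sorted_contains(data, 123457))   # False: odd numbers are not present
print(123457 in set(data))             # a set lookup is still O(1) on average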

Gyrate answered 2/8, 2016 at 14:35 Comment(6)
Close to the metal? What does that even mean in the context of Python? How is a list closer to the metal than a set?Porty
@roganjosh, Python still runs on a machine, and some implementations, like list as an 'array', are closer to what the hardware is good at: #176511. But it always depends on what you want to achieve; it is good to know a bit about the implementations, not just the abstractions.Gyrate
"If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks." To avoid confusion, you should probably make it clear that sorting only helps if you take advantage of the sorted order with something like the bisect module; a plain in check on a list is O(n) regardless of whether or not it's sorted, while in checks on set are O(1). The bisect module can get the test down to O(log n) on a pre-sorted list, but it's more complicated to use than a simple in check.Kinney
@Porty A list maps more directly to the instruction sets provided by modern CPUs. Ultimately, all Python code executes as machine instructions. A hash algorithm is coded to be fast, but it is more instructions than a list traversal on a modern CPU. When people consider big-O, they forget the constant factor. If 'n' is very small, then the constant can also dominate. However, I think it is wrong to say a list is an array; it is probably a linked list (in the 2nd sentence).Brnaby
Wouldn't it make sense to have a special set that only takes integers, so there would be no need for hashes? Would that kind of set be faster?Factious
@Factious in Java the function computing the hash key depends on the type/object, and for an integer it just returns the value itself. Maybe Python does something similar.Gyrate
11

List performance:

>>> import timeit
>>> timeit.timeit(stmt='10**6 in a', setup='a = list(range(10**6))', number=1000)
15.08

Set performance:

>>> timeit.timeit(stmt='10**6 in a', setup='a = set(range(10**6))', number=1000)
3.90e-05

You may want to consider tuples, as they're similar to lists but can't be modified. They take up slightly less memory and are faster to access. They aren't as flexible but are more efficient than lists. One common use is to serve as dictionary keys.
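For instance, here is a small sketch (the coordinates are made up for illustration) of a tuple serving as a dictionary key, which works because tuples are immutable and hashable:

locations = {
    (40.7128, -74.0060): "New York",
    (51.5074, -0.1278): "London",
}
print(locations[(51.5074, -0.1278)])   # London

# Using a list in the same position raises TypeError: unhashable type: 'list'.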

Sets differ from lists and tuples in two ways. First, sets are unordered collections, not sequences: to the extent they have an iteration order, that order is arbitrary and not under the programmer's control. Second, the elements in a set must be unique.

A set, by definition, contains only unique elements (see the Python docs and the Python wiki):

>>> x = set([1, 1, 2, 2, 3, 3])
>>> x
{1, 2, 3}
Pessimism answered 25/8, 2013 at 22:43 Comment(4)
First off, you should update to the set built-in type link (docs.python.org/2/library/stdtypes.html#set), not the deprecated sets library. Second, sets are not sequence structures; from the built-in type docs: "Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior."Megathere
range is not a list; range is a special class with a custom __contains__ magic method.Surfacetoair
@RyneWang this is true, but only for Python 3. In Python 2, range returns a normal list (that's why horrible things like xrange exist).Arrogate
This comparison was wrong as originally posted. I replicated it, and the first timeit was clearly run under Python 3 against the range object's __contains__ method, as mentioned by @RyneWang. It has been updated with an apt comparison.Acantho
9

tl;dr

Data structures (DS) are important because they are used to perform operations on data, which basically means: take some input, process it, and give back an output.

Some data structures are more useful than others in particular cases. Therefore, it is quite unfair to ask which DS is more efficient/speedy. It is like asking which tool is more efficient, a knife or a fork; it all depends on the situation.

Lists

A list is a mutable sequence, typically used to store collections of homogeneous items.

Sets

A set object is an unordered collection of distinct hashable objects. It is commonly used to test membership, remove duplicates from a sequence, and compute mathematical operations such as intersection, union, difference, and symmetric difference.
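A short sketch of those common uses, with made-up values for illustration:

emails = ["a@x.com", "b@x.com", "a@x.com", "c@x.com"]

unique_emails = set(emails)            # remove duplicates from a sequence
print(unique_emails)                   # iteration order is arbitrary

print("b@x.com" in unique_emails)      # membership test: True

subscribers = {"a@x.com", "d@x.com"}
print(unique_emails & subscribers)     # intersection: {'a@x.com'}
print(unique_emails - subscribers)     # difference: {'b@x.com', 'c@x.com'}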

Usage

From some of the answers, it is clear that a list is somewhat faster than a set when iterating over the values. On the other hand, a set is faster than a list when checking whether an item is contained in it. Therefore, the only thing you can say is that a list is better than a set for some particular operations, and vice versa.

Peridium answered 11/7, 2019 at 6:18 Comment(0)
8

I was interested in the results when checking, with CPython, whether a value is one of a small number of literals. set wins in Python 3 vs tuple, list, and or:

from timeit import timeit

def in_test1():
  for i in range(1000):
    if i in (314, 628):
      pass

def in_test2():
  for i in range(1000):
    if i in [314, 628]:
      pass

def in_test3():
  for i in range(1000):
    if i in {314, 628}:
      pass

def in_test4():
  for i in range(1000):
    if i == 314 or i == 628:
      pass

print("tuple")
print(timeit("in_test1()", setup="from __main__ import in_test1", number=100000))
print("list")
print(timeit("in_test2()", setup="from __main__ import in_test2", number=100000))
print("set")
print(timeit("in_test3()", setup="from __main__ import in_test3", number=100000))
print("or")
print(timeit("in_test4()", setup="from __main__ import in_test4", number=100000))

Output:

tuple
4.735646052286029
list
4.7308746771886945
set
3.5755991376936436
or
4.687681658193469

For 3 to 5 literals, set still wins by a wide margin, and or becomes the slowest.

In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals. I couldn't distinguish the speed of tuple vs list.

When the values to test were cached in a global variable out of the function, rather than creating the literal within the loop, set won every time, even in Python 2.

These results apply to 64-bit CPython on a Core i7.

Haunting answered 25/10, 2019 at 20:20 Comment(2)
Your test is depending on implementation details here (and being messed with by them). By the natural rules of the language, the list and set cases would need to be rebuilt on every test (which would destroy their performance), and on older Python (definitely 2.x, not sure if older 3.x omitted the optimization) it does in fact rebuild the set literal on every pass, making it slower (Python 3 caches it as a constant frozenset to avoid the work). On both versions, your list test is actually being optimized to a tuple constant, so it's identical to the tuple case.Kinney
@Kinney Of course it depends on implementation details; that's the point of a benchmark, to check the performance of an implementation. This was a practical test to help deciding how to write these kinds of comparisons with CPython, which I've often run into.Haunting
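If you want to check whether your own CPython build applies the frozenset-constant optimization mentioned above, the standard dis module makes it visible; a small, version-dependent sketch (not part of the original benchmark):

import dis

def check(i):
    return i in {314, 628}

# On recent CPython 3.x the disassembly shows the set literal loaded as a
# single frozenset constant, so it is not rebuilt on every call.
dis.dis(check)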
3

Sets are faster; moreover, you get more functionality with sets. For example, say you have two sets:

set1 = {"Harry Potter", "James Bond", "Iron Man"}
set2 = {"Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"}

We can easily join two sets:

set3 = set1.union(set2)

Find out what is common in both:

set3 = set1.intersection(set2)

Find out what is in set1 but not in set2:

set3 = set1.difference(set2)

And much more! Just try them out; they are fun! Moreover, if you have to work with the values that differ between two lists, or the values common to two lists, I prefer converting the lists to sets, and many programmers do it that way. Hope it helps you :-)
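A small sketch of that list-to-set workflow, with made-up values for illustration:

checked_out = ["Dune", "Emma", "Dracula", "Emma"]
returned = ["Emma", "Dracula"]

checked_out_set = set(checked_out)     # conversion also drops the duplicate "Emma"
returned_set = set(returned)

print(checked_out_set & returned_set)  # common values: {'Emma', 'Dracula'}
print(checked_out_set - returned_set)  # values only in the first list: {'Dune'}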

Upswell answered 22/5, 2020 at 7:24 Comment(0)
1

I would recommend a set where the use case is limited to referencing or searching for existence, and a tuple where the use case requires you to perform iteration. A list is a low-level implementation and requires significant memory overhead.

Antistrophe answered 7/5, 2018 at 8:35 Comment(1)
Indeed, the distinction between when to use sets and when to use tuples is of utmost importance. I wouldn't be worried about the memory overhead or footprint unless I were scripting a lower-level API.Boyett
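If you do want to eyeball the container overhead being discussed, sys.getsizeof reports the size of the container object itself (not of the elements it references); a rough sketch, with results that vary by Python version and platform:

import sys

items = list(range(1000))
print("list :", sys.getsizeof(items))
print("tuple:", sys.getsizeof(tuple(items)))
print("set  :", sys.getsizeof(set(items)))   # the hash table is typically the largest of the three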
1

In the same vein as @Ellis Percival's tests, I'd like to add that lists perform similarly to sets when it comes to adding an element.

Adding an element

>>> def add_test_set(iterable):
...     for i in range(10000):
...         iterable.add(i)
...
>>> def add_test_list(iterable):
...     for i in range(10000):
...         iterable.append(i)
...
>>> timeit("add_test_set(iterable)",
...     setup="from __main__ import add_test_set; iterable = set()",
...     number=10000)
7.073143866999999
>>> timeit("add_test_list(iterable)",
...     setup="from __main__ import add_test_list; iterable = list()",
...     number=10000)
6.80650725000001

(I would have edited his post to include this but the edit queue was full)
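One semantic difference worth keeping in mind alongside these timings (a tiny sketch): appending to a list always grows it, while adding an element a set already contains is a no-op:

lst, s = [], set()
for value in [1, 2, 2, 3, 3, 3]:
    lst.append(value)
    s.add(value)

print(len(lst))  # 6: the list keeps every appended value, duplicates included
print(len(s))    # 3: the set silently ignores elements it already contains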

Backhand answered 31/5, 2021 at 8:55 Comment(0)
0
from datetime import datetime

# Build the three containers once, up front (list() is needed on Python 3,
# where range() is a lazy object with its own fast __contains__).
listA = list(range(10000000))
setA = set(listA)
tupA = tuple(listA)

def calc(data, container):
    # Time a single membership check against the given container.
    start = datetime.now()
    if data in container:
        print("")
    end = datetime.now()
    print(end - start)

calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)

Output after comparing 10 iterations for all 3: Comparison

Thetis answered 19/9, 2019 at 2:33 Comment(0)
0

Sets are very fast for checking whether a value is present, compared to lists.

A = list(range(1, 1000000))
A.pop(84559)  # remove one value so there is a missing element to find

def find_in_set(A):
    maxA = max(A)
    setA = set(A)                  # build the set once, outside the loop
    for i in range(1, maxA + 1):
        if i not in setA:          # O(1) average-case membership check
            return i

The find_in_list function below takes ages to find the missing value in list 'A'. However, the find_in_set function above takes just a couple of seconds.

def find_in_list(A):
    maxA = max(A)
    for i in range(1, maxA + 1):
        if i not in A:             # O(n) linear scan of the list each time
            return i

Here you can see a clear difference between sets and lists.

Aisne answered 13/6, 2023 at 15:1 Comment(0)
