What is the performance impact of non-unique indexes in pandas?
Asked Answered
S

2

62

From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.

From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix query is slow enough that it seems to be scanning the entire dataframe

In [23]: import numpy as np
In [24]: import pandas as pd
In [25]: x = np.random.randint(0, 10**7, 10**7)
In [26]: df1 = pd.DataFrame({'x':x})
In [27]: df2 = df1.set_index('x', drop=False)
In [28]: %timeit df2.ix[0]
1 loops, best of 3: 402 ms per loop
In [29]: %timeit df1.ix[0]
10000 loops, best of 3: 123 us per loop

(I realize the two ix queries don't return the same thing -- it's just an example that calls to ix on a non-unique index appear much slower)

Is there any way to coax pandas into using faster lookup methods like binary search on non-unique and/or sorted indices?

Staceystaci answered 18/5, 2013 at 15:44 Comment(0)
F
115

When index is unique, pandas use a hashtable to map key to value O(1). When index is non-unique and sorted, pandas use binary search O(logN), when index is random ordered pandas need to check all the keys in the index O(N).

You can call sort_index method:

import numpy as np
import pandas as pd
x = np.random.randint(0, 200, 10**6)
df1 = pd.DataFrame({'x':x})
df2 = df1.set_index('x', drop=False)
df3 = df2.sort_index()
%timeit df1.loc[100]
%timeit df2.loc[100]
%timeit df3.loc[100]

result:

10000 loops, best of 3: 71.2 µs per loop
10 loops, best of 3: 38.9 ms per loop
10000 loops, best of 3: 134 µs per loop
Foucquet answered 18/5, 2013 at 21:26 Comment(9)
I don't understand the timings at the end. df3 should be faster?Secularity
@Secularity I was confused too, but df1 uses the default index which goes from 0 to len(df1) - 1 and is unique, so df1.loc[] uses a hashtable. df2 sets the index to 'x' which is not unique and not sorted, so it does a linear scan, O(N). df3 is the same as df2 but sorted and still non-unique, so it does a binary search.Preussen
So why is the linear scan of df2 faster?Secularity
I don't get why pandas switches to binary search here. For multimaps, indexing can still be done in O(1+R), instead of O(logN + R) (where R is the number of results returned.Adda
When the index is unique, does it matter whether it's also sorted? And does the answer to this depend on whether the index is a MultiIndex?Pagel
Is this documented somewhere?Whitish
This timing comparison is actually very misleading, as the first statement df1.loc[100] does something quite different than the other two, namely retrieve the 100th row using the implicitly created RangeIndex, while the other two retrieve all rows with x == 100.Johnny
@Secularity 4 years late but df2 is not faster, speed is measured in milliseconds. Whereas for df1 & df3 the speed is measured in microseconds, easy to miss.Thule
Gosh, the world spins right again. Thank you @s_wheels.Secularity
B
45

@HYRY said it well, but nothing says it quite like a colourful graph with timings.

enter image description here

Plots were generated using perfplot. Code, for your reference:

import pandas as pd
import perfplot

_rnd = np.random.RandomState(42)

def make_data(n):    
    x = _rnd.randint(0, 200, n)
    df1 = pd.DataFrame({'x':x})
    df2 = df1.set_index('x', drop=False)
    df3 = df2.sort_index()

    return df1, df2, df3

perfplot.show(
    setup=lambda n: make_data(n),
    kernels=[
        lambda dfs: dfs[0].loc[100],
        lambda dfs: dfs[1].loc[100],        
        lambda dfs: dfs[2].loc[100],
    ],
    labels=['Unique index', 'Non-unique, unsorted index', 'Non-unique, sorted index'],
    n_range=[2 ** k for k in range(8, 23)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=False)
Bridgman answered 22/1, 2019 at 23:31 Comment(3)
I'm not seeing where you actually time the operations and am having trouble with timing pandas operations in general.Cosmonaut
@young_souvlaki I don't understand, the code is inlined in the answer under the graph, and you will need to install the perfplot library. For the actual methods being tested, check the make_data functions, then check the kernels arg to perfplot.showBridgman
Ah, perfplot is doing the timing.Cosmonaut

© 2022 - 2024 — McMap. All rights reserved.