What is the time complexity of .at and .loc in pandas?
Asked Answered
A

2

8

I'm looking for the time complexity of these methods as a function of the number of rows in a dataframe, n.

Another way of asking this question is: Are indexes for dataframes in pandas btrees (with log(n) time look ups) or hash tables (with constant time lookups)?

Asking this question because I'd like a way to do constant time look ups for rows in a dataframe based on a custom index.

Acorn answered 15/11, 2019 at 12:3 Comment(0)
D
2

From the Pandas Internals documentation, the default DataFrame index

Populates a dict of label to location in Cython to do O(1) lookups.

dict uses hash tables, supporting Peter Berg's answer to this question.

Dionysian answered 15/5, 2022 at 3:26 Comment(0)
A
5

Alright so it would appear that:

1) You can build your own index on a dataframe with .set_index in O(n) time where n is the number of rows in the dataframe

2) The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index. So accessing a row for the first time using that index takes O(n) time

3) All subsequent row access takes constant time.

So it looks like the indexes are hash tables and not btrees.

Acorn answered 15/11, 2019 at 18:14 Comment(8)
"The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index." Is it? Do you have a link to docs saying this?Alatea
@Alatea Wes here makes a few references to khash from klib which I believe they use for the indices.Beget
@Alatea No. Hence the preface "It would appear that". If I were able to find documentation on this I wouldn't have asked the question :)Acorn
I would also be curious about the specific behavior you are seeing that makes it appear that way. There isn't a lot of laziness in pandas so I'm just curious. And granted, you are asking a question but you are also posting an answer :)Alatea
Accessing the first row in a dataframe with ~10M rows using the custom index I built took ~3 seconds. Everything after that took milliseconds. Witnessed this behavior a few separate times. ¯\_(ツ)_/¯Acorn
Here's another related post.Beget
Interesting, any details you can add to your answer I think are valuableAlatea
@juanpa.arrivillaga, see my answer https://mcmap.net/q/1343032/-what-is-the-time-complexity-of-at-and-loc-in-pandas to this question.Dionysian
D
2

From the Pandas Internals documentation, the default DataFrame index

Populates a dict of label to location in Cython to do O(1) lookups.

dict uses hash tables, supporting Peter Berg's answer to this question.

Dionysian answered 15/5, 2022 at 3:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.