What is the time complexity of .at and .loc in pandas?

About

Asked 15/11, 2019 at 12:3 Answered 15/5, 2022 at 3:26

Solved python pandas performance data-structures time-complexity

I'm looking for the time complexity of these methods as a function of the number of rows in a dataframe, n.

Another way of asking this question is: Are indexes for dataframes in pandas btrees (with log(n) time look ups) or hash tables (with constant time lookups)?

Asking this question because I'd like a way to do constant time look ups for rows in a dataframe based on a custom index.

Acorn answered 15/11, 2019 at 12:3 Comment(0)

From the Pandas Internals documentation, the default DataFrame index

Populates a dict of label to location in Cython to do O(1) lookups.

dict uses hash tables, supporting Peter Berg's answer to this question.

Dionysian answered 15/5, 2022 at 3:26 Comment(0)

Alright so it would appear that:

1) You can build your own index on a dataframe with .set_index in O(n) time where n is the number of rows in the dataframe

2) The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index. So accessing a row for the first time using that index takes O(n) time

3) All subsequent row access takes constant time.

So it looks like the indexes are hash tables and not btrees.

Acorn answered 15/11, 2019 at 18:14 Comment(8)

"The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index." Is it? Do you have a link to docs saying this? – Alatea 15/11, 2019 at 18:35

@Alatea Wes here makes a few references to khash from klib which I believe they use for the indices. – Beget 15/11, 2019 at 18:51

@Alatea No. Hence the preface "It would appear that". If I were able to find documentation on this I wouldn't have asked the question :) – Acorn 15/11, 2019 at 20:55

I would also be curious about the specific behavior you are seeing that makes it appear that way. There isn't a lot of laziness in pandas so I'm just curious. And granted, you are asking a question but you are also posting an answer :) – Alatea 15/11, 2019 at 20:58

Accessing the first row in a dataframe with ~10M rows using the custom index I built took ~3 seconds. Everything after that took milliseconds. Witnessed this behavior a few separate times. ¯\_(ツ)_/¯ – Acorn 15/11, 2019 at 21:1

Here's another related post. – Beget 15/11, 2019 at 21:2

Interesting, any details you can add to your answer I think are valuable – Alatea 15/11, 2019 at 21:10

@juanpa.arrivillaga, see my answer https://mcmap.net/q/1343032/-what-is-the-time-complexity-of-at-and-loc-in-pandas to this question. – Dionysian 15/5, 2022 at 3:28

From the Pandas Internals documentation, the default DataFrame index

Populates a dict of label to location in Cython to do O(1) lookups.

dict uses hash tables, supporting Peter Berg's answer to this question.

Dionysian answered 15/5, 2022 at 3:26 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags