bisect.insort complexity not as expected

While trying to find the optimal data structure in Python 3 for a frontier problem I have to develop, I've just realised that the complexity of using the bisect module to do real-time ordered insertion is not O(n log n) as it should be, and seems to grow exponentially instead. I don't know the reason for it, so I felt like asking you guys in case you know something about it, since I find it really interesting.

I think I'm using the module correctly, so it shouldn't be a problem on my end. Anyway, here is the code used to insert node objects, with the insertion order determined by the random f value the nodes have:

bisect.insort(self._frontier, (node._f, node))

I was getting lots of objects inserted in the first few seconds, but not that many as time went on. Bakuriu suggested I ask this question, since he also found it interesting after doing some tests and ending up with the same results as me. The code he used to test it was the following:

python3 -m timeit -s 'import bisect as B; import random as R;seq=[]' 'for _ in range(100000):B.insort(seq, R.randint(0, 1000000))'

And these were his conclusions:

10k insertions are all fine (80 ms, and up to that point it basically scales linearly [keep in mind that it is O(n log n), so it's a little bit worse than linear]), but with 100k it takes forever instead of roughly 10 times longer. A list of 100k elements isn't really that big, and log(100k) is about 16, so it shouldn't be this slow.

Any help will be much appreciated!

Upchurch asked 27/10, 2018 at 15:20 Comments (2)
You should not time the initialization at all. Use a setup section and just time your insertions. BTW, Python 2 or Python 3? – Fadein
That was not the code I used but one from another user; I just ran a stress test and realised that in a few seconds it could insert almost 500,000 objects, but in 3 hours only 5,000,000, which seems kind of exponential. Python 3; I've updated the post. – Upchurch

You probably missed that the time complexity for insort is O(n), and this is documented clearly for bisect.insort_left():

Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.

Finding the insertion point is cheap, but inserting into a Python list is not, as the elements past the insertion point have to be moved up a step.
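
Conceptually, insort_left amounts to the following two steps (a simplified sketch, not the library's actual implementation):

import bisect

def insort_left_sketch(seq, x):
    i = bisect.bisect_left(seq, x)  # O(log n): binary search for the spot
    seq.insert(i, x)                # O(n): shifts every element after i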

Also see the TimeComplexity page on the Python Wiki, where list insertion is documented:

Insert O(n)

You can find the insertion point in O(log n) time, but the insertion step that follows is O(n), making this a rather expensive way to sort.

If you are using this to sort m elements, you have an O(m^2) (quadratic) solution for what should only take O(m log m) time with TimSort (the sorting algorithm used by the sorted() function).
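
Here is a minimal benchmark sketch to see the difference yourself, mirroring Bakuriu's test above (the exact timings will vary by machine):

import bisect
import random
import timeit

values = [random.randint(0, 1_000_000) for _ in range(100_000)]

def insort_sort():
    seq = []
    for v in values:
        bisect.insort(seq, v)  # O(n) per call, O(m^2) overall
    return seq

def append_then_sort():
    return sorted(values)      # single TimSort pass, O(m log m)

print('insort:', timeit.timeit(insort_sort, number=1))
print('sorted:', timeit.timeit(append_then_sort, number=1))

At this size the sorted() version should finish far faster than the insort loop.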

Cassirer answered 27/10, 2018 at 15:26 Comments (7)
I get it, thanks for explaining! So you suggest I insert all objects into the data structure and then call the sorted() function on that list? I thought that would give even worse results. – Upchurch
@jupcan: no, because appending to a list is O(1) per append, for a total of O(n); then sorting is O(n log n) (lower if your data is already close to sorted). That's an O(n log n) algorithm. – Cassirer
Terminological note: O(m^2) is considered polynomial, not exponential. – Sulphur
Indeed, I guess I'm just not used to the Python way of doing things; in Java, for example, calling a bubble sort or something similar would give worse results for a large amount of data. I have to make sure I don't call the sort function each time an object is inserted, right? Just at the end. – Upchurch
@jupcan: exactly. Mind you: if you are not going to use the full sorted list, and only the top K elements, consider using heapq. – Cassirer
@user2357112: too many distractions, too many mistakes. Swapped out for quadratic. – Cassirer
Yes, that's exactly what I was discussing with @user2357112. I have tried them and they were way better, but I still don't know if I'm just going to need the top element or all of them sorted. Thank you very much for your help, guys! – Upchurch

Binary search takes O(log n) comparisons, but insort isn't just a binary search. It also inserts the element, and inserting an element into a length-n list takes O(n) time.

The _frontier naming in your original code snippet suggests some sort of prioritized search algorithm. A heap would probably make more sense for that, or a SortedList from the sortedcontainers library.
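
For illustration, here is a minimal heap-based frontier sketch, assuming nodes expose an _f attribute as in your question (the counter is just a hypothetical tie-breaker so heapq never has to compare node objects directly):

import heapq
import itertools

class Frontier:
    def __init__(self):
        self._heap = []
        self._count = itertools.count()  # tie-breaker for equal f values

    def push(self, node):
        # O(log n), instead of insort's O(n) insertion step
        heapq.heappush(self._heap, (node._f, next(self._count), node))

    def pop(self):
        # O(log n); returns the node with the smallest f value
        _, _, node = heapq.heappop(self._heap)
        return node

Both operations are O(log n), so processing m nodes costs O(m log m) in total rather than O(m^2).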

Sulphur answered 27/10, 2018 at 15:22 Comments (3)
Oh, I see. I have also tried using a heap, but the problem is I only get the object with the minimum attribute at the first position of the list; the rest of the elements may or may not be ordered. – Upchurch
@jupcan: Unless you need indexed access to elements after the first, that's not an issue. For the kind of use you'd see in a search algorithm, heappop is all you need for element access. – Sulphur
Yeah, got you. Using heaps I get almost 6 million objects in a few seconds; the difference is huge. The problem is I still don't know if I'm going to need indexed access, but if not, heaps will be the way to go. Thanks a lot for your help! :) – Upchurch
