Once or twice a year, I run into the following issue: I have some type on which comparison operations might be expensive (e.g. the values are to big to keep in memory and need to be loaded form disk or equality is hard to compute because a single value may have many representations, think chemical formulas). This type is part of a nested data structure (e.g. nested lists or tuples or some tree). Sometimes I notice that the comparison operators (__lt__
etc.) on my type are called multiple times for the same values for a single comparison.
I'll try to illustrate the problem with the following example:
class X:
comparisons = 0
def __init__(self, value):
self.value = value
def __lt__(self, other):
return self.value < other.value
def __gt__(self, other):
return self.value > other.value
def __eq__(self, other):
X.comparisons += 1
return self.value == other.value
def nest_a_hundred_times(value):
for i in range(100): value = [value]
return value
print(nest_a_hundred_times(X(1)) < nest_a_hundred_times(X(0)))
print(X.comparisons)
In this example, X
is my type with expensive comparison operations, I'm just counting the number of times that __eq__()
is called but the other operations might be expensive as well. Two unequal values of the type are created and nested in single-element lists a bunch of times.
Running the example prints False
, 100
. So __eq__()
has been called 100 times.
I know why this happens: the built-in comparison function for list
objects compares individual list elements for equality first, to find at which index two lists differ, before comparing those elements for ordering. I think that it's actually impossible to avoid this problem when only using the six comparison operators (==
, !=
, <
, <=
, >
, >=
) as an interface between types defining an ordering. As an example for an alternative approach, Haskell has an Ord
class which defines an ordering
function to compare two values. This allows a nested data structure to be compared by calling ordering
only once on each node.
My question is: How do I avoid this problem in Python? Is it possible to avoid the problem with the comparison operators defined by Python alone, contrary to my belief? (I'm trying to avoid some sort of caching of results because this is not a performance problem, it's an algorithmic problem). Or do I need to build my own data structures (lists, tuples) and implement an ordering
function on them?
list.__lt__
call needs to perform both an==
call and a<
call on the next nested list. It's a performance regression associated with the switch from 3-way comparisons to rich comparisons. – Santosantonica__eq__
is called all those times. Can I suggest editing either an explanation or link (or both) into the question? – Tore