__eq__() called multiple times instead of once in nested data structure
Asked Answered
C

2

10

Once or twice a year, I run into the following issue: I have some type on which comparison operations might be expensive (e.g. the values are to big to keep in memory and need to be loaded form disk or equality is hard to compute because a single value may have many representations, think chemical formulas). This type is part of a nested data structure (e.g. nested lists or tuples or some tree). Sometimes I notice that the comparison operators (__lt__ etc.) on my type are called multiple times for the same values for a single comparison.

I'll try to illustrate the problem with the following example:

class X:
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value

    def __gt__(self, other):
        return self.value > other.value

    def __eq__(self, other):
        X.comparisons += 1
        return self.value == other.value

def nest_a_hundred_times(value):
    for i in range(100): value = [value]
    return value

print(nest_a_hundred_times(X(1)) < nest_a_hundred_times(X(0)))
print(X.comparisons)

In this example, X is my type with expensive comparison operations, I'm just counting the number of times that __eq__() is called but the other operations might be expensive as well. Two unequal values of the type are created and nested in single-element lists a bunch of times.

Running the example prints False, 100. So __eq__() has been called 100 times.

I know why this happens: the built-in comparison function for list objects compares individual list elements for equality first, to find at which index two lists differ, before comparing those elements for ordering. I think that it's actually impossible to avoid this problem when only using the six comparison operators (==, !=, <, <=, >, >=) as an interface between types defining an ordering. As an example for an alternative approach, Haskell has an Ord class which defines an ordering function to compare two values. This allows a nested data structure to be compared by calling ordering only once on each node.

My question is: How do I avoid this problem in Python? Is it possible to avoid the problem with the comparison operators defined by Python alone, contrary to my belief? (I'm trying to avoid some sort of caching of results because this is not a performance problem, it's an algorithmic problem). Or do I need to build my own data structures (lists, tuples) and implement an ordering function on them?

Clandestine answered 8/4, 2021 at 18:46 Comment(7)
The problem is that each list's list.__lt__ call needs to perform both an == call and a < call on the next nested list. It's a performance regression associated with the switch from 3-way comparisons to rich comparisons.Santosantonica
(You can see the code here.)Santosantonica
@user2357112supportsMonica Thanks. I'm aware of that, though.Clandestine
It's very good to have that fact mentioned somewhere here, though (or at least to link somewhere that offers an explanation), since the question can be much more confusing for people who don't understand why __eq__ is called all those times. Can I suggest editing either an explanation or link (or both) into the question?Tore
@DavidZ If you have edit privileges, please do so! Otherwise I'll see that I can edit the question and add that info myself sometime later.Clandestine
Done, thanks! I just wanted to make sure you had the opportunity to make the edit yourself if you wanted to, since it's your post and this is a more substantial edit than just fixing a typo.Tore
@DavidZ Thank you for being so kind. :) I get what you mean. I myself am wishing for a "propose this edit to the original author" feature available to everyone from time to time. Like a Pull Request on GitHub.Clandestine
P
0

From the way your question is posed I'm assuming:

  • you would prefer not to overwrite list.__eq__ if possible. (I also would not want to, this is a pretty dangerous idea).
  • You are okay with overwriting the dunder (__) methods of X if needed. (As far as I can tell it is needed).

As you kind of hinted at, because you are trying to work around the implementation of a builtin operation on a builtin type, I don't think any solution is going to be particularly clean (but hey maybe other answers will surprise me).

An interesting thing I found is that if you overwrite X.__eq__ to return True, it only gets called once.

class X:
    ...
    def __eq__(self, other):
        X.comparisons += 1
        return True

Now obviously this could create some issues, seeing as it would make X(1)==X(0) and nest_a_hundred_times(X(1)) == nest_a_hundred_times(X(0)). However, it would make < and > work and be much more efficient, so if you are positive you never need to use ==, I think this could be one way to go.

Beyond that, all I could come up with was an admittedly messy hack to try to detect whether __eq__ was called by < or '>' or by ==...

import inspect

class X:
    ...
    def __eq__(self, other):
        X.comparisons += 1
        f = inspect.currentframe().f_back
        fi = inspect.getframeinfo(f)
        line_called_from = fi[-2][0]
        called_from_lt = ('<' in line_called_from  or '>' in line_called_from) and '==' not in line_called_from and 'eq(' not in line_called_from
        if called_from_lt:
            return True
        return self.value == other.value
Perilymph answered 27/10, 2022 at 2:47 Comment(0)
S
0

Your best option given the way list comparisons work might be to either:

a) If possible, when first needed, summarize and cache each value with a unique hash (e.g., a sortable string) that can be compared as a proxy for the heavyweight data you would otherwise compare; or

b) maintain a cache/memoization of recent comparison results based on object identities, and if a hit is found, just return the same result.

Southernly answered 23/5, 2023 at 19:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.