std::set with user defined type, how to ensure no duplicates
Asked Answered
C

6

71

So I have an std::set which needs to keep specific ordering as well as not allowing duplicates of a user defined (by me) type. Now I can get the order to work correctly by overloading the '<' operator in my type. However, the set does not appropriately detect duplicates, and to be honest I'm not entirely sure how it does this internally. I have overloaded the '==' operator, but somehow im not sure this is what the set is actually using? So the question is how does the set determine duplicates when you add values? Here is the relevant code:

The user defined type:

//! An element used in the route calculation.
struct RouteElem {
    int shortestToHere; // Shortest distance from the start.
    int heuristic;      // The heuristic estimate to the goal.
    Coordinate position;
    bool operator<( const RouteElem& other ) const
    {
        return (heuristic+shortestToHere) < (other.heuristic+other.shortestToHere);
    }
    bool operator==( const RouteElem& other ) const
    {
        return (position.x == other.position.x && position.y == other.position.y);
    }
};

So the elements are equivalent when their position is equivalent, and an element is less than another if its combined functional is less than that of the other. The sorting works, but the set will accept two elements of the same position.

Chanel answered 11/7, 2009 at 22:50 Comment(0)
L
121

operator== is not used by std::set. Elements a and b are considered equal iff !(a < b) && !(b < a)

Lyontine answered 11/7, 2009 at 22:53 Comment(6)
I see, thats unfortunate. Is there another std container which might work better? Ie. allow me to define my own 'non-duplicate' criteria as well as sorting criteria?Chanel
@DeusAduro: It's not possible to separate the comparison function from sorting function. The reason is obvious. A set is a binary tree internally. If an element satisfies equality criteria (not bigger, nor smaller), it should be in the same place.Revocable
Hmm I understand, what I want to do would require a complete traversal of every node in the tree, to verify a different equality condition. Since this isn't can't be done and I need both 'equality' and 'sorting' criteria, I don't think a set is appropriate.Chanel
A set is probably inappropriate if you define equality in a different sense than ordering. Equality in set essentially means the two element will have the same place in sorted sequence of items.Revocable
@Paul: std::unique requires the elements to already be sorted such that equal elements are adjacent. I think that kind of begs the question considering the problem DeusAduro is trying to solve, which is that his criterion for duplication doesn't match his sort order.Maxinemaxiskirt
@mmx Assumed Item e can be inserted, question is, which place should it occupy ? which one comes first ? (important when we iterate it )Semiliquid
R
31

std::set supports specifying a comparison function. The default is less which will use operator < to check equality. You can define a custom function to check equality and use that one instead:

std::set<RouteElem, mycomparefunction> myset; 

Note that it's not possible to separate the comparison function from sorting function. std::set is a binary tree and if an element in a binary tree is not bigger nor smaller than a specific element, it should be in the same place. It does something like this in the place finding algorithm:

if (a < b) {
    // check the left subtree
} else if (b < a) {
    // check the right subtree
} else {
    // the element should be placed here.
}
Revocable answered 11/7, 2009 at 22:56 Comment(5)
The comparison function in the set template parameters is a less-than comparison in case the things you're comparing don't already have such a comparison member function. This is not an equality test function. See: sgi.com/tech/stl/set.htmlRoundshouldered
Indeed. It uses the less than function both for ordering and as a side effect, equality.Revocable
That's true, and as a consequence it doesn't solve the problem stated in the question. All it does is move the less-than function implementation elsewhere.Roundshouldered
No, it doesn't. This is the nature of set class. I don't think this data structure suits the problem (for the reason I mentioned in the last paragraph).Revocable
Yes, maybe not exactly what I needed to know! But informative none the less. I think I will be applying my own type, I need a good amount of speed in this application (2-d routing algorithm), none of the std containers seem particularly appropriate. At least not without applying additional sorting/uniqueness criteria outside of the containers normal functionality.Chanel
M
8

rlbond's comparator does not prevent the insertion of elements which compare equal. Apparently it's difficult to prove this in comments, given the character limit, because rlbond appears to thinks that std::set guarantees that it will never contain two elements with !compare(a,b) && !compare(b,a) for his comparator. However, rlbond's comparator does not define a strict order, and therefore is not a valid parameter to std::set.

#include <set>
#include <iostream>
#include <iterator>
#include <algorithm>

struct BrokenOrder {
    int order;
    int equality;

    public:
    BrokenOrder(int o, int e) : order(o), equality(e) {}

    bool operator<(const BrokenOrder &rhs) const {
        return order < rhs.order;
    }
    bool operator==(const BrokenOrder &rhs) const {
        return equality == rhs.equality;
    }
};

std::ostream &operator<<(std::ostream &stream, const BrokenOrder &b) {
    return stream << b.equality;
}

// rlbond's magic comparator
struct LessThan : public std::binary_function<BrokenOrder, BrokenOrder, bool> {
    bool operator()(const BrokenOrder& lhs, const BrokenOrder& rhs) const
    {
        return !(lhs == rhs) && (lhs < rhs);
    }
};

int main() {
    std::set<BrokenOrder,LessThan> s;
    for (int i = 0; i < 5; ++i) {
        s.insert(BrokenOrder(i,i));
    }
    for (int i = 0; i < 5; ++i) {
        s.insert(BrokenOrder(10-i,i));
    }
    std::copy(s.begin(), s.end(), 
        std::ostream_iterator<BrokenOrder>(std::cout, "\n"));
}

Output:

0
1
2
3
4
3
2
1
0

Duplicates. The magic comparator has failed. Different elements in the set have the same value of equality, and hence compare the same with operator==, because during insertion the set never compared the new element against its duplicate. The only duplicate which was excluded was 4, because the two 4's had sort orders 4 and 6. This put them close enough together in the set to be compared against each other.

From the C++ standard: 25.3:3 "For the algorithms to work correctly, comp has to induce a strict weak ordering on the values".

25.3:4 "... the requirements are that comp and equiv both be transitive relations:

comp(a,b) && comp(b,c) implies comp(a,c)"

Now, consider the elements a = BrokenOrder(1,1), b = BrokenOrder(2,2), and c = BrokenOrder(9,1), and comp of course equal to the magic comparator. Then:

  • comp(a,b) is true since 1 != 2 (equality) and 1 < 2 (order)
  • comp(b,c) is true since 2 != 1 (equality) and 2 < 9 (order)
  • comp(a,c) is false since 1 == 1 (equality)
Maxinemaxiskirt answered 11/7, 2009 at 22:50 Comment(3)
I stand corrected. You're right, it's hard to understand what were you were trying to say in comments. I understand now.Guffaw
@rlbond: I wasn't expecting that you'd delete your answer: what you say about A* looked really useful, since it showed that the OP didn't actually want to solve the problem he posed (i.e. he didn't want to exclude duplicates, he wanted to choose the best of duplicates).Maxinemaxiskirt
@rlbond: well done, one of the best explanations of strict weak ordering I've read.Mundford
R
7

The STL set implementation does something conceptually like this to detect equality:

bool equal = !(a < b) && !(b < a);

That is, if two elements are both not less than the other, then they must be equal. You may be able to check this by setting a breakpoint on your operator==() method and checking to see whether it ever gets called at all.

I would generally be suspicious of comparison operators that check completely different things. Your < operator is defined in terms of two things that are separate from how your == operator is defined. Generally you will want such comparisons to use consistent information.

Roundshouldered answered 11/7, 2009 at 22:54 Comment(0)
M
2

You could try something like the following:

//! An element used in the route calculation.
struct RouteElem {
    int shortestToHere; // Shortest distance from the start.
    int heuristic;              // The heuristic estimate to the goal.
    Coordinate position;
    bool operator<( const RouteElem& other ) const
    {
      return (heuristic+shortestToHere) < (other.heuristic+other.shortestToHere);
    }
    bool operator==( const RouteElem& other ) const
    {
      return (position.x == other.position.x && position.y == other.position.y);
    }
};

struct CompareByPosition {
    bool operator()(const RouteElem &lhs, const RouteElem &rhs) {
        if (lhs.position.x != rhs.position.x) 
            return lhs.position.x < rhs.position.x;
        return lhs.position.y < rhs.position.y;
    }
};

// first, use std::set to remove duplicates
std::set<RouteElem,CompareByPosition> routeset;
// ... add each RouteElem to the set ...

// now copy the RouteElems into a vector
std::vector<RouteElem> routevec(routeset.begin(), routeset.end());

// now sort via operator<
std::sort(routevec.begin(), routevec.end());

Obviously there's the copy in the middle, which looks slow. But any structure which indexes items by two different criteria is therefore going to have some kind of extra overhead per item compared with a set. The whole of the code above is O(n log n), assuming that your implementation of std::sort uses introsort.

If you have it, under this scheme you could use unordered_set instead of set to do the initial uniqueifying. Since the hash would only have to depend on x and y, it should be faster than the O(log N) comparisons required to insert into a set.

Edit: just noticed that you said you wanted to "keep" sort order, not that you wanted to process everything in a batch. Sorry about that. If you want to efficiently maintain order and exclude duplicates while adding elements, then I would recommend using the set or unordered set I define above, based on position, and also a std::multiset<RouteElem>, which will maintain the operator< order. For each new element, do:

if (routeset.insert(elem).second) {
    routemultiset.insert(elem);
}

Although beware that this offers no exception guarantee. If the second insert throws, then the routeset has been modified, so the state is no longer consistent. So I guess really you need:

if (routeset.insert(elem).second) {
    try {
        routemultiset.insert(elem); // I assume strong exception guarantee
    } catch(...) {
        routeset.erase(elem); // I assume nothrow. Maybe should check those.
        throw;
    }
}

Or an equivalent with RAII, which will be more verbose if there's only one place in your code you ever use the RAII class, but better if there's much repetition.

Maxinemaxiskirt answered 11/7, 2009 at 23:27 Comment(4)
At the same time, implementing a linear array I think I can do better. An insert operation must first check for duplicate, given that the sorted order does not correlate with the 'duplicate' criteria this must be O(n). If the element is not within the list we must now find the appropriate position (binary search -> O(log n). So the total time is O(n + log n) + at worst O(n) shifts.Chanel
Possibly I misunderstand what you're suggesting, but if each insert operation involves an O(n) check for duplicate, then the total is O(n^2), since there are n items to insert.Maxinemaxiskirt
Possibly its my misunderstanding of Big O notation. But if mine is O(n^2) by definition then yours will be as well, at least in my use. Basically I add a single element to the list, after which it needs to be sorted. So using the implementation above I would be converting b/w set and vector for each operation, so we would have O(n^2) copy operations for n elements. Although yes the above implementation is better if I was to add all the elements in at a single point, then do the copy/sort only once.Chanel
It was me who was confused on that point, you're quite right that my first suggestion is rubbish for maintaining the sort across multiple separate insert operations. I think I've addressed it now in an edit. My Big-O was for inserting all n elements, yours was for inserting 1. My new suggestion is O(log N) for inserting a new element while rejecting duplicates and maintaining the order, but of uses roughly twice as much memory as a single container.Maxinemaxiskirt
G
0

Beware of the ramifications of this. It looks like you are trying to do something like A*, and if you try to insert a "duplicate" it will be ignored, even if there is a "better" route.

NOTE: This solution doesn't work, see onebyone's explanation below

struct RouteElem 
{
    int shortestToHere; // Shortest distance from the start.
    int heuristic;              // The heuristic estimate to the goal.
    Coordinate position;
    bool operator<( const RouteElem& other ) const
    {
        return (heuristic+shortestToHere) < (other.heuristic+other.shortestToHere);
    }
    bool operator==( const RouteElem& other ) const
    {
        return (position.x == other.position.x && position.y == other.position.y);
    }
};

struct RouteElemLessThan : public std::binary_function<RouteElem, RouteElem, bool>
{
    bool operator()(const RouteElem& lhs, const RouteElem& rhs) const
    {
        return !(lhs == rhs) && (lhs < rhs);
    }
};

std::set<RouteElem, RouteElemLessThan> my_set;
Guffaw answered 11/7, 2009 at 23:45 Comment(8)
"hen you insert a new RouteElem, it will be detected as equal to one already there if they are equal using operator==". Not necessarily. The log(N) comparisons performed to place the new element into the tree might just so happen not to include a comparison against the element it's equal to.Maxinemaxiskirt
I agree with onebyone here, like some of the other posters mentioned given that the set is a tree internally the equivalent element might be missed. As far as your last sentence goes, I am doing A*. I'm not sure I follow what you are saying. It sounds like you are saying that if I add a duplicate to the open list, thats not an issue? That would make things simpler for sure... could you elaborate?Chanel
@onebyone look at the comparison functor being used.Guffaw
@Chanel if you're trying to do A*, you're doing it wrong. You want to use the std::priority_queue adaptor for the open set and a std::set for the closed set. Then, you take the top element of the open set; if it's in the closed set already, discard it, otherwise, process it. The problem is that the rules for A* says that in the open set, already existing nodes should be REPLACED by ones with better g values (or equivalently, f values). If you use std::set, "worse" routes will never be replaced by better ones.Guffaw
There are other ways to do it with a std::set, but a priority queue is a lot easier IMO. You should be able to do it in around 30-40 lines for the main body function with priority_queue.Guffaw
Yes thank you, I have the algorithm implemented now using my own defined 'OrderedSet'. The actual algorithm is only 50 lines. And yes I definitely check if the node is already within the open list, if so I update it to be the one with the best functional.Chanel
@rlbond: Yes, I looked at the comparator. I also considered the structure of a binary tree. An element inserted into a tree is only compared against about log N of those already there, not all N. What reason is there to believe that one of those comparisons is with the element that's equal via operator== to the one being added? If the new element only has to be compared with non-equal elements in order to place it in the order, but there is an equal element elsewhere in the tree, then your comparator is no different than just using operator<.Maxinemaxiskirt
Please read it yourself, paying close attention to the requirements on the comparator parameter (follow up the reference to 23.1.2 "Associative containers"). If you still don't get it, try my sample code: #1115356Maxinemaxiskirt

© 2022 - 2024 — McMap. All rights reserved.