What is wrong with `std::set`?
Asked Answered
F

7

21

In the other topic I was trying to solve this problem. The problem was to remove duplicate characters from a std::string.

std::string s= "saaangeetha";

Since the order was not important, so I sorted s first, and then used std::unique and finally resized it to get the desired result:

aeghnst

That is correct!


Now I want to do the same, but at the same time I want the order of characters intact. Means, I want this output:

sangeth

So I wrote this:

template<typename T>
struct is_repeated
{
    std::set<T>  unique;
    bool operator()(T c) { return !unique.insert(c).second; }
}; 
int main() {
    std::string s= "saaangeetha";
    s.erase(std::remove_if(s.begin(), s.end(), is_repeated<char>()), s.end()); 
    std::cout << s ;
}

Which gives this output:

saangeth

That is, a is repeated, though other repetitions gone. What is wrong with the code?

Anyway I change my code a bit: (see the comment)

template<typename T>
struct is_repeated
{
    std::set<T> & unique;  //made reference!
    is_repeated(std::set<T> &s) : unique(s) {} //added line!
    bool operator()(T c) { return !unique.insert(c).second; }
}; 
int main() {
    std::string s= "saaangeetha";
    std::set<char> set; //added line!
    s.erase(std::remove_if(s.begin(),s.end(),is_repeated<char>(set)),s.end()); 
    std::cout << s ;
}

Output:

sangeth

Problem gone!

So what is wrong with the first solution?

Also, if I don't make the member variable unique reference type, then the problem doesn't go.

What is wrong with std::set or is_repeated functor? Where exactly is the problem?

I also note that if the is_repeated functor is copied somewhere, then every member of it is also copied. I don't see the problem here!

Furie answered 22/3, 2011 at 20:45 Comment(4)
As an aside, sorting the data is O(N log N). There is a much more efficient algorithm for doing this. Make a linear pass of the string, counting the occurrences of each character. Make a second pass of the string and only output if the count is 1. This is O(N).Drayton
You do not need two passes. Simply make a bool array and check if bFound[c] == true if false you append it to the result string and set bFound[c]=true. This way you build up a filter and use it in one pass.Trite
@Jeff: Or use an unordered_set instead of a set.Extortioner
@Kenny: Unfortunately, that is not working. It still reorders :-sFurie
E
15

In GCC (libstdc++), remove_if is implemented essentially as

    template<typename It, typename Pred>
    It remove_if(It first, It last, Pred predicate) {
      first = std::find_if(first, last, predicate);
    //                                  ^^^^^^^^^
      if (first == last)
         return first;
      else {
         It result = first;
         ++ result;
         for (; first != last; ++ first) {
           if (!predicate(*first)) {
    //          ^^^^^^^^^
              *result = std::move(*first);
              ++ result;
           }
         }
      }
    }

Note that your predicate is passed by-value to find_if, so the struct, and therefore the set, modified inside find_if will not be propagated back to caller.

Since the first duplicate appears at:

  saaangeetha
//  ^

The initial "sa" will be kept after the find_if call. Meanwhile, the predicate's set is empty (the insertions within find_if are local). Therefore the loop afterwards will keep the 3rd a.

   sa | angeth
// ^^   ^^^^^^
// ||   kept by the loop in remove_if
// ||
// kept by find_if
Extortioner answered 22/3, 2011 at 21:7 Comment(3)
+1. This explains the behaviour. It all depends on the implementation of std::remove_if.Furie
Alright. The bottomline of all these analysis, and learning is that : functor must not have state. Right?Furie
Not necessarily. For example, the for_each algorithm accepts a functor and then returns it after applying it everywhere, with the understanding that it should indeed be modified. However, predicates and comparison functors should not have state, since they're supposed to represent mathematical comparisons, which should not mutate over time.Holbert
H
17

Functors are supposed to be designed in a way where a copy of a functor is identical to the original functor. That is, if you make a copy of one functor and then perform a sequence of operations, the result should be the same no matter which functor you use, or even if you interleave the two functors. This gives the STL implementation the flexibility to copy functors and pass them around as it sees fit.

With your first functor, this claim does not hold because if I copy your functor and then call it, the changes you make to its stored set do not reflect in the original functor, so the copy and the original will perform differently. Similarly, if you take your second functor and make it not store its set by reference, the two copies of the functor will not behave identically.

The reason that your final version of the functor works, though, is because the fact that the set is stored by reference means that any number of copies of tue functor will behave identically to one another.

Hope this helps!

Holbert answered 22/3, 2011 at 20:53 Comment(4)
You mean std::remove_if makes copy of the functor? Can you show me the code? This implementation of std::remove_if doesn't make copy : cplusplus.com/reference/algorithm/remove_ifFurie
That really depends on your implementation of the STL; I don't think that the standard says whether or not it's supposed to or must make a copy. However, I'm pretty sure that the standard says that it can make a copy, and based on the output it very much seems like it is doing just that.Holbert
The answer is correct. In general your functors should not have state because they can get copied.Klopstock
@Nawaz, that implementation you referenced makes at least one copy; Predicate pred is passed by value. However, based on your results, I suspect you are using VS2010 or related; that implementation of remove_if first calls find_if, then an internal implementation of remove_if. That makes two additional copies of pred which would explain your results.Twosome
E
15

In GCC (libstdc++), remove_if is implemented essentially as

    template<typename It, typename Pred>
    It remove_if(It first, It last, Pred predicate) {
      first = std::find_if(first, last, predicate);
    //                                  ^^^^^^^^^
      if (first == last)
         return first;
      else {
         It result = first;
         ++ result;
         for (; first != last; ++ first) {
           if (!predicate(*first)) {
    //          ^^^^^^^^^
              *result = std::move(*first);
              ++ result;
           }
         }
      }
    }

Note that your predicate is passed by-value to find_if, so the struct, and therefore the set, modified inside find_if will not be propagated back to caller.

Since the first duplicate appears at:

  saaangeetha
//  ^

The initial "sa" will be kept after the find_if call. Meanwhile, the predicate's set is empty (the insertions within find_if are local). Therefore the loop afterwards will keep the 3rd a.

   sa | angeth
// ^^   ^^^^^^
// ||   kept by the loop in remove_if
// ||
// kept by find_if
Extortioner answered 22/3, 2011 at 21:7 Comment(3)
+1. This explains the behaviour. It all depends on the implementation of std::remove_if.Furie
Alright. The bottomline of all these analysis, and learning is that : functor must not have state. Right?Furie
Not necessarily. For example, the for_each algorithm accepts a functor and then returns it after applying it everywhere, with the understanding that it should indeed be modified. However, predicates and comparison functors should not have state, since they're supposed to represent mathematical comparisons, which should not mutate over time.Holbert
M
5

Not really an answer, but as another interesting tidbit to consider, this does work, even though it uses the original functor:

#include <set>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>

template<typename T>
struct is_repeated {
    std::set<T>  unique;
    bool operator()(T c) { return !unique.insert(c).second; }
}; 
int main() {
    std::string s= "saaangeetha";
    std::remove_copy_if(s.begin(), s.end(), 
                        std::ostream_iterator<char>(std::cout), 
                        is_repeated<char>());
    return 0;
}

Edit: I don't think it affects this behavior, but I've also corrected a minor slip in your functor (operator() should apparently take a parameter of type T, not char).

Malposition answered 22/3, 2011 at 21:7 Comment(0)
M
4

I suppose the problem could lie in that the is_repeated functor is copied somewhere inside the implementation of std::remove_if. If that is the case, the default copy constructor is used and this in turn calls std::set copy constructor. You end up with two is_repeated functors possibly used independently. However as the sets in both of them are distinct objects, they don't see the mutual changes. If you turn the field is_repeated::unique to a reference, then the copied functor still uses the original set which is what you want in this case.

Manos answered 22/3, 2011 at 20:51 Comment(5)
I edited my question. Please read the last line where I said : if the is_repeated functor is copied somewhere, then every member of it is also copied. I don't see the problem here!Furie
As I wrote, if the members are copied you have two unique sets. Both of them could be empty initially. Then both of them may be used to check whether character a is present.Manos
@KennyTM: But copy doesn't cause the problem. I mean when the function copied, every member of it is copied, right?Furie
a) instance no. 1 is copied into instance no. 2 b) instance no. 1 is testing a and adds it to its unique set. c) instance no. 2 does the same. As it uses another set, it still doesn't contain any a character and returns a result that causes the problem.Manos
@Nawaz: No it does, since the copy is done on the empty set. See my answer.Extortioner
K
2

Functor classes should be pure functions and have no state of their own. See item 39 in Scott Meyer's Effective STL book for a good explanation on this. But the gist of it is that your functor class may be copied 1 or more times inside the algorithm.

Klopstock answered 22/3, 2011 at 21:21 Comment(0)
V
1

The other answers are correct, in that the issue is that the functor that you are using is not copyable safe. In particular, the STL that comes with gcc (4.2) implements std::remove_if as a combination of std::find_if to locate the first element to delete followed by a std::remove_copy_if to complete the operation.

template <typename ForwardIterator, typename Predicate>
std::remove_if( ForwardIterator first, ForwardIterator end, Predicate pred ) {
   first = std::find_if( first, end, pred ); // [1]
   ForwardIterator i = it;
   return first == last? first 
          : std::remove_copy_if( ++i, end, fist, pred ); // [2]
}

The copy in [1] means that the first element found is added to the copy of the functor and that means that the first 'a' will be lost in oblivion. The functor is also copied in [2], and that would be fine if it were not because the original for that copy is an empty functor.

Vacuole answered 22/3, 2011 at 21:38 Comment(0)
T
1

Depending on the implementation of remove_if can make copies of your predicate. Either refactor your functor and make it stateless or use Boost.Ref to "for passing references to function templates (algorithms) that would usually take copies of their arguments", like so:

#include <set>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>

#include <boost/ref.hpp>
#include <boost/bind.hpp>

template<typename T>
struct is_repeated {
    std::set<T>  unique;
    bool operator()(T c) { return !unique.insert(c).second; }
}; 

int main() {
    std::string s= "saaangeetha";
    s.erase(std::remove_if(s.begin(), s.end(), boost::bind<bool>(boost::ref(is_repeated<char>()),_1)), s.end());
    std::cout << s;

    return 0;
}
Teacake answered 22/3, 2011 at 23:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.