Determining if an unordered vector<T> has all unique elements

Profiling my CPU-bound code has suggested that I spend a long time checking whether a container contains completely unique elements. Assuming that I have some large container of unsorted elements (with < and == defined), I have two ideas on how this might be done:

The first using a set:

template <class T>
bool is_unique(vector<T> X) {
  set<T> Y(X.begin(), X.end());
  return X.size() == Y.size();
}

The second looping over the elements:

template <class T>
bool is_unique2(vector<T> X) {
  typename vector<T>::iterator i,j;
  for(i=X.begin();i!=X.end();++i) {
    for(j=i+1;j!=X.end();++j) {
      if(*i == *j) return 0;
    }
  }
  return 1;
}

I've tested them as best I can, and from what I can gather from reading the documentation about the STL, the answer is (as usual): it depends. I think that in the first case, if all the elements are unique, it is very quick, but if there is a large degeneracy the operation seems to take O(N^2) time. For the nested-iterator approach the opposite seems to be true: it is lightning fast if X[0]==X[1], but takes (understandably) O(N^2) time if all the elements are unique.

Is there a better way to do this, perhaps an STL algorithm built for this very purpose? If not, are there any suggestions to eke out a bit more efficiency?

Rillet answered 4/5, 2010 at 21:41 Comment(6)
Should the container be allowed to contain duplicates at all? Perhaps you need a set, not a vector?Vin
Your implementation of is_unique would be faster if it took a const vector<T>& as its argument instead of accepting its argument by value. That way, you avoid making a copy of the vector and then also copying that copy into a set.Fassett
@Neil, the container needs random access (hence the vector) for other parts of the code.Rillet
@Rillet If possible, you may want to consider keeping the vector sorted all the time as you construct it. That would make is_unique (whether implemented by std::set or std::unique) run in linear time. By keeping the vector sorted, you spread out the work over time, and have to "pay" for some of the work only once per element, rather than taking a big hit by having to re-compute everything each time you call is_unique.Fassett
Can you define a hash function on your element type? If so, using a hash-table instead of a binary-tree based set should work very well, particularly if you test for membership on every insert instead of adding everything first and then checking. Note that standard STL doesn't have any hash-based containers, although hash_set is present as an extension; or you could use Boost's.Mertiemerton
In addition to tzaman's comment, I think you should also consider the average size of your dataset. Are you spending time checking for collisions in thousands of vectors of dozens of entries each, or dozens of vectors of thousands of entries each? Read any paper on search or sort algorithms written in the last 15 years, and it's hard not to find one that talks about how this or that algorithm outpaces the current champion by such and such percent in a particular list size range (say, 10k to 50k entries).Uranium

Your first example should be O(N log N), as std::set insertion takes O(log N) time. I don't think a faster big-O is possible.

The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.

It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.

template< class T >
bool dereference_less( T const *l, T const *r )
 { return *l < *r; } 

template <class T>
bool is_unique(vector<T> const &x) {
    vector< T const * > vp;
    vp.reserve( x.size() );
    for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
    return adjacent_find( vp.begin(), vp.end(),
           not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor"
        == vp.end(); // true iff no adjacent pair (vp_n, vp_n+1) fails *vp_n < *vp_n+1, i.e. no duplicates
}

or in STL style,

template <class I>
bool is_unique(I first, I last) {
    typedef typename iterator_traits<I>::value_type T;
    …

And if you can reorder the original vector, of course,

template <class T>
bool is_unique(vector<T> &x) {
    sort( x.begin(), x.end() ); // O(N log N)
    return adjacent_find( x.begin(), x.end() ) == x.end();
}
Glyphography answered 4/5, 2010 at 21:51 Comment(8)
Putting things in as pointers is not a bad idea. This could be applied to either of the set based algorithms.Pneumoconiosis
std::ajacent_find is a much better idea than std::unique.Tarnish
This is, by far, the consistently fastest (or nearly so) example posted. Would it be too much to ask you to elaborate on the "STL-style" so that I (and others) could see how you would do it? Also, running your first example gives "error: invalid conversion from ‘const int*’ to ‘int*’" on the &x; removing the const seemed to do the trick.Rillet
Am I missing something? This edit gives the error "error: no match for call to ‘(std::binary_negate<dereference_less>) (int*&, int*&)’" and does not compile.Rillet
@Hooked: sorry, I really should've tested it. There was a const correctness error (missing a few consts rather than having one extra) and an issue with the functor being incompatible with not2… here I switched to the ptr_fun functor generator.Glyphography
The "STL-style" prototype is more like the way the functions in <algorithm> are written. It takes a pair of iterators rather than a container. Just substitute in those three lines and you should be good to go. (But I didn't test it ;v) .)Glyphography
Faster Big-O can be achieved by using a hash-table based set implementation, at the cost of some space overhead. O(1) inserts for n elements == O(n)Mertiemerton
@tzaman: correct, I still forget about hashtables! However the question specifies operator< rather than a hash function.Glyphography

You must sort the vector if you want to quickly determine if it has only unique elements. Otherwise the best you can do is O(n^2) runtime or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted.

template<class Fwd>
bool is_unique(Fwd first, Fwd last)
{
    return adjacent_find(first, last) == last;
}

then have the client sort the vector, or make a sorted copy of the vector. This opens a door for dynamic programming: if the client sorted the vector in the past, they have the option to keep and refer to that sorted vector so they can repeat this operation in O(n) time.

Horwitz answered 4/5, 2010 at 21:53 Comment(2)
+1: better than std::unique. Also an implementation in the spirit of STL.Hymenopterous
Presumably Fwd was meant to be In, or vice-versa?Platonic

The standard library has std::unique, but that would require you to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value).

template <typename T>
bool is_unique(std::vector<T> vec)
{
    std::sort(vec.begin(), vec.end());
    return std::unique(vec.begin(), vec.end()) == vec.end();
}

Whether this would be faster than using a std::set would, as you know, depend :-).

Tarnish answered 4/5, 2010 at 21:47 Comment(12)
Asymptotically speaking, this has the same space and time requirements as the is_unique in the question. They are both O(n) in space and O(n log n) in time. That is to say, their run times are dominated by sorting (explicit sorting in your example, and the sorting internal to std::set in the OP's). My suggestion would be to try both and pick whichever happens to be faster in practice.Fassett
I must be getting old. Why couldn't I see that? You didn't slip in an edit just under the 5 min. mark, did you?Billhook
@Tyler actually, no, worse, this is O(n^2) runtime and O(n) space. The worst case of std::sort() is O(n^2).Horwitz
@WilhelmTell: Technically sort() has no worst-case requirement at all; it does have the average-case n log n requirement, and if you really thought you'd run into perverse cases where you're not going to get real-world n log n, you can use stable_sort(), which does have an n log n upper bound requirement.Tarnish
@James correct: 25.3.1.1 paragraph 3. The only guarantee is that the average is N*log N. The cplusplus.com website misled me! cplusplus.com/reference/algorithm/sort says the worst case of sort is O(n^2)Horwitz
@WilhelmTell of Purple-Magenta: That's because std::sort can use something like quicksort, which has O(n^2) worst-case complexity, while std::stable_sort is usually implemented using a form of merge sort.Ataractic
@WilhelmTell: I've come to the conclusion that cplusplus.com is of pretty poor quality. I've reported a half dozen errors or so over the last few months and none of them have been corrected.Tarnish
This is stupid fast for a simple type <T>==<int>, far faster than any other example (the copying is so cheap). In general this may not be the case for me, but I'll have to keep this in mind for future use!Rillet
@James Seriously? Making a mistake is not nearly as bad as not correcting the mistake! Do they reply at least?Horwitz
@WilhelmTell: I got no response at all, which is what irritated me. I don't mind reporting bugs and explaining why they need to be corrected, but if they're not going to fix them... eh. That said, I still use cplusplus.com when I need a quick lookup on something :-PTarnish
@James truth, same here. Even if Google works by voting with the mouse or the HTML. :sHorwitz
The worst case of std::sort is O(n log n), as explained here: https://mcmap.net/q/346245/-what-is-the-time-complexity-of-std-sort-in-the-c-standard-libraryGuesthouse

Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...

Understrapper answered 4/5, 2010 at 21:48 Comment(0)

For one thing you could combine the advantages of both: stop building the set if you have already discovered a duplicate:

template <class T>
bool is_unique(const std::vector<T>& vec)
{
    std::set<T> test;
    for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
        if (!test.insert(*it).second) {
            return false;
        }
    }
    return true;
}

BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.


You could of course potentially do much better if it wasn't generic. E.g. if you had a vector of integers of known range, you could just mark in an array (or even a bitset) whether an element exists.

Hymenopterous answered 4/5, 2010 at 21:50 Comment(6)
But that's still very expensive because set uses dynamic allocations. You're essentially building a set when you don't need it. So the solution is correct mathematically, but expensive in practice.Horwitz
@WilhelmTell: Yes, but when you start by sorting the vector, you are gonna have to sort it all, which falls under the same worst case scenario as OP's #1. Also, sorting a vector can be expensive, if T is expensive to swap. - This is about finding a middle way between the worst cases of either approach. All in all, it will depend heavily on the types involved and the nature of the data - how often there are or aren't duplicates.Hymenopterous
Adding all the elements to a vector and sorting it is generally faster than inserting elements and keeping sorted order...Pugilist
I'd have to agree with WilhelmTell based on testing a worst-case environment where all the elements are already sorted. In this case it seems to run much slower than the first example, when in theory it should be about the same speed.Rillet
Or just create another vector of the same length as the original vector, but made up of T* elements. Then sort those via dereferencing.Understrapper
Or do a std::sort on the original container with a custom predicate that acts normal but throws an exception as soon as it compares two equal (not just equivalent, although you could do that too) values. If you catch the exception then you know it's not unique already, otherwise it's unique.Florentinaflorentine

You can use std::unique, but it requires the range to be sorted first:

template <class T>
bool is_unique(vector<T> X) {
  std::sort(X.begin(), X.end());
  return std::unique(X.begin(), X.end()) == X.end();
}

std::unique modifies the sequence and returns an iterator to the end of the unique set, so if that's still the end of the vector then it must be unique.

This runs in O(n log n), the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time - but that requires that your elements be hashable as well as having operator== defined, which might not be so easy.

Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.

Afebrile answered 4/5, 2010 at 21:51 Comment(0)

If I may add my own 2 cents.

First of all, as @Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.

Second, there are two strategies available:

  1. Simply ensure no duplicate is inserted in the first place. This means, of course, controlling the insertion, which is generally achieved by creating a dedicated class (with the vector as an attribute).
  2. Whenever the property is needed, check for duplicates.

I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.

Anyway, there are a number of ways depending on the requirements. The first question is:

  • do we have to leave the elements of the vector in a particular order, or can we "mess" with them?

If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started. If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue?

Thirdly: as you remarked yourself, a simple nested linear search yields O(N^2) complexity on average, which is no good.

If < is already defined, then sorting is obvious, with its O(N log N) complexity. It might also be worth making T hashable, because a std::tr1::unordered_set could yield better time (I know, you need a RandomAccessIterator, but if T is hashable then it's easy to make T* hashable too ;) )

But in the end the real issue here is that our advice is necessarily generic, because we lack data.

  • What is T, and do you intend the algorithm to be generic?
  • What is the number of elements? 10, 100, 10,000, 1,000,000? Asymptotic complexity is kind of moot when dealing with a few hundred.
  • And of course: can you ensure uniqueness at insertion time? Can you modify the vector itself?
Fibered answered 5/5, 2010 at 6:49 Comment(0)

Well, your first one should only take O(N log N), so it's clearly the better worst-case choice for this application.

However, you should be able to get a better best case if you check as you add things to the set:

template <class T>
bool is_unique3(vector<T> X) {
  set<T> Y;
  typename vector<T>::const_iterator i;
  for(i=X.begin(); i!=X.end(); ++i) {
    if (Y.find(*i) != Y.end()) {
      return false;
    }
    Y.insert(*i);
  }
  return true;
}

This should have O(1) best case, O(N log(N)) worst case, and average case depends on the distribution of the inputs.

Pneumoconiosis answered 4/5, 2010 at 21:51 Comment(0)

If the type T you store in your vector is large and copying it is costly, consider creating a vector of pointers or iterators to your vector's elements. Sort it based on the elements pointed to, and then check for uniqueness.

You can also use std::set for that. The template looks like this:

template <class Key,class Traits=less<Key>,class Allocator=allocator<Key> > class set

I think you can provide an appropriate Traits parameter and insert raw pointers for speed, or implement a simple wrapper class for pointers with an operator<.

Don't use the constructor for inserting into the set; use the insert method instead. One of its overloads has the signature

pair <iterator, bool> insert(const value_type& _Val);

By checking the result (the second member), you can often detect a duplicate much more quickly than if you inserted all the elements first.

Ski answered 4/5, 2010 at 21:56 Comment(0)

In the (very) special case of discrete values with a known, not-too-big maximum value N, you can use a bucket count and simply check that the number of values in each bucket stays below 2.

bool is_unique(const vector<int>& X, int N) // assumes every value is in [0, N)
{
  vector<int> buckets(N, 0);
  vector<int>::const_iterator i;
  for(i = X.begin(); i != X.end(); ++i)
    if(++buckets[*i] > 1)
      return false;
  return true;
}

The complexity of this would be O(n).

Grosgrain answered 4/5, 2010 at 22:18 Comment(0)

Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better, as a hash set gives n·O(1) operations instead of n·O(log n) for a standard set. Of course everything will depend on the size of n and your particular library implementation.

Citystate answered 4/5, 2010 at 22:1 Comment(3)
A hashmap will give you big-theta of 1 and O(n^2).Horwitz
@wilhelmtell: That... does not sound right. Care to share your math? Inserting into a hashmap is supposed to be O(n) amortized. Assuming your hashmap has a way to detect collisions, you should know by the time you inserted the last element whether there's a collision. The only way I can think of to make it O (N^2) is if you assumed the vector checked for collisions on every insert (which I don't think was part of the question) and then only if it threw away the map after each update to the vector.Uranium
I have no clue what that means in my comment. That's O(n), and I blame the cat for that typo.Horwitz
