What's the most efficient way to erase duplicates and sort a vector?

I need to take a C++ vector with potentially a lot of elements, erase duplicates, and sort it.

I currently have the below code, but it doesn't work.

vec.erase(
      std::unique(vec.begin(), vec.end()),
      vec.end());
std::sort(vec.begin(), vec.end());

How can I correctly do this?

Additionally, is it faster to erase the duplicates first (as coded above) or to perform the sort first? If I do perform the sort first, is it guaranteed to remain sorted after std::unique is executed?

Or is there another (perhaps more efficient) way to do all this?

Mirepoix answered 25/6, 2009 at 0:28 Comment(5)
I assume you don't have the option of checking before insert to avoid having dupes in the first place?Jenness
Correct. That would be ideal.Mirepoix
I would suggest correcting the code above, or really indicate that it's WRONG. std::unique assumes the range is already sorted.Fiden
Using a set insteadShirring
You must first use sort and then erase+uniqueApices
741

I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

Convert to set (manually)

set<int> s;
unsigned size = vec.size();
for( unsigned i = 0; i < size; ++i ) s.insert( vec[i] );
vec.assign( s.begin(), s.end() );

Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

Here's how these perform as the number of duplicates changes:

comparison of vector and set approaches

Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.
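If you want to reproduce this kind of comparison, a minimal timing harness along these lines works (a sketch, not the exact code behind the graph; the 1,000,000-element size and the rand()%100 range are just illustrative choices):

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <set>
#include <vector>

int main()
{
    // 1,000,000 ints drawn from a small range => lots of duplicates
    std::vector<int> vec(1000000);
    for (int& x : vec) x = std::rand() % 100;
    std::vector<int> vec2(vec);

    auto t0 = std::chrono::steady_clock::now();
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
    auto t1 = std::chrono::steady_clock::now();

    std::set<int> s(vec2.begin(), vec2.end());
    vec2.assign(s.begin(), s.end());
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "sort+unique: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n";
    std::cout << "set ctor:    " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
}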

Triglyceride answered 25/6, 2009 at 2:45 Comment(24)
Why convert back to a vector?Conatus
I'm shocked that the constructor approach is consistently measurably worse than manual. You would think that apart from some tiny constant overhead, it would just do the manual thing. Can anyone explain this?Aubine
Question: Could this line in your first approach: vec.erase( unique( vec.begin(), vec.end() ), vec.end() ); be replaced with: vec.resize( unique (vec.begin(), vec.end()) - vec.begin() ) and if so, would it be any faster in a large vector?Vedavedalia
@Dan: Sure, I think resize would work. It didn't seem any faster on a little test data, but a more thorough experiment might prove otherwise.Triglyceride
Cool, thanks for the graph. Could you give a sense of what the units are for Number of Duplicates? (ie, around how large is "large enough")?Mirepoix
@Kyle: It's pretty large. I used datasets of 1,000,000 randomly drawn integers between 1 and 1000, 100, and 10 for this graph.Triglyceride
I agree with @Ari. What compiler was used for this?Electrophilic
@ViktorDahl: it's been a while, but I'm guessing it was some version of gcc on cygwin. I just tried again and saw a similar result with clang-3.2, but not with gcc-4.8, so it could very well be a compiler peculiarity.Triglyceride
@NateKohl Thank you for your reply. I found it quite surprising that the constructor method was slower. Even more surprising that it still is in 2013.Electrophilic
I found that for 5000 elements (on my machine), inserting into an unordered_set, then copying to a vector & sorting the vector is faster than using set.Cirque
I have a question about this statement: "vec.erase( unique( vec.begin(), vec.end() ), vec.end() );". Does the correctness of this one rely on the order of function argument evaluation? What if a compiler decides to evaluate vec.end() first then the unique() function?Pinebrook
@YangChi: argument evaluation order doesn't matter because std::unique only moves elements around without changing the physical container size. The value returned by std::unique is the new logical end of the container, but the container will still have (unspecified) values up to vec.end().Triglyceride
@NateKohl Right. I just realize the mistake I made. So obvious. :) Thank you!Pinebrook
I think your results are wrong. In my tests, the more duplicated elements there are, the faster vector (comparatively) is; it actually scales the other way around. Did you compile with optimizations on and runtime checks off? On my side vector is always faster, up to 100x depending on the number of duplicates. VS2013, cl /Ox -D_SECURE_SCL=0.Riplex
@davidnr: it's been a while, so I don't have the original setup around anymore. I seem to remember that a set-based approach only became better when the number of duplicates was fairly high -- have you tried an extreme case with e.g. almost all duplicates?Triglyceride
Yes, it is consistently faster, (I tried from rand()%2 to rand()%32000). Except in the case where the size of the object is large. I've been thinking, and the allocator you're using may improve performance on the set<> tests. Anyway set is faster when using large objects instead of int (bigger than 200 bytes in my tests). Also checked std::binary_search() and is faster on sorted vector than set::find().Riplex
Link to the image is broken, anyone has a new one?Eudemonics
@GiovanniFunchal: seems fine now, but I moved the image over to imgur.com just for safety's sake.Triglyceride
Is there a reason in the first experiment that the vector is first sorted, then made unique? It seems like there might be some performance to be gained if it were made unique first then sorted.Licentiate
Description of the x-axis seems to be missing.Biernat
This would be amazing info if we had the number of elements. Whenever I see this I think of imgs.xkcd.com/comics/convincing.pngElah
@Licentiate You have to sort before, since unique assumes the container is already sorted. (Or rather, it only removes adjacent duplicates!)Ironware
I also do not believe this is correct: jmeiners.com/efficient-programming-with-components/…Accident
@SaianshSingh I've rolled back the edit, it makes style in different snippets inconsistent.Pridemore
116

I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I added a new more efficient method, using an unordered_set.

Keep in mind that the unordered_set method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (The standard library provides a default hash which is simply the identity function.) Also, don't forget to sort at the end since unordered_set is, well, unordered :)
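If the element type is your own class, you have to supply the hash (and equality) yourself. A minimal sketch, where the Point type and its members are made up for illustration:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

struct Point { int x, y; };

struct PointHash {
    std::size_t operator()(const Point& p) const {
        // combine the members' hashes; good enough for a sketch
        return std::hash<int>()(p.x) * 31u + std::hash<int>()(p.y);
    }
};

struct PointEq {
    bool operator()(const Point& a, const Point& b) const {
        return a.x == b.x && a.y == b.y;
    }
};

int main() {
    std::vector<Point> vec{{1, 2}, {1, 2}, {3, 4}};
    std::unordered_set<Point, PointHash, PointEq> s(vec.begin(), vec.end());
    vec.assign(s.begin(), s.end());
    // don't forget the final sort (here by x, then y)
    std::sort(vec.begin(), vec.end(), [](const Point& a, const Point& b) {
        return a.x < b.x || (a.x == b.x && a.y < b.y);
    });
}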

I did some digging inside the set and unordered_set implementations and discovered that the constructor actually constructs a new node for every element, before checking its value to determine if it should actually be inserted (in the Visual Studio implementation, at least).

Here are the 5 methods:

f1: Just using vector, sort + unique

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

f2: Convert to set (using a constructor)

set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );

f3: Convert to set (manually)

set<int> s;
for (int i : vec)
    s.insert(i);
vec.assign( s.begin(), s.end() );

f4: Convert to unordered_set (using a constructor)

unordered_set<int> s( vec.begin(), vec.end() );
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

f5: Convert to unordered_set (manually)

unordered_set<int> s;
for (int i : vec)
    s.insert(i);
vec.assign( s.begin(), s.end() );
sort( vec.begin(), vec.end() );

I did the test with a vector of 100,000,000 ints chosen randomly in ranges [1,10], [1,1000], and [1,100000].

The results (in seconds, smaller is better):

range         f1       f2       f3       f4      f5
[1,10]      1.6821   7.6804   2.8232   6.2634  0.7980
[1,1000]    5.0773  13.3658   8.2235   7.6884  1.9861
[1,100000]  8.7955  32.1148  26.5485  13.3278  3.9822
Segno answered 29/6, 2014 at 14:31 Comment(10)
For integers, you can use radix sort, which is much faster than std::sort.Sender
Quick tip, to use sort or unique methods, you have to #include <algorithm>Hummingbird
@ChangmingSun I wonder why the optimizer seemed to fail on f4? The numbers are dramatically different to f5. It doesn't make any sense to me.Disentitle
@Disentitle As explained in my answer, the implementation builds a node (including a dynamic allocation) for each element from the input sequence, which is wasteful for every value that ends up being a duplicate. There is no way the optimizer could know it could skip that.Segno
Ah, that reminds me of one of Scott Meyers' talks about the CWUK scenario that has a nature of probabilities to slow down the emplace kind of construction.Disentitle
Interesting again that the manual conversion (f5) runs much faster than using a constructor (f4)!Fractionize
For a tuple-indexed vector, solution #1 is the fastestCammiecammy
This is not the same benchmark as in the other answer, because the number of duplicates is almost surely different. Unfortunately, neither answer actually says how many duplicates there are in the random data used for the benchmarks...Sava
@JakubKlinkovský NateKohl's answer has it in the comments and this answer has it in the text - they both tell you the number of elements and the range of values. With 1,000,000 random integers in [1,1000] you know that each entry has ~1000 duplicates.Memnon
@Memnon So this answer says I did the test with a vector of 100,000,000 ints chosen randomly in ranges [1,10], [1,1000], and [1,100000] and the comment for the other answer says I used datasets of 1,000,000 randomly drawn integers between 1 and 1000, 100, and 10 for this graph. Even if we assume that both benchmarks drawn random numbers with the same probability, they are different benchmarks with different numbers of duplicates.Sava
77

std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first before it will work as you intend.

std::unique is defined to be stable, so the vector will still be sorted after running unique on it.
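In other words, the code from the question works once the two statements are swapped:

std::sort(vec.begin(), vec.end());
vec.erase(
      std::unique(vec.begin(), vec.end()),
      vec.end());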

Elevator answered 25/6, 2009 at 0:34 Comment(0)
44

I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:

std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already

Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.

Rejoinder answered 25/6, 2009 at 1:2 Comment(7)
Well to the point! std::set is specified to be a sorted unique set. Most implementations use an efficient ordered binary tree or something similar.Yerkovich
+1 Thought of set as well. Didn't want to duplicate this answerMalina
Is std::set guaranteed to be sorted? It makes sense that in practice it would be, but does the standard require it?Drynurse
Yup, see 23.1.4.9 "The fundamental property of iterators of associative containers is that they iterate through the containers in the non-descending order of keys where non-descending is defined by the comparison that was used to construct them"Rejoinder
@MadCoder: It doesn't necessarily "make sense" that a set is implemented in a way that is sorted. There are also sets implemented using hash tables, which are not sorted. In fact, most people prefer using hash tables when available. But the naming convention in C++ just so happens that the sorted associative containers are named simply "set" / "map" (analogous to TreeSet / TreeMap in Java); and the hashed associative containers, which were left out of the standard, are called "hash_set" / "hash_map" (SGI STL) or "unordered_set" / "unordered_map" (TR1) (analogous to HashSet and HashMap in Java)Conjunctivitis
@Todd Can anyone please confirm or unconfirm the following statement? If you want stable and happy with n.log(n) go for std::set ( or from scratch code, red and black tree implementation ). If you are OK with unstable and happy with log(n) go for std::unordered_set ( or from scratch code, hash-table implementation ). Note that n.log(n) and and log(n) are related to computation in my statement and I assume plenty of space.Disraeli
@Disraeli 'Stable' and 'unstable' aren't relevant with unique sets. The difference between set and unordered_set is the sorting; if you need to access the elements in sorted order, then use set. If sorting is unimportant to you, use unordered_set (or for custom types your decision might be driven by which operators are implemented). If you need average low-latency lookup AND sorted traversal, use both. I'm not sure what log(n) you mean for the second one. Construction a set is nlog(n), but an unordeded_set would be on average n. Also construction is generally less relevant than find/insert/deleteRejoinder
24

std::unique only works on consecutive runs of duplicate elements, so you'd better sort first. However, it is stable, so your vector will remain sorted.

Reflex answered 25/6, 2009 at 0:32 Comment(0)
24

Here's a template to do it for you:

template<typename T>
void removeDuplicates(std::vector<T>& vec)
{
    std::sort(vec.begin(), vec.end());
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}

call it like:

removeDuplicates<int>(vectorname);
Gastroscope answered 25/6, 2009 at 3:2 Comment(5)
+1 Templatize away! - but you can just write removeDuplicates(vec), without explicitly specifying the template argumentsTeniafuge
Or even better, just have it take templated iterators directly (begin and end), and you can run it on other structures besides a vector.Mirepoix
Hells yeah, templates! quick fix for small lists, full STL style. +1 thxPickmeup
@Kyle - only on other containers that have an erase() method, else you have to return the new end iterator and have the calling code truncate the container.Fingered
This is dangerous. It has the side-effect of sorting which is not obvious from the name.Junkie
11

If you do not want to change the order of elements, then you can try this solution:

template <class T>
void RemoveDuplicatesInVector(std::vector<T>& vec)
{
    std::set<T> values;  // requires <set>; tracks the values seen so far
    vec.erase(std::remove_if(vec.begin(), vec.end(),
                             [&](const T& value) { return !values.insert(value).second; }),
              vec.end());
}
Greensand answered 31/7, 2015 at 14:36 Comment(2)
Perhaps use unordered_set instead of set (and boost::remove_erase_if if available)Fernandefernandel
Not sure this is valid. std::ranges::remove_if has pred constrained to std::indirect_binary_predicate, subsuming std::predicate, whose component regular_invocable "requires [it...] be equality-preserving" - & depending on state like values makes it not equality-preserving. I would presume the same applies to the old remove_if as you use.Goering
9

Efficiency is a complicated concept. There are time-versus-space considerations, as well as general measurements (where you only get vague answers such as O(n)) vs. specific ones (e.g. bubble sort can be much faster than quicksort, depending on input characteristics).

If you have relatively few duplicates, then sort followed by unique and erase seems the way to go. If you have relatively many duplicates, creating a set from the vector and letting it do the heavy lifting could easily beat it.

Don't just concentrate on time efficiency either. Sort+unique+erase operates in O(1) space, while the set construction operates in O(n) space. And neither directly lends itself to a map-reduce parallelization (for really huge datasets).

Ginsburg answered 25/6, 2009 at 2:11 Comment(2)
What would give you map/reduce ability? The only one I can think of is a distributed merge sort and you can still only use one thread in the final merge.Anzus
Yes, you must have one controlling node/thread. However, you can divide the problem as many times as required to place upper limits on the number of worker/child threads the controlling/parent thread deals with, and on the size of the dataset each leaf node must process. Not all problems are easily solvable with map-reduce, I simply wanted to point out there are people who deal with similar (on the surface, anyway) optimization issues, where dealing with 10 terabytes of data is called "Tuesday".Ginsburg
9

Assuming that a is a vector, remove the contiguous duplicates using

a.erase(unique(a.begin(), a.end()), a.end());

This runs in O(n) time.

Rennet answered 6/5, 2019 at 12:14 Comment(1)
contiguous duplicates. ok, so it needs a std::sort first.Hildick
7

You need to sort it before you call unique because unique only removes duplicates that are next to each other.

edit: 38 seconds...

Eigenvalue answered 25/6, 2009 at 0:32 Comment(0)
7

unique only removes consecutive duplicate elements (which is necessary for it to run in linear time), so you should perform the sort first. It will remain sorted after the call to unique.

Zurheide answered 25/6, 2009 at 0:33 Comment(0)
7

You can do this as follows:

std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end()), v.end());
Bakerman answered 5/11, 2017 at 1:22 Comment(0)
5

With the Ranges v3 library, you can simply use

action::unique(vec);

Note that it actually removes the duplicate elements, not just moves them. (Like std::unique, it only collapses adjacent duplicates, so the vector still has to be sorted first.)

Unfortunately, actions weren't standardized in C++20 the way other parts of the ranges library were, so you still have to use the original range-v3 library even in C++20.
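For the full sort-then-dedupe in one statement, recent range-v3 releases spell it roughly like this (the actions namespace has moved between versions, so treat the exact names as a sketch):

#include <range/v3/action/sort.hpp>
#include <range/v3/action/unique.hpp>
#include <vector>

int main()
{
    std::vector<int> vec{3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
    // actions mutate the container in place and compose with |=
    vec |= ranges::actions::sort | ranges::actions::unique;
}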

Brunei answered 10/7, 2019 at 0:11 Comment(1)
No actions in C++20, unfortunately.Restricted
3

As already stated, unique requires a sorted container. Additionally, unique doesn't actually remove elements from the container. Instead, the unique elements are moved to the front of the range, unique returns an iterator pointing to the new logical end, and everything from there to the old end is left with unspecified values; you are expected to call erase to actually remove those elements.
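For example:

auto newEnd = std::unique(vec.begin(), vec.end()); // unique elements are now at the front
vec.erase(newEnd, vec.end());                      // physically remove the leftover tail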

Again answered 25/6, 2009 at 0:56 Comment(3)
Does unique require a sorted container, or does it simply only rearrange the input sequence so it contains no adjacent duplicates? I thought the latter.Ginsburg
@Pate, you're correct. It doesn't require one. It removes adjacent duplicates.Jamnis
If you have a container that may have duplicates, and you want a container that doesn't have any duplicated values anywhere in the container then you must first sort the container, then pass it to unique, and then use erase to actually remove the duplicates. If you simply want to remove adjacent duplicates, then you won't have to sort the container. But you will end up with duplicated values: 1 2 2 3 2 4 2 5 2 will be changed to 1 2 3 2 4 2 5 2 if passed to unique without sorting, 1 2 3 4 5 if sorted, passed to unique and erase.Again
1

The standard approach suggested by Nate Kohl, just using vector, sort + unique:

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );

doesn't work for a vector of pointers.

Look carefully at this example on cplusplus.com.

In their example, the "so-called duplicates" moved to the end are actually shown as ? (undefined values), because sometimes those "so-called duplicates" are extra elements, and sometimes elements that were in the original vector are missing.

A problem occurs when using std::unique() on a vector of pointers to objects (memory leaks, bad reads of heap data, duplicate frees that cause segmentation faults, and so on).

Here's my solution to the problem: replace std::unique() with ptgi::unique().

See the file ptgi_unique.hpp below:

// ptgi::unique()
//
// Fix a problem in std::unique(), such that none of the original elts in the collection are lost or duplicated.
// ptgi::unique() has the same interface as std::unique()
//
// There is the 2 argument version which calls the default operator== to compare elements.
//
// There is the 3 argument version, which you can pass a user defined functor for specialized comparison.
//
// ptgi::unique() is an improved version of std::unique() which doesn't lose any of the original data
// in the collection, nor does it create duplicates.
//
// After ptgi::unique(), every old element in the original collection is still present in the re-ordered collection,
// except that duplicates have been moved to a contiguous range [dupPosition, last) at the end.
//
// Thus on output:
//  [begin, dupPosition) range are unique elements.
//  [dupPosition, last) range are duplicates which can be removed.
// where:
//  [] means inclusive, and
//  () means exclusive.
//
// In the original std::unique() non-duplicates at end are moved downward toward beginning.
// In the improved ptgi:unique(), non-duplicates at end are swapped with duplicates near beginning.
//
// In addition if you have a collection of ptrs to objects, the regular std::unique() will leak memory,
// and can possibly delete the same pointer multiple times (leading to SEGMENTATION VIOLATION on Linux machines)
// but ptgi::unique() won't.  Use valgrind(1) to find such memory leak problems!!!
//
// NOTE: IF you have a vector of pointers, that is, std::vector<Object*>, then upon return from ptgi::unique()
// you would normally do the following to get rid of the duplicate objects in the HEAP:
//
//  // delete objects from HEAP
//  std::vector<Object*> objects;
//  for (iter = dupPosition; iter != objects.end(); ++iter)
//  {
//      delete (*iter);
//  }
//
//  // shrink the vector. The Object* pointers are NOT followed for duplicate deletes; this just shrinks vector.size()
//  objects.erase(dupPosition, objects.end());
//
// NOTE: But if you have a vector of objects, that is: std::vector<Object>, then upon return from ptgi::unique(), it
// suffices to just call vector::erase(), as erase will run the destructor of each object in the
// [dupPosition, end) range for you:
//
//  std::vector<Object> objects;
//  objects.erase(dupPosition, last);
//
//==========================================================================================================
// Example of differences between std::unique() vs ptgi::unique().
//
//  Given:
//      int data[] = {10, 11, 21};
//
//  Given this functor: ArrayOfIntegersEqualByTen:
//      A functor which compares two integers a[i] and a[j] in an int a[] array, after division by 10:
//  
//  // given an int data[] array, remove consecutive duplicates from it.
//  // functor used for std::unique (BUGGY) or ptgi::unique(IMPROVED)
//
//  // Two numbers equal if, when divided by 10 (integer division), the quotients are the same.
//  // Hence 50..59 are equal, 60..69 are equal, etc.
//  struct ArrayOfIntegersEqualByTen: public std::equal_to<int>
//  {
//      bool operator() (const int& arg1, const int& arg2) const
//      {
//          return ((arg1/10) == (arg2/10));
//      }
//  };
//  
//  Now, if we call (problematic) std::unique( data, data+3, ArrayOfIntegersEqualByTen() );
//  
//  TEST1: BEFORE UNIQ: 10,11,21
//  TEST1: AFTER UNIQ: 10,21,21
//  DUP_INX=2
//  
//      PROBLEM: 11 is lost, and extra 21 has been added.
//  
//  More complicated example:
//  
//  TEST2: BEFORE UNIQ: 10,20,21,22,30,31,23,24,11
//  TEST2: AFTER UNIQ: 10,20,30,23,11,31,23,24,11
//  DUP_INX=5
//  
//      Problem: 21 and 22 are deleted.
//      Problem: 11 and 23 are duplicated.
//  
//  
//  NOW if ptgi::unique is called instead of std::unique, both problems go away:
//  
//  DEBUG: TEST1: NEW_WAY=1
//  TEST1: BEFORE UNIQ: 10,11,21
//  TEST1: AFTER UNIQ: 10,21,11
//  DUP_INX=2
//  
//  DEBUG: TEST2: NEW_WAY=1
//  TEST2: BEFORE UNIQ: 10,20,21,22,30,31,23,24,11
//  TEST2: AFTER UNIQ: 10,20,30,23,11,31,22,24,21
//  DUP_INX=5
//
//  @SEE: look at the "case study" below to understand why the last "AFTER UNIQ" ends up in that order:
//  TEST2: AFTER UNIQ: 10,20,30,23,11,31,22,24,21
//
//==========================================================================================================
// Case Study: how ptgi::unique() works:
//  Remember we "remove adjacent duplicates".
//  In this example, the input is NOT fully sorted when ptgi:unique() is called.
//
//  I put | separators in the BEFORE UNIQ line to illustrate this
//  10  | 20,21,22 |  30,31 |  23,24 | 11
//
//  In example above, 20, 21, 22 are "same" since dividing by 10 gives 2 quotient.
//  And 30,31 are "same", since /10 quotient is 3.
//  And 23, 24 are same, since /10 quotient is 2.
//  And 11 is "group of one" by itself.
//  So there are 5 groups, but the 4th group (23, 24) happens to be equal to group 2 (20, 21, 22)
//  So there are 5 groups, and the 5th group (11) is equal to group 1 (10)
//
//  R = result
//  F = first
//
//  10, 20, 21, 22, 30, 31, 23, 24, 11
//  R    F
//
//  10 is result, and first points to 20, and R != F (10 != 20) so bump R:
//       R
//       F
//
//  Now we hit the "optimized-out swap" logic.
//  (avoid swap because R == F)
//
//  // now bump F until R != F (integer division by 10)
//  10, 20, 21, 22, 30, 31, 23, 24, 11
//       R   F              // 20 == 21 in 10x
//       R       F              // 20 == 22 in 10x
//       R           F          // 20 != 30, so we do a swap of ++R and F
//  (Now first moves past 21 and 22, then finally hits 30, which differs from R, so we bump R to 21 and swap it with 30)
//  10, 20, 30, 22, 21, 31, 23, 24, 11  // after R & F swap (21 and 30)
//           R       F 
//
//  10, 20, 30, 22, 21, 31, 23, 24, 11
//           R          F           // bump F to 31, but R and F are same (30 vs 31)
//           R               F      // bump F to 23, R != F, so swap ++R with F
//  10, 20, 30, 22, 21, 31, 23, 24, 11
//                  R           F       // bump R to 22
//  10, 20, 30, 23, 21, 31, 22, 24, 11  // after the R & F swap (22 & 23 swap)
//                  R            F      // will swap 22 and 23
//                  R                F      // bump F to 24, but R and F are same in 10x
//                  R                    F  // bump F, R != F, so swap ++R  with F
//                      R                F  // R and F are diff, so swap ++R  with F (21 and 11)
//  10, 20, 30, 23, 11, 31, 22, 24, 21
//                      R                F  // after the swap of the old 21 and 11
//                      R                  F    // F now at last(), so loop terminates
//                          R               F   // bump R by 1 to point to dupPosition (first duplicate in range)
//
//  return R which now points to 31
//==========================================================================================================
// NOTES:
// 1) the #ifdef IMPROVED_STD_UNIQUE_ALGORITHM documents how we have modified the original std::unique().
// 2) I've heavily unit tested this code, including using valgrind(1), and it is *believed* to be 100% defect-free.
//
//==========================================================================================================
// History:
//  130201  dpb [email protected] created
//==========================================================================================================

#ifndef PTGI_UNIQUE_HPP
#define PTGI_UNIQUE_HPP

// Created to solve memory leak problems when calling std::unique() on a vector<Route*>.
// Memory leaks discovered with valgrind and unitTesting.


#include <algorithm>        // std::swap

// instead of std::myUnique, call this instead, where arg3 is a function ptr
//
// like std::unique, it puts the dups at the end, but it uses swapping to preserve original
// vector contents, to avoid memory leaks and duplicate pointers in vector<Object*>.

#ifdef IMPROVED_STD_UNIQUE_ALGORITHM
#error the #ifdef for IMPROVED_STD_UNIQUE_ALGORITHM was defined previously.. Something is wrong.
#endif

#undef IMPROVED_STD_UNIQUE_ALGORITHM
#define IMPROVED_STD_UNIQUE_ALGORITHM

// similar to std::unique, except that this version swaps elements, to avoid
// memory leaks, when vector contains pointers.
//
// Normally the input is sorted.
// Normal std::unique:
// 10 20 20 20 30   30 20 20 10
// a  b  c  d  e    f  g  h  i
//
// 10 20 30 20 10 | 30 20 20 10
// a  b  e  g  i    f  g  h  i
//
// Now GONE: c, d.
// Now DUPS: g, i.
// This causes memory leaks and segmentation faults due to duplicate deletes of the same pointer!


namespace ptgi {

// Return the position of the first in range of duplicates moved to end of vector.
//
// uses operator==  of class for comparison
//
// @param [first, last) is a range to find duplicates within.
//
// @return the dupPosition position, such that [dupPosition, end) are contiguous
// duplicate elements.
// IF all items are unique, then it would return last.
//
template <class ForwardIterator>
ForwardIterator unique( ForwardIterator first, ForwardIterator last)
{
    // compare iterators, not values
    if (first == last)
        return last;

    // remember the current item that we are looking at for uniqueness
    ForwardIterator result = first;

    // result is slow ptr where to store next unique item
    // first is  fast ptr which is looking at all elts

    // the first iterator moves over all elements [begin+1, end).
    // while the current item (result) is the same as all elts
    // to the right, (first) keeps going, until you find a different
    // element pointed to by *first.  At that time, we swap them.

    while (++first != last)
    {
        if (!(*result == *first))
        {
#ifdef IMPROVED_STD_UNIQUE_ALGORITHM
            // inc result, then swap *result and *first

//          THIS IS WHAT WE WANT TO DO.
//          BUT THIS COULD SWAP AN ELEMENT WITH ITSELF, UNNECESSARILY!!!
//          std::swap( *first, *(++result));

            // BUT avoid swapping with itself when both iterators are the same
            ++result;
            if (result != first)
                std::swap( *first, *result);
#else
            // original code found in std::unique()
            // copies unique down
            *(++result) = *first;
#endif
        }
    }

    return ++result;
}

template <class ForwardIterator, class BinaryPredicate>
ForwardIterator unique( ForwardIterator first, ForwardIterator last, BinaryPredicate pred)
{
    if (first == last)
        return last;

    // remember the current item that we are looking at for uniqueness
    ForwardIterator result = first;

    while (++first != last)
    {
        if (!pred(*result,*first))
        {
#ifdef IMPROVED_STD_UNIQUE_ALGORITHM
            // inc result, then swap *result and *first

//          THIS COULD SWAP WITH ITSELF UNNECESSARILY
//          std::swap( *first, *(++result));
//
            // BUT avoid swapping with itself when both iterators are the same
            ++result;
            if (result != first)
                std::swap( *first, *result);

#else
            // original code found in std::unique()
            // copies unique down
            // causes memory leaks, and duplicate ptrs
            // and unnecessarily moves elements in place!
            *(++result) = *first;
#endif
        }
    }

    return ++result;
}

// from now on, the #define is no longer needed, so get rid of it
#undef IMPROVED_STD_UNIQUE_ALGORITHM

} // end ptgi:: namespace

#endif

And here is the UNIT Test program that I used to test it:

// QUESTION: in test2, I had trouble getting one line to compile, which was caused by the declaration of operator()
// in the equal_to Predicate.  I'm not sure how to correctly resolve that issue.
// Look for //OUT lines
//
// Make sure that NOTES in ptgi_unique.hpp are correct, in how we should "cleanup" duplicates
// from both a vector<Integer> (test1()) and vector<Integer*> (test2).
// Run this with valgrind(1).
//
// In test2(), IF we use the call to std::unique(), we get this problem:
//
//  [dbednar@ipeng8 TestSortRoutes]$ ./Main7
//  TEST2: ORIG nums before UNIQUE: 10, 20, 21, 22, 30, 31, 23, 24, 11
//  TEST2: modified nums AFTER UNIQUE: 10, 20, 30, 23, 11, 31, 23, 24, 11
//  INFO: dupInx=5
//  TEST2: uniq = 10
//  TEST2: uniq = 20
//  TEST2: uniq = 30
//  TEST2: uniq = 33427744
//  TEST2: uniq = 33427808
//  Segmentation fault (core dumped)
//
// And if we run valgrind we see various errors about "read errors", "mismatched free", "definitely lost", etc.
//
//  valgrind --leak-check=full ./Main7
//  ==359== Memcheck, a memory error detector
//  ==359== Command: ./Main7
//  ==359== Invalid read of size 4
//  ==359== Invalid free() / delete / delete[]
//  ==359== HEAP SUMMARY:
//  ==359==     in use at exit: 8 bytes in 2 blocks
//  ==359== LEAK SUMMARY:
//  ==359==    definitely lost: 8 bytes in 2 blocks
// But once we replace the call in test2() to use ptgi::unique(), all valgrind() error messages disappear.
//
// 130212   dpb [email protected] created
// =========================================================================================================

#include <iostream> // std::cout, std::cerr
#include <string>
#include <vector>   // std::vector
#include <sstream>  // std::ostringstream
#include <algorithm>    // std::unique()
#include <functional>   // std::equal_to(), std::binary_function()
#include <cassert>  // assert() MACRO

#include "ptgi_unique.hpp"  // ptgi::unique()



// Integer is small "wrapper class" around a primitive int.
// There is no SETTER, so Integers are IMMUTABLE, just like in JAVA.

class Integer
{
private:
    int num;
public:

    // default CTOR: "Integer zero;"
    // COMPREHENSIVE CTOR:  "Integer five(5);"
    Integer( int num = 0 ) :
        num(num)
    {
    }

    // COPY CTOR
    Integer( const Integer& rhs) :
        num(rhs.num)
    {
    }

    // assignment, operator=, needs nothing special... since all data members are primitives

    // GETTER for 'num' data member
    // GETTERs are *always* const
    int getNum() const
    {
        return num;
    }   

    // NO SETTER, because IMMUTABLE (similar to Java's Integer class)

    // @return "num"
    // NB: toString() should *always* be a const method
    //
    // NOTE: it is probably more efficient to call getNum() instead
    // of toString() when printing a number:
    //
    // BETTER to do this:
    //  Integer five(5);
    //  std::cout << five.getNum() << "\n"
    // than this:
    //  std::cout << five.toString() << "\n"

    std::string toString() const
    {
        std::ostringstream oss;
        oss << num;
        return oss.str();
    }
};

// convenience typedef's for iterating over std::vector<Integer>
typedef std::vector<Integer>::iterator      IntegerVectorIterator;
typedef std::vector<Integer>::const_iterator    ConstIntegerVectorIterator;

// convenience typedef's for iterating over std::vector<Integer*>
typedef std::vector<Integer*>::iterator     IntegerStarVectorIterator;
typedef std::vector<Integer*>::const_iterator   ConstIntegerStarVectorIterator;

// functor used for std::unique or ptgi::unique() on a std::vector<Integer>
// Two numbers equal if, when divided by 10 (integer division), the quotients are the same.
// Hence 50..59 are equal, 60..69 are equal, etc.
struct IntegerEqualByTen: public std::equal_to<Integer>
{
    bool operator() (const Integer& arg1, const Integer& arg2) const
    {
        return ((arg1.getNum()/10) == (arg2.getNum()/10));
    }
};

// functor used for std::unique or ptgi::unique on a std::vector<Integer*>
// Two numbers equal if, when divided by 10 (integer division), the quotients are the same.
// Hence 50..59 are equal, 60..69 are equal, etc.
struct IntegerEqualByTenPointer: public std::equal_to<Integer*>
{
    // NB: the Integer*& looks funny to me!
    // TECHNICAL PROBLEM ELSEWHERE so had to remove the & from *&
//OUT   bool operator() (const Integer*& arg1, const Integer*& arg2) const
//
    bool operator() (const Integer* arg1, const Integer* arg2) const
    {
        return ((arg1->getNum()/10) == (arg2->getNum()/10));
    }
};

void test1();
void test2();
void printIntegerStarVector( const std::string& msg, const std::vector<Integer*>& nums );

int main()
{
    test1();
    test2();
    return 0;
}

// test1() uses a vector<Object> (namely vector<Integer>), so there is no problem with memory loss
void test1()
{
    int data[] = { 10, 20, 21, 22, 30, 31, 23, 24, 11};

    // turn C array into C++ vector
    std::vector<Integer> nums(data, data+9);

    // arg3 is a functor
    IntegerVectorIterator dupPosition = ptgi::unique( nums.begin(), nums.end(), IntegerEqualByTen() );

    nums.erase(dupPosition, nums.end());

    // dupPosition is invalidated by the erase above, so just clear the remaining uniques
    nums.clear();
}

//==================================================================================
// test2() uses a vector<Integer*>, so after ptgi:unique(), we have to be careful in
// how we eliminate the duplicate Integer objects stored in the heap.
//==================================================================================
void test2()
{
    int data[] = { 10, 20, 21, 22, 30, 31, 23, 24, 11};

    // turn C array into C++ vector of Integer* pointers
    std::vector<Integer*> nums;

    // put data[] integers into equivalent Integer* objects in HEAP
    for (int inx = 0; inx < 9; ++inx)
    {
        nums.push_back( new Integer(data[inx]) );
    }

    // print the vector<Integer*> to stdout
    printIntegerStarVector( "TEST2: ORIG nums before UNIQUE", nums );

    // arg3 is a functor
#if 1
    // corrected version which fixes SEGMENTATION FAULT and all memory leaks reported by valgrind(1)
    // I THINK we want to use new C++11 cbegin() and cend(), since the equal_to predicate is passed "Integer *&"

//  DID NOT COMPILE
//OUT   IntegerStarVectorIterator dupPosition = ptgi::unique( const_cast<ConstIntegerStarVectorIterator>(nums.begin()), const_cast<ConstIntegerStarVectorIterator>(nums.end()), IntegerEqualByTenPointer() );

    // DID NOT COMPILE when equal_to predicate declared "Integer*& arg1, Integer*&  arg2"
//OUT   IntegerStarVectorIterator dupPosition = ptgi::unique( const_cast<nums::const_iterator>(nums.begin()), const_cast<nums::const_iterator>(nums.end()), IntegerEqualByTenPointer() );


    // okay when equal_to predicate declared "Integer* arg1, Integer*  arg2"
    IntegerStarVectorIterator dupPosition = ptgi::unique(nums.begin(), nums.end(), IntegerEqualByTenPointer() );
#else
    // BUGGY version that causes SEGMENTATION FAULT and valgrind(1) errors
    IntegerStarVectorIterator dupPosition = std::unique( nums.begin(), nums.end(), IntegerEqualByTenPointer() );
#endif

    printIntegerStarVector( "TEST2: modified nums AFTER UNIQUE", nums );
    int dupInx = dupPosition - nums.begin();
    std::cout << "INFO: dupInx=" << dupInx <<"\n";

    // delete the dup Integer* objects in the [dupPosition, end] range
    for (IntegerStarVectorIterator iter = dupPosition; iter != nums.end(); ++iter)
    {
        delete (*iter);
    }

    // shrink the vector
    // NB: the Integer* ptrs are NOT followed by vector::erase()
    nums.erase(dupPosition, nums.end());


    // print the uniques, by following the iter to the Integer* pointer
    for (IntegerStarVectorIterator iter = nums.begin(); iter != nums.end();  ++iter)
    {
        std::cout << "TEST2: uniq = " << (*iter)->getNum() << "\n";
    }

    // remove the unique objects from heap
    for (IntegerStarVectorIterator iter = nums.begin(); iter != nums.end();  ++iter)
    {
        delete (*iter);
    }

    // shrink the vector
    nums.erase(nums.begin(), nums.end());

    // the vector should now be completely empty
    assert( nums.size() == 0);
}

//@ print to stdout the string: "info_msg: num1, num2, .... numN\n"
void printIntegerStarVector( const std::string& msg, const std::vector<Integer*>& nums )
{
    std::cout << msg << ": ";
    int inx = 0;
    ConstIntegerStarVectorIterator  iter;

    // use const iterator and const range!
    // NB: cbegin() and cend() not supported until LATER (c++11)
    for (iter = nums.begin(), inx = 0; iter != nums.end(); ++iter, ++inx)
    {
        // output a comma separator *AFTER* the first element
        if (inx > 0)
            std::cout << ", ";

        // call Integer::toString()
        std::cout << (*iter)->getNum();     // send int to stdout
//      std::cout << (*iter)->toString();   // also works, but is probably slower

    }

    // in conclusion, add newline
    std::cout << "\n";
}
Sympathy answered 13/2, 2013 at 0:6 Comment(7)
I don't understand the rationale here. So if you have a container of pointers, and you want to remove duplicates, how does that affect the objects pointed to by the pointers? No memory leaks would happen because there is at least one pointer (and exactly one in this container) that points to them. Well, well, I guess your method might have some merit with some strange overloaded operators or strange comparison functions that do require special consideration.Royceroyd
Not sure if I understand your point. Take a simple case of a vector<int*>, where the 4 pointers point to integers {1, 2, 2, 3}. It's sorted, but after you call std::unique, the 4 pointers are pointers to integers {1, 2, 3, 3}. Now you have two identical pointers to 3, so if you call delete, it does a duplicate delete. BAD! Second, notice that the 2nd 2 is missing, a memory leak.Sympathy
kccqzy, here's the example program for you to understand my answer better:Sympathy
@joe: Even if after std::unique you had [1, 2, 3, 2] you can't call delete on 2 as that would leave a dangling pointer to 2! => Simply don't call delete on the elements between newEnd = std::unique and std::end as you still have pointers to these elements in [std::begin, newEnd)!Proa
vector<T*> with owning pointers is usually an anti-pattern in C++11. unique works fine with vector<unique_ptr<T>>.Filtrate
@ArneVogel: For trivial values of "works fine", perhaps. It's rather pointless to call unique on a vector<unique_ptr<T>>, as the only duplicated value such a vector can contain is nullptr.Devereux
@BenVoigt: Overload operator == or pass a BinaryPredicate. The default behavior would be pointless indeed.Filtrate
1

If you are looking for performance and using std::vector, I recommend the one that this documentation link provides.

std::vector<int> myvector{10,20,20,20,30,30,20,20,10};             // 10 20 20 20 30 30 20 20 10
std::sort(myvector.begin(), myvector.end() );
const auto it = std::unique (myvector.begin(), myvector.end());    // 10 20 30 ?  ?  ?  ?  ?  ?
                                                                   //          ^
myvector.resize( std::distance(myvector.begin(),it) ); // 10 20 30
Duality answered 15/12, 2017 at 21:36 Comment(1)
cplusplus.com is not in any way official documentation.Chuppah
1

More understandable code from: https://en.cppreference.com/w/cpp/algorithm/unique

#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <cctype>

int main() 
{
    // remove duplicate elements
    std::vector<int> v{1,2,3,1,2,3,3,4,5,4,5,6,7};
    std::sort(v.begin(), v.end()); // 1 1 2 2 3 3 3 4 4 5 5 6 7 
    auto last = std::unique(v.begin(), v.end());
    // v now holds {1 2 3 4 5 6 7 x x x x x x}, where 'x' is indeterminate
    v.erase(last, v.end()); 
    for (int i : v)
      std::cout << i << " ";
    std::cout << "\n";
}

Output:

1 2 3 4 5 6 7
Lecture answered 10/9, 2018 at 0:27 Comment(0)
1
void removeDuplicates(std::vector<int>& arr) {
    if (arr.empty())
        return;

    // simple O(n^2) exchange sort to bring duplicates together
    for (int i = 0; i < arr.size(); i++)
    {
        for (int j = i + 1; j < arr.size(); j++)
        {
            if (arr[i] > arr[j])
            {
                int temp = arr[i];
                arr[i] = arr[j];
                arr[j] = temp;
            }
        }
    }

    // copy one element from each run of equal values into y
    std::vector<int> y;
    int x = arr[0];
    int i = 0;
    while (i < arr.size())
    {
        if (x != arr[i])
        {
            y.push_back(x);
            x = arr[i];
        }
        i++;
        if (i == arr.size())
            y.push_back(arr[i - 1]);
    }
    arr = y;
}
Bultman answered 31/10, 2019 at 15:28 Comment(1)
Welcome to StackOverflow! Please edit your answer to add an explanation of how your code works, and why it's equivalent or better than the other answers. This question is more than ten years old, and already has many good, well-explained answers. Without an explanation on yours, it's not as useful and has a good chance of being downvoted or removed.Viccora
1

The following simple method of tracking iterator positions has time complexity O(nlogn) and space complexity O(1), and seems not to have been mentioned earlier:

#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

void removeDups(vector<int>& vect)
{
  int u = 0;
  sort(vect.begin(), vect.end());
  for (int i = 1; i < vect.size(); ) {
    if (vect[u] == vect[i]) {
      ++i;
    }
    else {
      ++u;
      swap(vect[u],vect[i]);
      ++i;
    }
  }
  vect.erase(vect.begin() + u + 1, vect.end());
}


int main()
{
  vector<int> vect{ 3, 5, 7, 2, 2, 5, 7, 7, 9, 2, 3, 5, 7, 9, 3, 5, 7, 9 };
  removeDups(vect);
  for (int i = 0; i < vect.size(); i++)
    cout << vect[i] << " ";
  return 0;
}
Schizopod answered 12/3, 2024 at 10:37 Comment(2)
That approach has worst-case time complexity O(n^2), since vector::erase has complexity O(n). Beyond that, it’s the same as other answers.Applesauce
@Applesauce Thanks! I have changed the logic to use swap while in that for loop of O(n) and invoke erase only after getting out of the loop. This should make the worst case to be O(nlog(n)).Schizopod
0
std::set<int> s;
std::for_each(v.cbegin(), v.cend(), [&s](int val){ s.insert(val); });
v.clear();
std::copy(s.cbegin(), s.cend(), std::back_inserter(v)); // needs <iterator>; you can't write through v.cbegin()
Exsanguinate answered 11/11, 2013 at 7:13 Comment(1)
perhaps resize the vector after clearing it so that there is only 1 memory allocation when building the vector. Maybe prefer std::move instead of std::copy to move the ints into the vector instead of copying them since the set will not be needed later.Poock
0

About alexK7's benchmarks: I tried them and got similar results, but when the range of values is 1 million the cases using std::sort (f1) and std::unordered_set (f5) produce similar times. When the range of values is 10 million, f1 is faster than f5.

If the range of values is limited and the values are unsigned int, it is possible to use a std::vector<bool> whose size corresponds to the given range. Here is the code:

void DeleteDuplicates_vector_bool(std::vector<unsigned>& v, unsigned range_size)
{
    std::vector<bool> v1(range_size);
    for (auto& x: v)
    {
       v1[x] = true;    
    }
    v.clear();

    unsigned count = 0;
    for (auto& x: v1)
    {
        if (x)
        {
            v.push_back(count);
        }
        ++count;
    }
}
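
A usage sketch (every value must be smaller than range_size):

std::vector<unsigned> v{3, 1, 3, 0, 1};
DeleteDuplicates_vector_bool(v, 4);
// v is now {0, 1, 3} -- sorted as a side effect, since the flags are scanned in order
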
Beeson answered 16/6, 2015 at 18:18 Comment(0)
0

If your class is easily convertible to an int, and you have some memory to spare, unique can be done without sorting first, and it's much faster:

#include <vector>
#include <stdlib.h>
#include <time.h>      // clock()
#include <iostream>    // std::cout
#include <algorithm>
int main (int argc, char* argv []) {
  //vector init
  std::vector<int> v (1000000, 0);
  std::for_each (v.begin (), v.end (), [] (int& s) {s = rand () %1000;});
  std::vector<int> v1 (v);
  int beg (0), end (0), duration (0);
  beg = clock ();
  {
    std::sort (v.begin (), v.end ());
    auto i (v.begin ());
    i = std::unique (v.begin (), v.end ());
    if (i != v.end ()) v.erase (i, v.end ());
  }
  end = clock ();
  duration = (int) (end - beg);
  std::cout << "\tduration sort + unique == " << duration << std::endl;

  int n (0);
  duration = 0;
  beg = clock ();
  std::for_each (v1.begin (), v1.end (), [&n] (const int& s) {if (s >= n) n = s+1;});
  std::vector<int> tab (n, 0);
  {
    auto i (v1.begin ());
    std::for_each (v1.begin (), v1.end (), [&i, &tab] (const int& s) {
      if (!tab [s]) {
        *i++ = s;
        ++tab [s];
      }
    });
    std::sort (v1.begin (), i);
    v1.erase (i, v1.end ());
  }
  end = clock ();
  duration = (int) (end - beg);
  std::cout << "\tduration unique + sort == " << duration << std::endl;
  if (v == v1) {
    std::cout << "and results are same" << std::endl;
  }
  else {
    std::cout << "but result differs" << std::endl;
  }  
}

Typical results:

duration sort + unique == 38985
duration unique + sort == 2500
and results are same

Ileneileo answered 23/4, 2021 at 20:19 Comment(0)
0

Most of the answers use O(n log n) approaches, but with an unordered_set we can decrease it to O(n) on average. I saw some of the solutions using sets, but I found this one, and it seems more elegant to use a set and iterators.

using Intvec = std::vector<int>;

void remove(Intvec &v) {
    // write position for the next unique element
    Intvec::iterator itr = v.begin();
    std::unordered_set<int> s;
    // loop from the beginning to the end of the vector
    for (auto curr = v.begin(); curr != v.end(); ++curr) {
        if (s.insert(*curr).second) { // true if *curr was not already in the set
            *itr++ = *curr; // keep this element and advance the write position
        }
    }
    // erase the leftover tail of repeated elements
    v.erase(itr, v.end());
}
Paba answered 6/12, 2021 at 3:15 Comment(0)
0

Depends on the use case. If you expect fewer than about 100 unique positive integer values and you have a CPU that supports the AVX-512F instruction set, then you can insert elements at a rate of ~15 clock cycles per element, or 300-500 million inserts per second, using a simple comparison against a small look-up table.

The following implementation uses CPU registers for value lookups with ~50 unique values and the L1 cache for ~1000 unique values. With the L1-cache version, it takes about 160 clock cycles per insertion, which is equivalent to about ~25M insertions per second and is slower than using std::set. With only 4 unique values, it inserts at a rate of 5.8 cycles per element, which is more than 500M/s.

//g++  7.4.0
// time measurement taken from another answer
// valid C99 and C++

#include <stdint.h>  // <cstdint> is preferred in C++, but stdint.h works.

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
     _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
     _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

// requires a Nehalem or newer CPU.  Not Core2 or earlier.  IDK when AMD added it.
inline
uint64_t readTSCp() {
    unsigned dummy;
    return __rdtscp(&dummy);  // waits for earlier insns to retire, but allows later to start
}



#include <iostream>

template<int n>
struct FastUnique
{
    public:
    FastUnique()
    {
         it=0;
         for(int i=0;i<n;i++)
             dict[i]=-1;
    }

    void insert(const int val)
    {
        if(!test(dict,val))
            dict[it++]=val;
    }

    const int get(const int index)
    {
        return dict[index];
    }

    const int size()
    {
        return it;
    }

    private:
    int dict[n];
    int it;
    bool test(const int * dict, const int val)
    {
        int c=0;
        for(int i=0;i<n;i++)
            c+=(dict[i]==val);
        return c>0;
    }
};

int main()
{
    std::cout << "Hello, world!\n";
    const int n=500000000;

    FastUnique<64> fastSet;

    auto t= readTSC();

    for(int i=0;i<n;i++)
        fastSet.insert(i&63);

    auto t2=readTSC();

    std::cout<<(t2-t)/(double)n<<"cycles per iteration"<<std::endl;
   
    for(int i=0;i<fastSet.size();i++)
        std::cout<<fastSet.get(i)<<std::endl;
    
    return 0;
}
Mulhouse answered 26/3, 2022 at 21:28 Comment(0)
0

That's the problem with naming in the std... std::unique should rather be called std::trim_consecutive_duplicates imho; that would make it clear that you need to sort the vector first to have elements with the same value adjacent to each other. In this case I doubt that anything set-related is faster when arriving from a vector, but if you have the opportunity to put everything into a set from the start, you should definitely do that.

Here's a modern C++20 example (Demo):

#include <algorithm>
#include <iostream>
#include <ranges>
#include <vector>
#include <cstdint>

namespace rng = std::ranges;

int main() {
    std::vector<uint32_t> myvec = { 255, 1,3, 16, 5,6, 1, 3, 3, 255, 300 };

    rng::sort(myvec);
    const auto [first, last] = rng::unique(myvec);
    myvec.erase(first, last);

    // Print resulting vector
    std::cout << "my unique vector = {";
    rng::for_each(myvec, [](uint32_t val){ std::cout << val << ", "; });
    std::cout << "}" << std::endl;
}

Output:

my unique vector = {1, 3, 5, 6, 16, 255, 300, }
Windowshop answered 30/5, 2023 at 8:8 Comment(0)
-1

Here's an example of the duplicate-delete problem that occurs with std::unique(). On a LINUX machine, the program crashes. Read the comments for details.

// Main10.cpp
//
// Illustration of duplicate delete and memory leak in a vector<int*> after calling std::unique.
// On a LINUX machine, it crashes the program because of the duplicate delete.
//
// INPUT : {1, 2, 2, 3}
// OUTPUT: {1, 2, 3, 3}
//
// The two 3's are actually pointers to the same 3 integer in the HEAP, which is BAD
// because if you delete both int* pointers, you are deleting the same memory
// location twice.
//
//
// Never mind the fact that we ignore the "dupPosition" returned by std::unique(),
// but in any sensible program that "cleans up after itself" you want to call delete
// on all int* pointers to avoid memory leaks.
//
//
// NOW IF you replace std::unique() with ptgi::unique(), all of the problems disappear.
// Why? Because ptgi:unique merely reshuffles the data:
// OUTPUT: {1, 2, 3, 2}
// The ptgi:unique has swapped the last two elements, so all of the original elements in
// the INPUT are STILL in the OUTPUT.
//
// 130215   [email protected]
//============================================================================

#include <iostream>
#include <vector>
#include <algorithm>
#include <functional>

#include "ptgi_unique.hpp"

// functor used by std::unique to remove adjacent elts from vector<int*>
struct EqualToVectorOfIntegerStar: public std::equal_to<int *>
{
    bool operator() (const int* arg1, const int* arg2) const
    {
        return (*arg1 == *arg2);
    }
};

void printVector( const std::string& msg, const std::vector<int*>& vnums);

int main()
{
    int inums [] = { 1, 2, 2, 3 };
    std::vector<int*> vnums;

    // convert C array into vector of pointers to integers
    for (size_t inx = 0; inx < 4; ++ inx)
        vnums.push_back( new int(inums[inx]) );

    printVector("BEFORE UNIQ", vnums);

    // INPUT : 1, 2A, 2B, 3
    std::unique( vnums.begin(), vnums.end(), EqualToVectorOfIntegerStar() );
    // OUTPUT: 1, 2A, 3, 3 }
    printVector("AFTER  UNIQ", vnums);

    // now we delete 3 twice, and we have a memory leak because 2B is not deleted.
    for (size_t inx = 0; inx < vnums.size(); ++inx)
    {
        delete(vnums[inx]);
    }
}

// print a line of the form "msg: 1,2,3,..,5,6,7\n", where 1..7 are the numbers in vnums vector
// PS: you may pass "hello world" (const char *) because of implicit (automatic) conversion
// from "const char *" to std::string conversion.

void printVector( const std::string& msg, const std::vector<int*>& vnums)
{
    std::cout << msg << ": ";

    for (size_t inx = 0; inx < vnums.size(); ++inx)
    {
        // insert comma separator before current elt, but ONLY after first elt
        if (inx > 0)
            std::cout << ",";
        std::cout << *vnums[inx];

    }
    std::cout << "\n";
}
Sympathy answered 15/2, 2013 at 18:13 Comment(4)
PS: I also ran "valgrind ./Main10", and valgrind found no problems. I HIGHLY recommend all C++ programmers using LINUX to use this very productive tool, esp if you are writing real-time applications that must run 24x7, and never leak or crash!Sympathy
The heart of the problem with std::unique can be summed up by this statement: "std::unique returns duplicates in unspecified state"!!!!! Why the standards committee did this, I'll never know. Committee members... any comments???Sympathy
Yes, "std::unique returns duplicates in unspecified state". So, simply don't rely on a array that has been "uniqued" to manually manage memory! The simplest way to do this is to use std::unique_ptr instead of raw pointers.Segno
This appears to be a response to a different answer; it does not answer the question (in which the vector contains integers, not pointers, and does not specify a comparator).Fingered
-3
void EraseVectorRepeats(vector<int>& v) {
TOP:
    for (int y = 0; y < v.size(); ++y) {
        for (int z = 0; z < v.size(); ++z) {
            if (y == z) { // skip self-comparison so one copy of each value survives
                continue;
            }
            if (v[y] == v[z]) {
                v.erase(v.begin() + z); // erasing shifts the indices, so restart from the top
                goto TOP;
            }
        }
    }
}

This is a function that I created that you can use to delete repeats. The only header needed is <vector>.

Knee answered 10/4, 2018 at 0:30 Comment(0)
