Can the Duplicate Characters in a string be Identified and Quantified in O(n)?
Asked Answered
W

2

10

This comment suggests that there is a O(n) alternative to my O(n log n) solution to this problem:

Given string str("helloWorld") the expected output is:

l = 3
o = 2

My solution was to do this:

sort(begin(str), end(str));

for(auto start = adjacent_find(cbegin(str), cend(str)), finish = upper_bound(start, cend(str), *start); start != cend(str); start = adjacent_find(finish, cend(str)), finish = upper_bound(start, cend(str), *start)) {
   cout << *start << " = " << distance(start, finish) << endl;
}

Which is obviously limited by the sorting of str. I think this would require a bucket sort solution? Is there anything more clever that I'm missing?

Wang answered 2/1, 2018 at 16:8 Comment(13)
Build a hash-table from characters to counters, and increment counters for each characters in the string. This is O(n). Then, iterate over all entries where the value is greater than 1 in the table to find characters which appear more than once (this is O(m), with m <= n).Brouwer
@Brouwer I think you're suggesting an unordered_multiset which has quadratic construction time :(Wang
@JonathanMee -- Why a multiset? All you need is a map of characters to count. A std::unordered_map<char, int> fits.Edessa
No, more like a en.cppreference.com/w/cpp/container/unordered_mapBrouwer
@Brouwer That also has quadratic construction time.Wang
What's wrong with the old-fashioned int storage[CHAR_MAX - CHAR_MIN + 1]; ?Pericope
@Edessa unordered_multiset will populate without the need for me to populate it separately, and since size isn't an issue here that seemed easier.Wang
The map is empty at first, so that would be one of the constant constructors (marked (1)) from en.cppreference.com/w/cpp/container/unordered_map/unordered_mapBrouwer
@Pericope Derp. I don't know what I'm thinking. Can you just type that up, and I'll accept?Wang
@JonathanMee: Done. Although I tie myself up in knots a little over the whole - CHAR_MIN thing.Pericope
@Pericope Nothing wrong as long as bytes and chars are the same thing :-)Brouwer
@Pericope I see, thanks. I was more thinking about UTF-8, etc. but since the question is about std::string, it is fair to use an array of chars.Brouwer
@Bathsheba: You mostly replaced std::sort by counting_sort.Best
P
13

Here's one way, which is O(N) at the expense of maintaining storage for every possible char value.

#include <string>
#include <limits.h> // for CHAR_MIN and CHAR_MAX. Old habits die hard.

int main()
{
    std::string s("Hello World");        
    int storage[CHAR_MAX - CHAR_MIN + 1] = {};
    for (auto c : s){
        ++storage[c - CHAR_MIN];
    }

    for (int c = CHAR_MIN; c <= CHAR_MAX; ++c){
        if (storage[c - CHAR_MIN] > 1){
            std::cout << (char)c << " " << storage[c - CHAR_MIN] << "\n";
        }
    }    
}

This portable solution is complicated by the fact that char can be signed or unsigned.

Pericope answered 2/1, 2018 at 16:33 Comment(3)
Use numeric_limits: #48063662 that should make it portable.Wang
What does int storage[CHAR_MAX - CHAR_MIN + 1] = {}; help with exactly?Veracity
Sets up an array of appropriate size and initialises every element to 0. The C guys still have to write {0}. Pity them!Pericope
W
3

Here is what @bathsheba mentioned and with improvements by @Holt:

#include <string>
#include <climits>
#include <iostream>

void show_dup(const std::string& str) {
    const int sz = CHAR_MAX - CHAR_MIN + 1;
    int all_chars[sz] = { 0 };
    // O(N), N - the length of input string
    for(char c : str) {
        int idx = (int)c;
        all_chars[idx]++;
    }
    // O(sz) - constant. For ASCII char it will be 256
    for(int i = 0; i < sz; i++) {
        if (all_chars[i] > 1) {
            std::cout << (char)i << " = " << all_chars[i] << std::endl;
        }
    }
}

int main()
{
  std::string str("helloWorld");

  show_dup(str);
}
Wiser answered 2/1, 2018 at 16:31 Comment(7)
Would be better if this was valid c++ (int all_chars[sz]) without a using namespace std; on top.Popinjay
@Holt, May I ask what's wrong with my C++ code? It compiles and there are no warnings in pedantic mode.Wiser
Variable-length arrays are not standard C++, it's a C99 features or a compiler-specific extension. You get warning from both gcc and clang for this, e.g. "warning: ISO C++ forbids variable length array 'all_chars' [-Wvla]" from gcc.Popinjay
@Holt, note the way I have it is a compile-time evaluable constant expression.Pericope
@Pericope In your code yes, not here, unless you add a constexpr qualifier to sz.Popinjay
@Holt: Absolutely!Pericope
@Holt, seems you're right, thanks. But I was coding in online editor, cpp.sh/3n4dc, in C++ 14 mode. And GCC did not complain about anything in pedantic mode.Wiser

© 2022 - 2024 — McMap. All rights reserved.