Computing the mode (most frequent element) of a set in linear time?

In the book "The Algorithm Design Manual" by Skiena, computing the mode (most frequent element) of a set is said to have an Ω(n log n) lower bound (this puzzles me), and also (correctly, I guess) no faster worst-case algorithm is said to exist for computing the mode. I'm only puzzled by the lower bound being Ω(n log n).

See the relevant page of the book on Google Books.

But surely the mode could in some cases be computed in linear time (best case), e.g. by Java code like the below (which finds the most frequent character in a string), the "trick" being to count occurrences using a hash table. This seems obvious.

So, what am I missing in my understanding of the problem?

EDIT: (Mystery solved) As StriplingWarrior points out, the lower bound holds only if comparisons are the sole operation used, i.e. no indexing of memory; see also: http://en.wikipedia.org/wiki/Element_distinctness_problem

import java.util.HashMap;

// Linear time
char computeMode(String input) {
  // initialize currentMode to the first char (assumes non-empty input)
  char[] chars = input.toCharArray();
  char currentMode = chars[0];
  int currentModeCount = 0;
  HashMap<Character, Integer> counts = new HashMap<Character, Integer>();
  for (char character : chars) {
    int count = putget(counts, character); // occurrences so far
    // test whether character should be the new currentMode
    if (count > currentModeCount) {
      currentMode = character;
      currentModeCount = count; // also save the count
    }
  }
  return currentMode;
}

// Constant time (amortized)
int putget(HashMap<Character, Integer> map, char character) {
  if (!map.containsKey(character)) {
    // if character not seen before, initialize its count to zero
    map.put(character, 0);
  }
  // increment the stored count and return the new value
  int newValue = map.get(character) + 1;
  map.put(character, newValue);
  return newValue;
}
Tum answered 12/11, 2010 at 20:6 Comment(6)
Nothing seems to be mentioned in the errata list: cs.sunysb.edu/~skiena/algorist/book/errata – Tum
Cannot read the page. Some outlandish message, apparently Danish. – Dockage
Change the google.dk to google.com, and it'll work. – Porta
Fixed the link to go to google.com :) – Tum
Never mind hash maps; they're distracting because of dubious complexity claims. Consider the problem of finding the most frequent element in a sequence consisting only of "0"s and "1"s. Obviously that's linear time: just count the things (likewise you can sort them in linear time, the simplest possible case of a bucket sort). As StriplingWarrior says, it's the comparison version of the problem that has this bound, just as comparison sort has an Ω(n log n) lower bound. Presumably when the book defines its terms earlier, it restricts the discussion somehow. – Dominicdominica
... otherwise it wouldn't say "sorts the set in O(n log n) time", because of course fixed-width integers of size k bits can be sorted in O(k·n) time, for example with a binary radix sort. So when it talks about "numbers", it doesn't mean int, or any other fixed-size numeric type. – Dominicdominica
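To make the comment's 0/1 example concrete, here is a minimal sketch (an added illustration, not code from the book or the thread; the method name is made up): with a two-symbol alphabet, the mode falls out of two counters, with no element-to-element comparisons at all.

char modeOfBits(String bits) {
  // assumes bits contains only the characters '0' and '1'
  int[] counts = new int[2];
  for (char c : bits.toCharArray()) {
    counts[c - '0']++; // bucket by the digit itself: no comparisons between elements
  }
  return counts[1] > counts[0] ? '1' : '0';
}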

The author seems to be basing his logic on the assumption that comparison is the only operation available to you. Using a hash-based data structure sort of gets around this by reducing the likelihood of needing to do comparisons in most cases, to the point where each lookup is effectively constant time and the whole computation is basically linear.

However, if the numbers were hand-picked to always produce hash collisions, you would end up effectively turning your hash table into a list, which would make your algorithm O(n²). As the author points out, simply sorting the values into a list first provides the best guaranteed algorithm, even though in most cases a hash table would be preferable.
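For reference, here is a minimal sketch of the sort-then-scan approach (my illustration, not the answer author's code; assumes a non-empty input): after sorting, equal values are adjacent, so one linear pass finds the longest run.

int sortedMode(Integer[] values) {
  Integer[] sorted = values.clone();
  java.util.Arrays.sort(sorted); // mergesort-based for Object[]: O(n log n) worst case
  int mode = sorted[0], run = 1, bestRun = 1;
  for (int i = 1; i < sorted.length; i++) {
    // count the current run of equal, adjacent values
    run = sorted[i].equals(sorted[i - 1]) ? run + 1 : 1;
    if (run > bestRun) {
      bestRun = run;
      mode = sorted[i];
    }
  }
  return mode;
}

(The boxed Integer[] is deliberate: Arrays.sort on a primitive int[] is a dual-pivot quicksort whose theoretical worst case is quadratic, while the Object[] overload guarantees O(n log n).)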

Porta answered 12/11, 2010 at 20:26 Comment(5)
@Skipperkongen: The author actually uses Big-O notation when talking about finding the mode. He says "there is no faster worst-case algorithm for computing the mode" than the O(n log n) algorithm, and we know this because the problem of testing uniqueness in a set can be shown to have an Ω(n log n) lower bound. – Porta
I accept that the best guaranteed algorithm is O(n log n). But do you agree that it is incorrect that element uniqueness has an Ω(n log n) lower bound? – Tum
The wiki page for element distinctness actually mentions that the bound holds for "the algebraic computation tree model", which bans using elements to index memory: en.wikipedia.org/wiki/Element_distinctness_problem – Tum
So you were correct to say that it is assumed that comparison is the only operation available :) That is the bit that confused me, because in practice you often do index memory :) – Tum
@Skipperkongen: Yes, that Wikipedia article pretty well states that the problem's complexity has Θ(n log n) as an upper and lower bound, unless you know something specific about the data that allows you to make optimizations like a bucket sort. – Porta

So, what am I missing in my understanding of the problem?

In many particular cases, an array or hash table suffices. In "the general case" it does not, because hash table access is not always constant time.

In order to guarantee constant-time access, you must be able to guarantee that the number of keys that can possibly end up in each bin is bounded by some constant. For characters this is fairly easy, but if the set elements were, say, doubles or strings, it would not be (except in the purely academic sense that there are, e.g., only finitely many distinct double values).
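For the character case this answer mentions, a sketch (an added illustration, not the answer author's code; the method name is made up): because the key domain is bounded, a plain array indexed by the key itself replaces the hash map, making each update genuinely O(1).

char arrayMode(String input) {
  // one counter per possible char value: a bounded key domain, so no hashing needed
  int[] counts = new int[Character.MAX_VALUE + 1];
  char mode = input.charAt(0); // assumes non-empty input, like the question's code
  for (char c : input.toCharArray()) {
    if (++counts[c] > counts[mode]) {
      mode = c;
    }
  }
  return mode;
}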

Homothermal answered 12/11, 2010 at 20:40 Comment(0)

Hash table lookups are amortized constant time, i.e., in general, the overall cost of looking up n random keys is O(n). In the worst case, however, each lookup can take linear time. Therefore, while hashing in general reduces the mode calculation to O(n), in the worst case it increases it to O(n²).
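As a concrete illustration of that worst case (a sketch added here, not part of the original answer; the class name is made up), keys with a constant hashCode force every entry into a single bucket:

class CollidingKey {
  final int id;
  CollidingKey(int id) { this.id = id; }
  @Override public int hashCode() { return 42; } // every key lands in the same bucket
  @Override public boolean equals(Object o) {
    return o instanceof CollidingKey && ((CollidingKey) o).id == id;
  }
}

In a plain chained hash table, n insertions of such keys cost O(n²) in total. (Java 8's HashMap mitigates this by converting long buckets into balanced trees, but the general worst-case point stands.)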

Matte answered 12/11, 2010 at 21:4 Comment(0)
