The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

Input: A positive integer K and a big text. The text can be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.

My thinking is like this.

  1. Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n+n*lg(n)+K). Since K is surely smaller than n, this is O(n*lg(n)).

We can improve this. Actually, we only want the top K words; the other words' frequencies are of no concern to us. So we can use "partial heap sorting". Instead of fully sorting in steps 2) and 3), we change them to:

2') Build a heap of (word, word-frequency) pairs with "word-frequency" as the key. It takes O(n) time to build the heap;

3') Extract the top K words from the heap. Each extraction is O(lg(n)), so this step takes O(K*lg(n)) in total.

To summarize, this solution costs O(n + K*lg(n)) time.
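
A minimal Java sketch of step 1) combined with 2') and 3') (names are illustrative; PriorityQueue.addAll costs O(n*lg(n)) here rather than the O(n) bottom-up heap construction described above, but the overall structure is the same):

import java.util.*;

public class TopKWords {
    // Returns the K most frequent words (ties broken arbitrarily).
    static List<String> topK(List<String> words, int k) {
        // Step 1: count frequencies, O(n)
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        // Step 2': heap over the (word, frequency) pairs, most frequent first
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        heap.addAll(freq.entrySet());
        // Step 3': extract the top K, O(K*lg(n))
        List<String> top = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++) {
            top.add(heap.poll().getKey());
        }
        return top;
    }
}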

This is just my thinking so far; I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.

Cachet answered 9/10, 2008 at 2:24 Comment(7)
Would you use merge sort or quicksort for the O(n*logn) sort?Finfoot
For practical uses, Aaron Maenpaa's answer of counting on a sample is best. It's not like the most frequent words will hide from your sample. For you complexity geeks, it's O(1) since the size of the sample is fixed. You don't get the exact counts, but you're not asking for them either.Vet
If what you want is a review of your complexity analysis, then I'd better mention: if n is the number of words in your text and m is the number of different words (types, we call them), step 1 is O(n), but step 2 is O(m.lg(m)), and m << n (you may have billions of words and not reach a million types, try it out). So even with a dummy algorithm, it's still O(n + m lg(m)) = O(n).Vet
Pls add an assumption to the question that we've enough main memory to hold all words of the big text. It would be interesting to see approaches to find k=100 words from 10GB file (i.e. all words won't fit in 4GB RAM)!!Snout
@Snout how would we do it if it exceeds RAM size?Retrocede
@Retrocede There are many external sorting algorithms available. en.wikipedia.org/wiki/…. could be a good starting point.Snout
I’m voting to close this question because it does not ask a question.Retired

This can be done in O(n) time

Solution 1:

Steps:

  1. Count the words and hash them, which will end up in a structure like this:

    var hash = {
      "I" : 13,
      "like" : 3,
      "meow" : 3,
      "geek" : 3,
      "burger" : 2,
      "cat" : 1,
      "foo" : 100,
      ...
      ...
    };
  2. Traverse the hash and find the highest count (in this case 100, for "foo"), then create an array of that size.

  3. Then traverse the hash again and use each word's number of occurrences as the array index; if there is nothing at that index yet, create a list there, otherwise append the word to it. We end up with an array like:

      0   1      2            3                  100
    [[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
    
  4. Then just traverse the array from the end and collect the k words (a sketch of the whole approach follows below).
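
A minimal Java sketch of this bucket idea (all names are illustrative; the counting from step 1 is included so the example is self-contained):

import java.util.*;

public class BucketTopK {
    static List<String> topK(List<String> words, int k) {
        // Step 1: count occurrences
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        // Step 2: size the bucket array by the highest count
        int maxCount = 0;
        for (int c : counts.values()) maxCount = Math.max(maxCount, c);
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxCount; i++) buckets.add(new ArrayList<>());
        // Step 3: the bucket index is the word's number of occurrences
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            buckets.get(e.getValue()).add(e.getKey());
        }
        // Step 4: walk the buckets from the high end and collect k words
        List<String> result = new ArrayList<>();
        for (int i = maxCount; i >= 0 && result.size() < k; i--) {
            for (String w : buckets.get(i)) {
                if (result.size() == k) break;
                result.add(w);
            }
        }
        return result;
    }
}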

Solution 2:

Steps:

  1. Same as above
  2. Use a min heap and keep its size at k. For each word in the hash, compare its number of occurrences with the heap's minimum: 1) if the heap already holds k entries and the count is greater than the minimum, remove the minimum and insert the new entry; 2) otherwise handle the remaining simple cases (e.g. just insert while the heap has fewer than k entries).
  3. After traversing the hash, convert the min heap to an array and return it (see the sketch below).
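
And a minimal Java sketch of the size-k min-heap variant, assuming the counts map from step 1 already exists:

import java.util.*;

public class HeapTopK {
    static List<String> topK(Map<String, Integer> counts, int k) {
        // Min heap on frequency, never holding more than k entries
        PriorityQueue<Map.Entry<String, Integer>> minHeap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (minHeap.size() < k) {
                minHeap.offer(e);                              // heap not full yet
            } else if (e.getValue() > minHeap.peek().getValue()) {
                minHeap.poll();                                // evict the current minimum
                minHeap.offer(e);
            }
        }
        // Dump the heap; the result is unordered, so sort by count if a ranking is needed
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : minHeap) result.add(e.getKey());
        return result;
    }
}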
Peroxidase answered 12/3, 2014 at 3:51 Comment(5)
Your solution (1) is an O(n) bucket sort replacing a standard O(n lg n) comparison sort. Your approach requires additional space for the bucket structure, but comparison sorts can be done in place. Your solution (2) runs in time O(n lg k) -- that is, O(n) to iterate over all words and O(lg k) to add each one into the heap.Forsaken
The first solution does require more space, but it is important to emphasize that it is in fact O(n) in time. 1: Hash frequencies keyed by word, O(n); 2: Traverse frequency hash, create second hash keyed by frequency. This is O(n) to traverse the hash and O(1) to add a word to the list of words at that frequency. 3: Traverse hash down from max frequency until you hit k. At most, O(n). Total = 3 * O(n) = O(n).Spiers
Typically when counting words, your number of buckets in solution 1 is widely overestimated (because the number one most frequent word is so much more frequent than the second and third best), so your array is sparse and inefficient.Vet
Your solution #1 doesn't work when k (the number of frequent words) is less than the number of occurrences of the most frequent word (i.e., 100 in this case). Of course, that might not happen in practice, but one should not assume!Gogh
@OneTwoThree the solution proposed is just an example. The number will be based on demand.Peroxidase

You're generally not going to get better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.

If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
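
A single-process Java sketch that simulates this flow (nMappers and nReducers are illustrative parameters; in a real deployment each stage would run on separate workers):

import java.util.*;

public class MapReduceTopKSketch {
    static List<Map.Entry<String, Integer>> topK(List<String> text, int k,
                                                 int nMappers, int nReducers) {
        // Map phase: each mapper counts frequencies on its own slice of the text
        List<Map<String, Integer>> mapperCounts = new ArrayList<>();
        int chunk = (text.size() + nMappers - 1) / nMappers;
        for (int m = 0; m < nMappers; m++) {
            int from = Math.min(m * chunk, text.size());
            int to = Math.min(from + chunk, text.size());
            Map<String, Integer> local = new HashMap<>();
            for (String w : text.subList(from, to)) local.merge(w, 1, Integer::sum);
            mapperCounts.add(local);
        }
        // Shuffle: each (word, partial count) goes to the reducer picked by the word's hash,
        // so all partial counts for a given word meet at the same reducer and are summed there
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int r = 0; r < nReducers; r++) reducers.add(new HashMap<>());
        for (Map<String, Integer> local : mapperCounts) {
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                int r = Math.floorMod(e.getKey().hashCode(), nReducers);
                reducers.get(r).merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Merge the reducers' outputs, order by total count, and take the top k
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (Map<String, Integer> shard : reducers) all.addAll(shard.entrySet());
        all.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}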

Kelseykelsi answered 9/10, 2008 at 7:55 Comment(0)

A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n+k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.

The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).

After selecting the Kth largest element, we partition the list around that element just as in quicksort. This is obviously also O(n). Everything on the "left" side of the pivot (the elements at least as frequent as the pivot) is in our group of K elements, so we're done (we can simply throw away everything else as we go along).

So this strategy is:

  1. Go through each word and insert it into a hash table: O(n)
  2. Select the Kth largest element: O(n)
  3. Partition around that element: O(n)

If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).

The O(n) time bound is optimal within a constant factor because we must examine each word at least once.

The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
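
A Java sketch of this strategy; for brevity it uses plain quickselect (expected O(n)) in place of the median-of-medians selection described above, which is what gives the worst-case O(n) guarantee. All names are illustrative:

import java.util.*;

public class SelectTopK {
    static List<String> topK(List<String> words, int k) {
        // Step 1: hash table of counts, O(n)
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1, Integer::sum);
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        if (k > entries.size()) k = entries.size();
        // Steps 2 and 3: move the k most frequent entries to the front
        if (k > 0) quickselect(entries, 0, entries.size() - 1, k - 1);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < k; i++) result.add(entries.get(i).getKey());
        return result;   // unranked; sort by count in O(k*lg(k)) if a ranking is needed
    }

    // Rearranges the list so that positions 0..target hold the (target+1) highest counts.
    static void quickselect(List<Map.Entry<String, Integer>> a, int lo, int hi, int target) {
        while (lo < hi) {
            int pivot = a.get(lo + (hi - lo) / 2).getValue();
            int i = lo, j = hi;
            while (i <= j) {                       // partition: higher counts to the left
                while (a.get(i).getValue() > pivot) i++;
                while (a.get(j).getValue() < pivot) j--;
                if (i <= j) { Collections.swap(a, i, j); i++; j--; }
            }
            if (target <= j) hi = j;               // target lies in the left part
            else if (target >= i) lo = i;          // target lies in the right part
            else return;                           // target sits among pivot-valued entries
        }
    }
}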

Ciapha answered 26/12, 2008 at 0:54 Comment(5)
When we select the Kth smallest element, what gets selected is the Kth smallest hash-key. It is not necessary that there are exactly K words in the left partition of Step 3.Gulden
You will not be able to run "medians of medians" on the hash table as it does swaps. You would have to copy the data from the hash table to an temp array. So, O(n) storage will be reqd.Connel
I don't understand how can you select the Kth smallest element in O(n) ?Rheotaxis
Check this out for the algorithm for finding Kth smallest element in O(n) - wikiwand.com/en/Median_of_mediansHynda
The complexity is same even if you use hash table + min heap. i don't see any optimization.Wellfound

If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.

Edit:

By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.

This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
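
One way to read "sample" concretely, sketched in Java under the assumption that the words are available as an in-memory list (for data that doesn't fit, you would sample pages or blocks instead, as described above):

import java.util.*;

public class SampledTopK {
    // Estimate the top-k words from a random sample; sampleSize is an illustrative parameter.
    static List<String> approxTopK(List<String> words, int k, int sampleSize, long seed) {
        Random rnd = new Random(seed);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < sampleSize; i++) {
            // sample with replacement from the full word list
            counts.merge(words.get(rnd.nextInt(words.size())), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) top.add(entries.get(i).getKey());
        return top;
    }
}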

Gosselin answered 9/10, 2008 at 2:26 Comment(2)
Sometimes you have to do this many times over, for example if you're trying to get the list of frequent words per website, or per subject. In that case, "without breaking a sweat" doesn't really cut it. You still need to find a way to do it as efficiently as possible.Reniti
+1 for a practical answer that doesn't address the irrelevant complexity issues. @itsadok: For each run: if it's big enough, sample it ; if it's not, then gaining a log factor is irrelevant.Vet

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.

This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].

If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

Brinson answered 9/10, 2008 at 3:22 Comment(3)
What you're describing is called a 'trie'.Kelseykelsi
Hi Strilanc. Can you explain the process of partition in details?Cachet
How does this not involve sorting? Once you have the trie, how do you pluck out the k words with the largest frequencies? It doesn't make any sense.Renz

You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so you should probably just optimize how the hash is built.

Kinfolk answered 26/12, 2008 at 0:33 Comment(0)

Your problem is the same as this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/

Use a trie and a min heap to solve it efficiently.

Talcahuano answered 25/1, 2015 at 17:35 Comment(0)

If what you're after is the list of k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.

Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.

As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).

Practically, the long step is counting.

Vet answered 5/5, 2015 at 22:49 Comment(0)
  1. Utilize a memory-efficient data structure (a trie) to store the words.
  2. Use a heap of size K to find the top K frequent words.

Here is the code

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;

public class TopKFrequentItems {

private int maxSize;

private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;

public TopKFrequentItems(int k) {
    this.maxSize = k;
    this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}

private Comparator<TrieEntry> maxHeapComparator() {
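    // Note: this comparator orders entries by ascending frequency, so the PriorityQueue
    // behaves as a min-heap whose head is the least frequent of the current top-k entries.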
    return new Comparator<TrieEntry>() {
        @Override
        public int compare(TrieEntry o1, TrieEntry o2) {
            return o1.frequency - o2.frequency;
        }           
    };
}

public void add(String word) {
    this.trie.insert(word);
}

public List<TopK> getItems() {

    for (TrieEntry trieEntry : this.trie.getAll()) {
        if (this.maxHeap.size() < this.maxSize) {
            this.maxHeap.add(trieEntry);
        } else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
            this.maxHeap.remove();
            this.maxHeap.add(trieEntry);
        }
    }
    List<TopK> result = new ArrayList<TopK>();
    for (TrieEntry entry : this.maxHeap) {
        result.add(new TopK(entry));
    }       
    return result;
}

public static class TopK {
    public String item;
    public int frequency;

    public TopK(String item, int frequency) {
        this.item = item;
        this.frequency = frequency;
    }
    public TopK(TrieEntry entry) {
        this(entry.word, entry.frequency);
    }
    @Override
    public String toString() {
        return String.format("TopK [item=%s, frequency=%s]", item, frequency);
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + frequency;
        result = prime * result + ((item == null) ? 0 : item.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        TopK other = (TopK) obj;
        if (frequency != other.frequency)
            return false;
        if (item == null) {
            if (other.item != null)
                return false;
        } else if (!item.equals(other.item))
            return false;
        return true;
    }

}   

}

Here is a unit test

@Test
public void test() {
    TopKFrequentItems stream = new TopKFrequentItems(2);

    stream.add("hell");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hero");
    stream.add("hero");
    stream.add("hero");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("home");
    stream.add("go");
    stream.add("go");
    assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}

For more details, refer to this test case.

Araroba answered 15/3, 2016 at 4:18 Comment(0)
  1. Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time. This is the same as everyone explained above.

  2. While inserting into the hashmap, keep a TreeSet (specific to Java, though there are implementations in every language) of size 10 (k=10) holding the top 10 frequent words. While its size is less than 10, keep adding entries. Once the size reaches 10, add a new entry only if it is greater than the minimum (i.e. first) element; if so, remove the minimum and insert the new element (see the sketch below).

To restrict the size of the TreeSet, see this link.
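
A minimal Java sketch of this bounded-TreeSet idea, assuming the frequency map from step 1 already exists (ties on frequency are broken by the word so that distinct words with equal counts are not silently merged):

import java.util.*;

public class TreeSetTopK {
    static List<String> topK(Map<String, Integer> counts, int k) {
        // Order by frequency, then by word, so two different words never compare as equal
        Comparator<Map.Entry<String, Integer>> byCountThenWord =
                Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue())
                          .thenComparing((Map.Entry<String, Integer> e) -> e.getKey());
        TreeSet<Map.Entry<String, Integer>> best = new TreeSet<>(byCountThenWord);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            best.add(e);
            if (best.size() > k) {
                best.pollFirst();          // drop the entry with the smallest frequency
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : best.descendingSet()) result.add(e.getKey());
        return result;
    }
}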

Supporting answered 11/4, 2017 at 3:31 Comment(0)

Suppose we have the word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold", and K=2. As you mentioned, "partitioning using the first letter of words" gives ("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold"); "then partitioning the largest multi-word set using the next character until you have k single-word sets" will partition ("boy", "big", "bad") and ("com" "come" "cold"), so the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.

Perhaps I misunderstand your point. Can you please detail your process about partition?

Cachet answered 9/10, 2008 at 11:44 Comment(0)

I believe this problem can be solved by an O(n) algorithm. We could do the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency as follows. Every time we increment a counter, we check its index in the ranked array and check whether its count now exceeds that of its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n), where n is the number of words in the original text.

Raddatz answered 13/6, 2013 at 3:3 Comment(3)
This is generally a good direction - but it has a flaw. When the count is increased, we won't just be checking "its predecessor", we'll need to check the "predecessors". For example, there is a big chance that the array will be [4,3,1,1,1,1,1,1,1,1,1] - there can be arbitrarily many 1's - which makes it less efficient, since we'll have to look back through all the predecessors to find the proper one to swap with.Chronogram
Wouldn't this in fact be way worse than O(n)? More like O(n^2) as it is essentially a rather inefficient sort?Marymarya
Hi Shawn. Yes, I agree with you. But I suspect that the problem you mentioned is fundamental to the problem. In fact, if instead of keeping just a sorted array of values, we go ahead and keep an array of (value, index) pairs, where the index points to the first occurrence of the repeated element, the problem should be solvable in O(n) time. For example, [4,3,1,1,1,1,1,1,1,1,1] will look like [(4,0), (3,1), (1,2), (1,2), ..., (1,2)]; the indices start from 0.Raddatz

I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word sits in the set at position X, with X being the current count of the word. In general, here's how it works:

  1. For each word, store it as part of a map of its occurrences: Map<String, Integer>.
  2. Then, based on the count, remove it from the previous count's set and add it to the new count's set.

The drawback of this is that the list may be big - it can be optimized by using a TreeMap<Integer, Set<String>> - but that adds some overhead. Ultimately we can use a mix of HashMap and our own data structure.

The code

public class WordFrequencyCounter {
    private static final int WORD_SEPARATOR_MAX = 32; // chars 0x00-0x20 (control chars and space) act as separators
    Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
    List<Set<String>> reverseCounters = new ArrayList<Set<String>>();

    private static class MutableCounter {
        int i = 1;
    }

    public List<String> countMostFrequentWords(String text, int max) {
        int lastPosition = 0;
        int length = text.length();
        for (int i = 0; i <= length; i++) {
            // treat the end of the text as a separator so the final word is also counted
            char c = (i == length) ? ' ' : text.charAt(i);
            if (c <= WORD_SEPARATOR_MAX) {
                if (i != lastPosition) {
                    String word = text.substring(lastPosition, i);
                    MutableCounter counter = counters.get(word);
                    if (counter == null) {
                        counter = new MutableCounter();
                        counters.put(word, counter);
                    } else {
                        Set<String> strings = reverseCounters.get(counter.i);
                        strings.remove(word);
                        counter.i ++;
                    }
                    addToReverseLookup(counter.i, word);
                }
                lastPosition = i + 1;
            }
        }

        List<String> ret = new ArrayList<String>();
        int count = 0;
        for (int i = reverseCounters.size() - 1; i >= 0; i--) {
            Set<String> strings = reverseCounters.get(i);
            for (String s : strings) {
                ret.add(s);
                System.out.print(s + ":" + i);
                count++;
                if (count == max) break;
            }
            if (count == max) break;
        }
        return ret;
    }

    private void addToReverseLookup(int count, String word) {
        while (count >= reverseCounters.size()) {
            reverseCounters.add(new HashSet<String>());
        }
        Set<String> strings = reverseCounters.get(count);
        strings.add(word);
    }

}
Chronogram answered 19/10, 2013 at 0:26 Comment(1)
Yes, instead of a list, we can go for a TreeMap keyed by the count, with an ArrayList of words as the value. We can then pull values out in descending key order.Haematinic

I just found another solution for this problem, but I am not sure it is right. Solution:

  1. Use a hash table to record all words' frequencies: T(n) = O(n)
  2. Choose the first k elements of the hash table and store them in a buffer (of size k). T(n) = O(k)
  3. For each of the remaining (n - k) elements of the hash table, compare it with the current minimum element of the buffer. If the hash table element is greater than the buffer's minimum, drop that minimum and add the hash table element. Finding the minimum in the buffer each time takes O(k), and traversing the rest of the hash table takes O(n - k), so the whole time complexity of this process is T(n) = O((n-k) * k).
  4. After traversing the whole hash table, the result is in the buffer.
  5. The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is generally much smaller than n, the time complexity of this solution is T(n) = O(kn), which is linear time when k is really small. Is it right? I am really not sure.
Welloiled answered 9/2, 2014 at 21:28 Comment(0)

Try to think of a special data structure to approach this kind of problem. In this case, a special kind of tree such as a trie, which stores strings in a specific way and is very efficient. A second way is to build your own solution by counting words: if this TB of data is in English, then there are only around 600,000 words in general use, so it would be possible to store just those words and count which strings are repeated; this solution would need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.

http://en.wikipedia.org/wiki/Trie
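
A minimal counting-trie sketch in Java (restricted to lower-case ASCII words purely as an assumption for illustration); the collected counts can then be fed into any of the top-k selection schemes above:

import java.util.*;

public class TrieCount {
    static class Node {
        Node[] next = new Node[26];
        int count;                    // number of times the word ending here occurred
    }

    Node root = new Node();

    void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (cur.next[i] == null) cur.next[i] = new Node();
            cur = cur.next[i];
        }
        cur.count++;
    }

    // Collect (word, count) pairs by walking the trie depth-first.
    void collect(Node node, StringBuilder prefix, Map<String, Integer> out) {
        if (node.count > 0) out.put(prefix.toString(), node.count);
        for (int i = 0; i < 26; i++) {
            if (node.next[i] != null) {
                prefix.append((char) ('a' + i));
                collect(node.next[i], prefix, out);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }
    }
}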

Frustule answered 9/10, 2014 at 12:30 Comment(0)

This is an interesting idea to explore, and I found this paper related to Top-K: https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

Also there is an implementation of it here.

Priam answered 4/3, 2015 at 13:22 Comment(1)
Your link returns 404.Eisler

Simplest code to get the most frequently used word.

function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length, temp = {};
    while(length--){
        if(temp[arr[length]] == undefined && arr[length].trim().length > 0){
            temp[arr[length]] = 1;
        }
        else if(arr[length].trim().length > 0){
            temp[arr[length]] = temp[arr[length]] + 1;
        }
    }
    console.log(temp);
    // index words by their count; note this keeps only one word per count value
    var max = [];
    for(var i in temp){
        max[temp[i]] = i;
    }
    // the last defined index is the highest count, so this is the most frequent word
    console.log(max[max.length - 1]);
    // if you want the second highest (may be undefined when counts are not consecutive)
    console.log(max[max.length - 2]);
}
Wonderland answered 8/10, 2015 at 12:35 Comment(0)

In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words by using a HashMap. Then I push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.

import java.io.*;
import java.util.*;

public class TopKWordsTextFile {

    static class SortObject implements Comparable<SortObject>{

        private String key;
        private int value;

        public SortObject(String key, int value) {
            super();
            this.key = key;
            this.value = value;
        }

        @Override
        public int compareTo(SortObject o) {
            //descending order
            return o.value - this.value;
        }
    }


    public static void main(String[] args) {
        HashMap<String,Integer> hm = new HashMap<>();
        int k = 1;
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));

            String line;
            while ((line = br.readLine()) != null) {
                // process the line.
                //System.out.println(line);
                String[] tokens = line.split(" ");
                for(int i=0; i<tokens.length; i++){
                    if(hm.containsKey(tokens[i])){
                        //If the key already exists
                        Integer prev = hm.get(tokens[i]);
                        hm.put(tokens[i],prev+1);
                    }else{
                        //If the key doesn't exist
                        hm.put(tokens[i],1);
                    }
                }
            }
            //Close the input
            br.close();
            //Print all words with their repetitions. You can use 3 for printing top 3 words.
            k = hm.size();
            // Copy the entries into an array of sortable objects
            SortObject[] objects = new SortObject[hm.size()];
            int index = 0;
            for (Map.Entry<String, Integer> e : hm.entrySet()) {
                //System.out.print("Key: "+e.getKey() + ": ");
                //System.out.println(" Value: "+e.getValue());
                objects[index] = new SortObject(e.getKey(), e.getValue());
                index++;
            }
            System.out.println();
            //Sort the array
            Arrays.sort(objects);
            //Print top k
            for(int j=0; j<k; j++){
                System.out.println(objects[j].key+":"+objects[j].value);
            }


        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.

Gris answered 25/9, 2016 at 19:33 Comment(2)
In which way does this improve on the approach sketched in the question? (Please do not leave out comments from the code presented on SE.) (I recommend to use Java built-in features like foreach loops and streams processing?)Bloodthirsty
As you know, one of the most important factors in designing an efficient algorithm is choosing the right data structure. Then, it is important how you approach the problem. For example, you need to attack a problem by divide and conquer. You need to attack another one by greedy. As you know Oracle company is working on Java. They are one of the best tech companies in the world. There are some of the most brilliant engineers working there on Java built-in features. So, these features are well-tested and bullet proof. If we can utilize them, is better to use them in my opinion.Gris

C++11 implementation of the above idea (shown for integers rather than words):

class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {

    unordered_map<int,int> map;
    for(int num : nums){
        map[num]++;
    }

    vector<int> res;
    // use a max-heap (priority_queue); the k values popped along the way turn out to be
    // the most frequent ones, and the (n - k) least frequent remain in the queue at the end
    // pair<first, second>: first is frequency,  second is number 
    priority_queue<pair<int,int>> pq; 
    for(auto it = map.begin(); it != map.end(); it++){
        pq.push(make_pair(it->second, it->first));

        // once the size exceeds n - k, pop the current maximum: that popped value is one of the top k frequent elements

        if(pq.size() > (int)map.size() - k){
            res.push_back(pq.top().second);
            pq.pop();
        }
    }
    return res;

}

};

Nerin answered 1/9, 2017 at 15:51 Comment(0)
