Efficient algorithm to find most common phrases in a large volume of text
I am thinking about writing a program to collect for me the most common phrases in a large volume of text. Had the problem been reduced to just finding words, then that would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing every possible phrase of a sentence as a key seems infeasible.

Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
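To illustrate, here is the brute-force version of that idea for a bounded phrase length, as a Python sketch (the function name and the length bounds are just for illustration, and whitespace tokenisation is assumed):

    # Brute force: count every phrase of 2..4 words in a hash map.
    # Memory grows with the number of distinct n-grams, which is exactly
    # what makes this infeasible for unbounded phrase lengths.
    from collections import Counter

    def count_ngrams(words, min_len=2, max_len=4):
        counts = Counter()
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts

    words = "the cat sat on the mat and the cat sat down".split()
    for phrase, count in count_ngrams(words).most_common(3):
        print(" ".join(phrase), count)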

Purposive answered 27/10, 2013 at 18:49 Comment(5)
Maybe you could look into something like a trie, where a node also stores its occurrence count and a path down the trie forms a phrase? – Cute
Taking your last paragraph as the real question, maybe your problem is just defining what a phrase is. If that's the question, consider a natural language processing tool like NLTK. In that context, an object that extracts phrases is called a "chunker". – Lir
How long is a phrase? The algorithm is pretty much the same whether you're doing one-word phrases or 10-word phrases. The only difference is the amount of data you have to process. – Hospital
Have a look here: #186197 – Karimakarin
This problem is similar to collocation extraction, since a phrase is a collocation of several words. – Glosseme
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").

If so, then I would recommend the following linear-time approach (all four steps are sketched in code after the list):

  1. Split your text into words and remove anything you don't consider significant (e.g. strip capitalisation, punctuation, word breaks, etc.).
  2. Convert your text into an array of integers (one integer per unique word), e.g. every instance of "cat" becomes 1 and every instance of "dog" becomes 2. This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers; if a word is not in the dictionary, assign it a new id.
  3. Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time, e.g. using the algorithm and C code here).
  4. Construct the longest common prefix (LCP) array for your suffix array. (This can also be done in linear time, for example using this C code.) The LCP array gives the number of common words at the start of each pair of consecutive suffixes in the suffix array.
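To make the steps concrete, here is a minimal Python sketch of steps 1–4. It is an illustration under stated assumptions, not the linked C code: tokenisation is plain whitespace splitting, and the suffix array is built by ordinary sorting for brevity (not linear time; a production version would substitute a linear-time construction such as SA-IS or DC3). Only the LCP step, Kasai's algorithm, is genuinely linear here.

    def words_to_ints(words):
        # Step 2: hash-based dictionary mapping each unique word to an integer.
        # (To stop phrases crossing sentence ends, as noted further below, you
        # could instead give each full stop its own fresh, never-repeated id.)
        ids = {}
        text = []
        for w in words:
            if w not in ids:
                ids[w] = len(ids)      # unseen word: assign a new id
            text.append(ids[w])
        return text, ids

    def suffix_array(text):
        # Step 3, simplified: sort all suffix start positions by comparing
        # the suffixes themselves. Not linear time; swap in SA-IS/DC3 for
        # large inputs.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def lcp_array(text, sa):
        # Step 4: Kasai's algorithm, O(n). lcp[r] is the number of leading
        # words shared by the suffixes at ranks r-1 and r of the suffix array.
        n = len(text)
        rank = [0] * n
        for r, i in enumerate(sa):
            rank[i] = r
        lcp = [0] * n
        h = 0
        for i in range(n):
            if rank[i] > 0:
                j = sa[rank[i] - 1]
                while i + h < n and j + h < n and text[i + h] == text[j + h]:
                    h += 1
                lcp[rank[i]] = h
                if h:
                    h -= 1
        return lcp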

You are now in a position to collect your common phrases.

It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array and looking at places where the LCP array is >= 4. Each maximal run of consecutive indices x = start+1, ..., start+len-1 with LCP[x] >= 4 corresponds to a phrase shared by the len suffixes at positions start through start+len-1 of the suffix array, i.e. a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
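Continuing the sketch above (same hypothetical helpers), the scan over the LCP array might look like this; each maximal run of LCP values >= 4 yields one repeated 4-word phrase together with its occurrence count:

    def repeated_phrases(words, k=4):
        text, ids = words_to_ints(words)
        sa = suffix_array(text)
        lcp = lcp_array(text, sa)
        rev = {v: w for w, v in ids.items()}   # id -> word, for display
        phrases = {}
        x = 1
        while x < len(sa):
            if lcp[x] >= k:
                start = x - 1                  # first suffix of the run
                while x < len(sa) and lcp[x] >= k:
                    x += 1
                # Suffixes sa[start..x-1] all begin with the same k words,
                # so this phrase occurs (x - start) times.
                phrase = " ".join(rev[t] for t in text[sa[start]:sa[start] + k])
                phrases[phrase] = x - start
            else:
                x += 1
        return phrases

    words = "we like big dogs here and we like big dogs there".split()
    print(repeated_phrases(words))             # {'we like big dogs': 2}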

Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.

Knap answered 27/10, 2013 at 19:55 Comment(4)
I like the idea of uniquing words; that's one good thing. After that, constructing a sorted suffix array in linear time seems like an impossible endeavor in general, since sorting is linearithmic (unless I'm missing something obvious). Also, I think you are answering the wrong question. The question is about the most common phrase, not the longest common phrase. – Lir
1) I agree the question is about the most common phrase. My answer is about finding len, which gives the number of times each phrase of a certain number of words repeats. 2) The linear-time method for constructing the suffix array makes use of radix sort to avoid needing n log n time for the sorting. – Knap
Radix sort is linear in the worst case only if the length of the keys is O(1). You are sorting n keys (the suffixes), each of them of size at most n. The keys being all different, their length is at least log(n), and hence the complexity of that radix sort cannot be less than linearithmic. – Lir
@naitoon: The items in a suffix array are not arbitrary, and this can be exploited to construct a suffix array, augmented with an LCP array, in O(n). – Senn
