Efficient algorithm to find most common phrases in a large volume of text
I am thinking about writing a program to collect for me the most common phrases in a large volume of text. Had the problem been reduced to just finding words, then that would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing every possible phrase of a sentence as a key seems infeasible.

Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
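To illustrate, here is the brute-force version of that idea for a bounded phrase length, as a Python sketch (the function name and the length bounds are just for illustration, and whitespace tokenisation is assumed):

    # Brute force: count every phrase of 2..4 words in a hash map.
    # Memory grows with the number of distinct n-grams, which is exactly
    # what makes this infeasible for unbounded phrase lengths.
    from collections import Counter

    def count_ngrams(words, min_len=2, max_len=4):
        counts = Counter()
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts

    words = "the cat sat on the mat and the cat sat down".split()
    for phrase, count in count_ngrams(words).most_common(3):
        print(" ".join(phrase), count)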

Purposive answered 27/10, 2013 at 18:49 Comment(5)
Maybe you could look into something like a trie, where a node also stores its occurrence count and a path down the trie forms a phrase? – Cute
Taking your last paragraph as the real question, maybe your problem is just defining what a phrase is. If that's the question, consider a natural language processing tool like NLTK. In that context, an object that extracts phrases is called a "chunker". – Lir
How long is a phrase? The algorithm is pretty much the same whether you're doing one-word phrases or 10-word phrases. The only difference is the amount of data you have to process. – Hospital
Have a look here: #186197 – Karimakarin
This problem is similar to collocation extraction, since a phrase is a collocation of several words. – Glosseme
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").

If so, then I would recommend the following linear-time approach (all four steps are sketched in code after the list):

  1. Split your text into words and remove anything you don't consider significant (e.g. strip capitalisation, punctuation, word breaks, etc.).
  2. Convert your text into an array of integers (one integer per unique word), e.g. every instance of "cat" becomes 1 and every instance of "dog" becomes 2. This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers; if a word is not in the dictionary, assign it a new id.
  3. Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time, e.g. using the algorithm and C code here).
  4. Construct the longest common prefix (LCP) array for your suffix array. (This can also be done in linear time, for example using this C code.) The LCP array gives the number of common words at the start of each pair of consecutive suffixes in the suffix array.
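To make the steps concrete, here is a minimal Python sketch of steps 1–4. It is an illustration under stated assumptions, not the linked C code: tokenisation is plain whitespace splitting, and the suffix array is built by ordinary sorting for brevity (not linear time; a production version would substitute a linear-time construction such as SA-IS or DC3). Only the LCP step, Kasai's algorithm, is genuinely linear here.

    def words_to_ints(words):
        # Step 2: hash-based dictionary mapping each unique word to an integer.
        # (To stop phrases crossing sentence ends, as noted further below, you
        # could instead give each full stop its own fresh, never-repeated id.)
        ids = {}
        text = []
        for w in words:
            if w not in ids:
                ids[w] = len(ids)      # unseen word: assign a new id
            text.append(ids[w])
        return text, ids

    def suffix_array(text):
        # Step 3, simplified: sort all suffix start positions by comparing
        # the suffixes themselves. Not linear time; swap in SA-IS/DC3 for
        # large inputs.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def lcp_array(text, sa):
        # Step 4: Kasai's algorithm, O(n). lcp[r] is the number of leading
        # words shared by the suffixes at ranks r-1 and r of the suffix array.
        n = len(text)
        rank = [0] * n
        for r, i in enumerate(sa):
            rank[i] = r
        lcp = [0] * n
        h = 0
        for i in range(n):
            if rank[i] > 0:
                j = sa[rank[i] - 1]
                while i + h < n and j + h < n and text[i + h] == text[j + h]:
                    h += 1
                lcp[rank[i]] = h
                if h:
                    h -= 1
        return lcp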

You are now in a position to collect your common phrases.

It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array and looking at places where the LCP array is >= 4. Each maximal run of consecutive indices x = start+1, ..., start+len-1 with LCP[x] >= 4 corresponds to a phrase shared by the len suffixes at positions start through start+len-1 of the suffix array, i.e. a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
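Continuing the sketch above (same hypothetical helpers), the scan over the LCP array might look like this; each maximal run of LCP values >= 4 yields one repeated 4-word phrase together with its occurrence count:

    def repeated_phrases(words, k=4):
        text, ids = words_to_ints(words)
        sa = suffix_array(text)
        lcp = lcp_array(text, sa)
        rev = {v: w for w, v in ids.items()}   # id -> word, for display
        phrases = {}
        x = 1
        while x < len(sa):
            if lcp[x] >= k:
                start = x - 1                  # first suffix of the run
                while x < len(sa) and lcp[x] >= k:
                    x += 1
                # Suffixes sa[start..x-1] all begin with the same k words,
                # so this phrase occurs (x - start) times.
                phrase = " ".join(rev[t] for t in text[sa[start]:sa[start] + k])
                phrases[phrase] = x - start
            else:
                x += 1
        return phrases

    words = "we like big dogs here and we like big dogs there".split()
    print(repeated_phrases(words))             # {'we like big dogs': 2}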

Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.

Knap answered 27/10, 2013 at 19:55 Comment(4)
I like the idea of uniquing words; that's one good thing. After that, constructing a sorted suffix array in linear time seems like an impossible endeavor in general, since sorting is linearithmic (unless I'm missing something obvious). Also, I think you are answering the wrong question. The question is about the most common phrase, not the longest common phrase. – Lir
1) I agree the question is about the most common phrase. My answer is about finding len, which gives the number of times each phrase of a certain number of words repeats. 2) The linear-time method for constructing the suffix array makes use of radix sort to avoid needing n log n time for the sorting. – Knap
Radix sort is linear in the worst case only if the length of the keys is O(1). You are sorting n keys (the suffixes), each of them of size at most n. The keys being all different, their length is at least log(n), and hence the complexity of that radix sort cannot be less than linearithmic. – Lir
@naitoon: The items in a suffix array are not arbitrary, and this can be exploited to construct a suffix array, augmented with an LCP array, in O(n). – Senn
