Find the most repeated phrases in a huge text

I have a huge amount of text data. My entire database is plain text in UTF-8.

I need a list of the most repeated phrases across my whole text data.

For example, my desired output would be something like this:

{
  'a': 423412341,
  'this': 423412341,
  'is': 322472341,
  'this is': 222472341,
  'this is a': 122472341,
  'this is a my': 5235634
}

Processing and storing each phrase takes up a huge amount of database space, for example when stored in MySQL or MongoDB. The question is: is there a more efficient database or algorithm for finding this result? Solr, Elasticsearch, etc.?

I think a maximum of 10 words per phrase would be good for me.

Communist answered 20/4, 2015 at 16:41 Comment(2)
I suggest including a maximum on the number of words in your phrases. – Guava
The problem of finding common phrases in a text is called "collocation extraction." – Hardnosed

I'd suggest combining ideas from two fields here: streaming algorithms, and the Apriori algorithm from market-basket analysis.

  1. Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams), can do this very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).

  2. Now to the problem of the k most frequent phrases (of possibly multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l. A rough sketch of this extension step follows below.
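
As a rough illustration of this extension step (my sketch, not the answer's own code): a minimal Python function where corpus_sentences is a placeholder for however you iterate over your text, and phrases are represented as tuples of words.

from collections import Counter

def extend_frequent_phrases(corpus_sentences, frequent_l_phrases, k):
    # Given the k most frequent phrases of length l (a non-empty collection of
    # word tuples), scan the corpus and count only (l+1)-word phrases whose
    # l-word prefix is one of them; return the k most frequent of those.
    l = len(next(iter(frequent_l_phrases)))
    prefixes = set(frequent_l_phrases)
    counts = Counter()
    for sentence in corpus_sentences:          # one pass over the corpus
        words = sentence.split()
        for i in range(len(words) - l):
            candidate = tuple(words[i:i + l + 1])
            if candidate[:-1] in prefixes:     # only extend a frequent prefix
                counts[candidate] += 1
    return dict(counts.most_common(k))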


A Short Description of the Sampling Algorithm

This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.

In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.

In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.

Output the top k of the items resulting from the second stage.

Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
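
A minimal Python sketch of the two stages, under the assumption that f is a relative frequency threshold (the description above leaves the units open) and, purely for illustration, that the corpus fits in a word list; in practice both stages would stream over the text files.

import math
import random
from collections import Counter

def sample_candidates(words, f, seed=0):
    # Stage 1: draw ~ log(n) / f random positions; with high probability every
    # word whose relative frequency is at least f lands in this candidate set.
    rng = random.Random(seed)
    n = len(words)
    sample_size = min(n, int(math.log(n) / f) + 1)
    return {words[i] for i in rng.sample(range(n), sample_size)}

def count_candidates(words, candidates):
    # Stage 2: one full pass over the corpus, counting only the candidates.
    counts = Counter()
    for w in words:
        if w in candidates:
            counts[w] += 1
    return counts

# Toy usage; the second stage can be run in parallel over text segments and
# the resulting Counters merged, as noted above.
words = "this is a test this is a my test this is a".split()
candidates = sample_candidates(words, f=0.2)
top_k = count_candidates(words, candidates).most_common(3)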

Clout answered 6/5, 2015 at 14:44 Comment(4)
Good answer... it seems good, I'll think about it... I need to know: is there no alternative solution based on a full-text search database like Solr or Elasticsearch? I think MongoDB is the best choice for this algorithm. – Communist
Thanks. If your entire database is in text form, I wouldn't go for any of these tools, and would instead implement the above directly using some programming language. E.g., what would MongoDB give you here? – Clout
Here is a SO question about Solr for (a limited version of) this problem. As you can see in the comments, it might be quite slow. I'd suggest programming this directly. – Clout
Good approach, but the Apriori algorithm doesn't apply as described here; the top 1-gram isn't necessarily part of the top 2-gram, or of any repeated 2-gram for that matter. All you can say is that any n-gram with frequency f must contain a prefix (and a suffix) that is an (n-1)-gram of at least frequency f. – Kiln

If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:

http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100

Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:

http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
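
As a small illustration (not part of the original answer), the Terms Component query above could be consumed from Python roughly like this; the host, port and field name follow the example URL, and json.nl=map simply asks Solr to render the term list as a term-to-count object.

import requests

def top_indexed_terms(field="fulltext", limit=100):
    # Ask Solr's Terms Component for the most frequent indexed terms in `field`.
    params = {
        "terms.fl": field,
        "terms.limit": limit,
        "wt": "json",
        "json.nl": "map",   # return the terms as a {term: count} object
    }
    resp = requests.get("http://127.0.0.1:8983/solr/terms", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["terms"][field]

print(top_indexed_terms())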
Visually answered 8/5, 2015 at 10:32 Comment(0)

Have you considered using MapReduce?

Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens of up to 10 words. I don't think that's a big deal. The outcome of the MR job will be token -> frequency pairs, which you can pass to another job to sort by frequency (one option). I would suggest reading up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediate outputs.

Original paper on MapReduce by Google.
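
As a rough sketch (mine, not the answer's) of what the tokenizer and the MR job could look like with Hadoop Streaming: a mapper/reducer pair in Python, where whitespace tokenization and the 10-word limit are assumptions taken from the question.

import sys
from itertools import groupby

MAX_WORDS = 10  # the question's assumed phrase-length limit

def mapper():
    # Emit every 1..10-word phrase on each input line as "phrase<TAB>1".
    for line in sys.stdin:
        words = line.split()
        for i in range(len(words)):
            for n in range(1, MAX_WORDS + 1):
                if i + n > len(words):
                    break
                print(f"{' '.join(words[i:i + n])}\t1")

def reducer():
    # Sum the counts per phrase; the shuffle delivers input sorted by key.
    pairs = (line.rstrip("\n").rsplit("\t", 1) for line in sys.stdin)
    for phrase, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{phrase}\t{sum(int(c) for _, c in group)}")

if __name__ == "__main__":
    # run with argument "map" for the map phase, anything else for the reduce
    mapper() if sys.argv[1:] == ["map"] else reducer()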

Vancouver answered 8/5, 2015 at 11:33 Comment(0)

Tokenize the text into n-grams of 1 to 10 words and insert them into 10 SQL tables, one per token length. Make sure to use a hash index on the column holding the string tokens. Then just run SELECT token, COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.

EDIT: that would be infeasible for large datasets; instead, for each n-gram, either increase its count by 1 or insert a new row into the table (in MySQL the useful query would be INSERT ... ON DUPLICATE KEY UPDATE). You should definitely still use hash indexes, though.

After that, just sort by number of occurrences and merge the data from these 10 tables (you could do that in a single step, but it would put more strain on memory).
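
A rough sketch of this approach in Python, using SQLite only so the example is self-contained: its INSERT ... ON CONFLICT ... DO UPDATE (SQLite 3.24+) stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE, and its ordinary primary-key index stands in for the hash index mentioned above.

import sqlite3

MAX_WORDS = 10

def build_tables(conn):
    # One table per token length, keyed by the token string.
    for n in range(1, MAX_WORDS + 1):
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS grams_{n} "
            "(token TEXT PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0)"
        )

def count_ngrams(conn, lines):
    # For each n-gram, either insert it with count 1 or bump its count,
    # mirroring MySQL's INSERT ... ON DUPLICATE KEY UPDATE.
    for line in lines:
        words = line.split()
        for n in range(1, MAX_WORDS + 1):
            for i in range(len(words) - n + 1):
                token = " ".join(words[i:i + n])
                conn.execute(
                    f"INSERT INTO grams_{n} (token, cnt) VALUES (?, 1) "
                    "ON CONFLICT(token) DO UPDATE SET cnt = cnt + 1",
                    (token,),
                )
    conn.commit()

def top_ngrams(conn, n, limit=100):
    # The "sort by number of occurrences" step, per table.
    return conn.execute(
        f"SELECT token, cnt FROM grams_{n} ORDER BY cnt DESC LIMIT ?", (limit,)
    ).fetchall()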

Be wary of heuristic methods like the one suggested by Ami Tavory: if you select the wrong parameters, you can get wrong results (the flaw in the sampling algorithm shows up on some classic terms or phrases, e.g. "habeas corpus", where neither "habeas" nor "corpus" will be selected as frequent by itself, but as a two-word phrase it may very well rank higher than some phrases you get by appending/prepending words to a common word). There is surely no need to use heuristics for tokens of shorter length; use them only when the classic methods fail (take too much time or memory).

Kingdom answered 8/5, 2015 at 13:36 Comment(0)

The top answer by Ami Tavory states:

Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.

While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the most frequent 2-grams must be extensions of the most frequent 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):

Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.

Looking at the most frequent single words, we get:

1-Gram  Frequency
------  ---------
a       12
will    6
corpus  5
tricksy 4
or      3
from    2
it      2
the     2
very    2
you     2

The method suggested by Ami Tavori would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:

2-Gram          Frequency
------          ---------
corpus will     5
tricksy corpus  4
or a            3
a very          2

And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:

3-Gram                Frequency
------                ---------
tricksy corpus will   4

To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
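
A rough Python sketch of this corrected pruning rule (my illustration, not the answer's code): keep every m-gram that reaches some repetition threshold, not just the top ones, and only extend those when counting (m+1)-grams. The threshold and the word-list input are placeholders.

from collections import Counter

def frequent_ngrams(words, max_n=10, min_count=2):
    # An (m+1)-gram can only reach min_count if its m-word prefix does, so we
    # extend every m-gram at or above the threshold, not the single top one.
    results = {}
    frequent = {g: c for g, c in Counter((w,) for w in words).items()
                if c >= min_count}
    m = 1
    while frequent:
        results.update(frequent)
        candidates = Counter()
        for i in range(len(words) - m):
            gram = tuple(words[i:i + m + 1])
            if gram[:-1] in frequent:        # the prefix must itself repeat
                candidates[gram] += 1
        frequent = {g: c for g, c in candidates.items() if c >= min_count}
        m += 1
        if m >= max_n:
            results.update(frequent)
            break
    return results

# On the corpus above (lowercased, punctuation stripped), this would keep
# e.g. ('tricksy', 'corpus', 'will') with count 4 even though 'a' is the top 1-gram.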

Kiln answered 19/9, 2018 at 20:15 Comment(0)

This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use PHP's regex functions to extract matches. Keep the totals in a global variable. Write the results to another file. That's it.
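
The answer proposes PHP; purely as an illustration of the same single-pass idea (kept in Python for consistency with the other sketches in this thread), with the file names as placeholders and everything held in one in-memory counter, which is what the scaling objection in the comments below is about.

import re
from collections import Counter

counts = Counter()
with open("corpus.txt", encoding="utf-8") as fh:   # placeholder file name
    for line in fh:
        words = re.findall(r"\w+", line.lower())
        # count every phrase of 1 to 10 consecutive words on this line
        for n in range(1, 11):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1

with open("results.txt", "w", encoding="utf-8") as out:
    for phrase, c in counts.most_common(1000):
        out.write(f"{phrase}\t{c}\n")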

Abeyance answered 7/5, 2015 at 13:3 Comment(3)
The problem is scaling... huge text doesn't work with this type of manipulation. – Communist
Scaling? Really? Are you performing this calculation in real time? I hope not. Even if you are, you can build a caching layer on top of it. It's not like the "huge text" changes. Also, define huge text. How many characters are we talking about here? No matter what type of data store you use, you have to read the data into memory in order to analyze it. So in this case using a database has no value, because a "LIKE" system won't collect the data you need. – Abeyance
Going further, your question does not mention any scalability requirements. But if it did, a decent Linux box running HHVM would analyze the text as fast as any of the top platform solutions available today. The only platforms that can compete with HHVM are Node.js or Go. – Abeyance
