How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally not enforcing word-for-word matching).

My example is any restaurant page on Yelp.com, which shows 3 snippets drawn from hundreds of reviews of that restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page:

http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here.

Romance asked 16/3, 2010 at 8:42 Comment(2)
With NLTK, it's easy enough to get bigrams and trigrams, but what I'm looking for are phrases that are more likely 7–8 words in length. I have not figured out how to make NLTK (or some other method) provide such 'octograms' and above. – Romance
Maybe you can try graph-based algorithms like TextRank: github.com/ceteri/pytextrank – Tribade

I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.

To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find the words that co-occur together much more often than you would expect them to by chance.
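To make "more often than chance" concrete: the PMI of a word pair is the log of its joint probability divided by the product of the individual word probabilities. A toy, hand-rolled sketch (the counts below are made up purely for illustration):

import math

def pmi(count_pair, count_w1, count_w2, total_words):
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    p_xy = count_pair / total_words
    p_x = count_w1 / total_words
    p_y = count_w2 / total_words
    return math.log2(p_xy / (p_x * p_y))

# A pair like ("Little", "Tokyo") that shows up together far more often
# than its individual word frequencies would predict gets a high score.
print(pmi(count_pair=30, count_w1=50, count_w2=60, total_words=10000))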

The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
Admirable answered 16/3, 2010 at 9:35 Comment(3)
Yes, I agree, and looking at that page I can get as far as bigrams and trigrams, but how is this extended to n-grams? I believe I'll need phrases of length > 5 to be truly interesting, and perhaps I'm expressing my ignorance, but this demo page only lets me get 2- and 3-word sets? – Romance
For that, I think you'll need to extend nltk.collocations.AbstractCollocationFinder, using BigramCollocationFinder and TrigramCollocationFinder as a guide; see nltk.googlecode.com/svn/trunk/doc/api/… (a quadgram sketch follows these comments). But are you sure you really need such long phrases? On Yelp, it looks like they're highlighting single words and collocations with a couple of words in them; in your linked example they have sashimi, Little Tokyo, and fish. They then select one complete sentence that contains each interesting word or phrase. – Admirable
This. I think you are absolutely correct. Brilliant (and elegant) observation! – Romance
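For longer collocations without writing a custom finder, newer NLTK releases also ship a QuadgramCollocationFinder; a minimal sketch along the same lines as the answer above, assuming an NLTK version that includes it and scores quadgrams with PMI the same way:

import nltk
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

quadgram_measures = QuadgramAssocMeasures()

# same sample corpus as above; swap in your own token list
finder = QuadgramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only 4-grams that appear 3+ times
finder.apply_freq_filter(3)

# the 10 4-grams with the highest PMI
print(finder.nbest(quadgram_measures.pmi, 10))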

If you just want to get n-grams larger than 3, you can try this. I'm assuming you've stripped out all the junk like HTML tags.

import nltk

raw = <yourtextfile here>   # your cleaned text, with the HTML already stripped out
tokens = nltk.word_tokenize(raw)

ngramlimit = 6
ngramlist = []
for n in range(1, ngramlimit + 1):          # unigrams up through 6-grams
    ngramlist.extend(nltk.ngrams(tokens, n))

Probably not very Pythonic, as I've only been doing this for a month or so myself, but it might be of help!

Pulse answered 28/3, 2010 at 21:12 Comment(3)
-1, this did nothing for me. I am in the same situation as the OP, and your method just returned an enormous list of tuples that followed the structure of the original text. How should I proceed? – Wampumpeag
Once you have that list, you need to loop through it and count the occurrences of each unique n-gram. One way to do this is to build a dict keyed by the n-gram and increment its value each time you see a match (a short sketch follows these comments). – Pulse
I don't get this either. How do you count the unique n-grams? It's a bag of individual words. – Phototypography
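A minimal sketch of the counting step described in the comments, using collections.Counter rather than a hand-rolled dict (ngramlist is the list built in the answer above):

from collections import Counter

ngram_counts = Counter(ngramlist)                  # how often each n-gram occurs
for ngram, count in ngram_counts.most_common(10):  # the ten most frequent n-grams
    print(count, " ".join(ngram))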

I think what you're looking for is chunking. I recommend reading chapter 7 of the NLTK book, or maybe my own article on chunk extraction. Both of these assume knowledge of part-of-speech tagging, which is covered in chapter 5.
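For illustration, a minimal noun-phrase chunking sketch with nltk.RegexpParser; the grammar below is a common textbook pattern, not taken from either of the linked references:

import nltk

sentence = "Try the fresh sashimi at Sushi Gen"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # requires the NLTK tagger models

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                    # optional determiner, adjectives, nouns
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# each NP subtree is a candidate phrase
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))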

Marjoriemarjory answered 15/4, 2010 at 2:37 Comment(2)
I really don't see what chunking has to do with it. – Wampumpeag
Chunking can parse phrases, and once you have phrases, you can identify common and significant phrases. – Marjoriemarjory

Well, for a start you would probably have to remove all HTML tags (search for "<[^>]*>" and replace it with ""). After that, you could try the naive approach of looking for the longest common substrings between every two text items, but I don't think you'd get very good results. You might do better by normalizing the words first (reducing them to their base form, removing all accents, setting everything to lower or upper case) and then analysing. Again, depending on what you want to accomplish, you might be able to cluster the text items better if you allow for some word-order flexibility, i.e. treat the text items as bags of normalized words and measure bag-content similarity.
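A rough sketch of that last idea; the regexes, the Jaccard-style measure, and the skipped base-form/accent normalization are my own simplifications:

import re
from collections import Counter

def normalize(html):
    text = re.sub(r"<[^>]*>", "", html)            # strip HTML tags
    words = re.findall(r"[a-z']+", text.lower())   # crude lowercase tokenization
    return Counter(words)                          # bag of normalized words

def bag_similarity(bag_a, bag_b):
    # overlap of the two bags divided by their union (Jaccard-style)
    overlap = sum((bag_a & bag_b).values())
    union = sum((bag_a | bag_b).values())
    return overlap / union if union else 0.0

a = normalize("<p>Try the hamburger, it's great!</p>")
b = normalize("<div>The hamburger here is great.</div>")
print(bag_similarity(a, b))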

I've commented on a similar (although not identical) topic here.

Telecast answered 16/3, 2010 at 9:21 Comment(0)
