Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.

If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that appear before and after each one, but now I really cannot figure out what to do next. The information about the 2- and 3-word phrases is present, but how do I extract this data?

Gazebo answered 29/10, 2009 at 13:11 Comment(5)
You may want to first clean up your input data by removing the common 'noise' words with a stop-word list.Sweaty
@teabot, yes, dealing with noise words is important but should not be done until these words have served other purposes. For example, if you intend to POS-tag the input text, noise words will be required. Even w/o POS-tagging, simpler techniques can make use of these noise words to infer expression boundaries etc.Omaromara
@Kimvais, please provide more background as to the purpose of the big 'tag list'. Is it just to index the text, is it to provide a way of creating tag clouds, is it a step towards categorization of the underlying documents, etc.? With this extra info, SO contributors will be able to better respond to this relatively broad question. It could just be that all you need is Swish-e as hinted by heferav, or it could be that you need more suggestions and ideas on how to tailor various NLP practices to serve a specific need (without necessarily "reinventing the wheel").Omaromara
@Omaromara - the data is communication related to tickets; the purpose would be to detect the keywords of the often lengthy communication so that we could have an automatic 'tag cloud' and lists of tickets that look relevant/related or deal with the same feature, etc.Gazebo
There are various algorithms for named entity recognition, though this might be a slightly different problem.Bushido

Before anything else, try to preserve the information about "boundaries" that comes with the input text.
(If such info has not already been lost; your question implies that the tokenization may already have been done.)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries (such as punctuation, particularly periods, and also multiple LF/CR separations), and use these. Words like "the" can also often be used as boundaries. Such expression boundaries are typically "negative", in the sense that they separate two token instances which are sure not to be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful to filter out some of the n-grams (see next paragraph). Word sequences such as "for example", "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging on using "priors", which I discuss later).
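
Here is a minimal sketch of that idea in plain Python. The BOUNDARY_WORDS set is a hypothetical, illustrative list; a real one would be tuned to your language and domain:

import re

# Hypothetical boundary ("noise") words; tune for your own corpus.
BOUNDARY_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is"}

def candidate_spans(text):
    # Negative boundaries first: sentence punctuation and blank lines.
    for chunk in re.split(r"[.!?;:()\[\]]|\n\s*\n", text):
        words = re.findall(r"[A-Za-z']+", chunk.lower())
        current = []
        for w in words:
            if w in BOUNDARY_WORDS:      # boundary word: close the current span
                if current:
                    yield current
                current = []
            else:
                current.append(w)
        if current:
            yield current

# Each yielded span is a unit that later n-gram counting should not cross.
for span in candidate_spans("The server crashed. We need to restart the billing service."):
    print(span)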

Without using external data (other than the input text), you can have relative success with this by running statistics on the text's bigrams and trigrams (sequences of 2 and 3 consecutive words). The sequences with a significant number of instances will then most likely be the type of "expressions/phrases" you are looking for.
This somewhat crude method will yield a few false positives, but on the whole may be workable. Filtering out the n-grams known to cross "boundaries", as hinted in the first paragraph, may help significantly, because in natural languages sentence endings and sentence starts tend to draw from a limited subset of the message space, and hence produce combinations of tokens that may appear to be statistically well represented but are typically not semantically related.
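
As a rough illustration of that counting step (plain Python only; the cutoff below is an arbitrary frequency threshold, not a proper significance test):

from collections import Counter

def ngram_counts(spans, n):
    # Count n-grams inside each span, never across span boundaries.
    counts = Counter()
    for span in spans:
        for i in range(len(span) - n + 1):
            counts[tuple(span[i:i + n])] += 1
    return counts

# spans could come from a boundary-aware pass like the sketch above
spans = [["billing", "module", "crashed"],
         ["restart", "billing", "module"],
         ["billing", "module", "timeout"]]

bigrams = ngram_counts(spans, 2)
trigrams = ngram_counts(spans, 3)
candidates = {g: c for g, c in {**bigrams, **trigrams}.items() if c >= 2}
print(candidates)    # {('billing', 'module'): 3}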

Better methods (possibly more expensive, processing-wise and design/investment-wise) will make use of extra "priors" relevant to the domain and/or national languages of the input text.

  • POS (Part-Of-Speech) tagging is quite useful, in several ways: it provides additional, more objective expression boundaries, and also classes of "noise" words (for example, all articles, even when used in the context of entities, are typically of little value in the kind of tag clouds the OP wants to produce). A small sketch of this idea follows the list.
  • Dictionaries, lexicons and the like can be quite useful too. In particular, those which identify "entities" (aka instances, in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them it is also possible to normalize them (the many different expressions which can be used for, say, "Senator T. Kennedy"), hence eliminating duplicates but also increasing the frequency of the underlying entities.
  • If the corpus is structured as a document collection, it may be useful to use various tricks related to TF (Term Frequency) and IDF (Inverse Document Frequency).
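
For instance, a small NLTK-based sketch of the POS-tagging idea from the first bullet (this assumes NLTK is installed and its tokenizer/tagger resources have been downloaded; the sentence and the kept tag set are purely illustrative):

import nltk

text = "The billing module crashed after the last deployment."
tagged = nltk.pos_tag(nltk.word_tokenize(text))   # e.g. [('The', 'DT'), ('billing', 'NN'), ...]

# Keep runs of nouns/adjectives as phrase candidates; articles, prepositions,
# verbs, etc. act as boundaries and are dropped.
KEEP = {"NN", "NNS", "NNP", "NNPS", "JJ"}
phrases, current = [], []
for word, tag in tagged + [("", "")]:             # sentinel flushes the last run
    if tag in KEEP:
        current.append(word)
    elif current:
        phrases.append(" ".join(current))
        current = []

print(phrases)   # e.g. ['billing module', 'last deployment']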

[Sorry, gotta go for now (plus I would like more detail about your specific goals etc.). I'll try and provide more detail and pointers later.]

[BTW, I want to plug here Jonathan Feinberg's and Dervin Thunk's responses in this post, as they provide excellent pointers in terms of methods and tools for the kind of task at hand. In particular, NLTK and Python-at-large provide an excellent framework for experimenting.]

Omaromara answered 29/10, 2009 at 13:25 Comment(0)

I'd start with a wonderful chapter, by Peter Norvig, in the O'Reilly book Beautiful Data. He provides the ngram data you'll need, along with beautiful Python code (which may solve your problems as-is, or with some modification) on his personal web site.

Ancylostomiasis answered 29/10, 2009 at 13:20 Comment(6)
That's a great chapter, totally worth reading -- I should know since I'm in the acknowledgments. :-) But it doesn't directly address the posted question. (I think my answer does.) What it's about is the kind of things you can do with n-gram statistics from a text corpus, and how, with examples like spelling correction, solving cryptograms, and segmenting wordscrammedtogetherwithoutspaces.Employ
Yes, but using the linked-to ngram data he could quickly blow through his data set and extract meaningful ngrams.Ancylostomiasis
It sounds like Kimvais already has computed n-gram data from their own dataset and is asking how to use the data to distinguish the word sequences that appear significantly more often than by chance.Employ
That's what I meant by "meaningful" :)Ancylostomiasis
OK. That's what my answer addresses, and that Norvig's code doesn't. I agree that the chapter is worth reading for a great intro to n-grams in general, though many people could follow the collocations refs without it.Employ
We violently agree, sir. Also, I love that you actually have a hypercard stack on your home page.Ancylostomiasis

It sounds like you're looking for collocation extraction. Manning and Schütze devote a chapter to the topic, explaining and evaluating the 'proposed formulas' mentioned in the Wikipedia article I linked to.

I can't fit the whole chapter into this response; hopefully some of their links will help. (NSP sounds particularly apposite.) nltk has a collocations module too, not mentioned by Manning and Schütze since their book predates it.
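
For what it's worth, here is a minimal sketch of that nltk collocations module on an already-tokenized corpus (the frequency filter and the likelihood-ratio measure are just reasonable starting points, not the only choices):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# tokens: the whole corpus as a flat list of words (e.g. from nltk.word_tokenize)
tokens = ["the", "billing", "module", "crashed", "restart",
          "the", "billing", "module", "billing", "module", "timeout"]

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                            # drop bigrams seen fewer than 2 times
print(finder.nbest(measures.likelihood_ratio, 10))     # e.g. [('billing', 'module'), ...]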

The other responses posted so far deal with statistical language processing and n-grams more generally; collocations are a specific subtopic.

Employ answered 29/10, 2009 at 17:46 Comment(0)

Build a matrix indexed by words. Then, whenever two words appear consecutively, add one to the appropriate cell.

For example, take the sentence "for example you have this sentence":

from collections import defaultdict

mat = defaultdict(lambda: defaultdict(int))   # mat[w1][w2] = count of the pair "w1 w2"

mat['for']['example'] += 1
mat['example']['you'] += 1
mat['you']['have'] += 1
mat['have']['this'] += 1
mat['this']['sentence'] += 1

This will give you counts for every pair of consecutive words. You can do the same for three consecutive words, but beware that a dense matrix needs O(n^3) memory for a vocabulary of n words (a sparse map only stores the triples that actually occur).

You can also store the data in a single flat map keyed on the whole phrase, like:

counts = defaultdict(int)     # defaultdict imported above; keys are whole phrases
counts['for example'] += 1
counts['example you'] += 1
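
In practice you would fill these counts in one pass over the token stream; a tiny sketch, assuming tokens is the text already split into words:

from collections import Counter

tokens = "for example you have this sentence".split()

pair_counts = Counter(zip(tokens, tokens[1:]))                  # consecutive word pairs
triple_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))    # consecutive word triples

print(pair_counts.most_common(3))
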
Wirer answered 29/10, 2009 at 13:19 Comment(0)

One way would be to build yourself an automaton, most likely a nondeterministic finite automaton (NFA).

Another, simpler way would be to create a file that contains the words and/or word groups that you want to ignore, find, compare, etc., load them into memory when the program starts, and then compare the file you are parsing against the words/word groups contained in that file.
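
A rough sketch of that second approach, assuming a hypothetical phrases.txt file with one word or word group per line:

# phrases.txt is a hypothetical file: one word or phrase per line.
with open("phrases.txt", encoding="utf-8") as f:
    known = {line.strip().lower() for line in f if line.strip()}

def find_known(text, phrases):
    # Return the listed words/word groups that occur in the text.
    lowered = text.lower()
    return [p for p in phrases if p in lowered]

print(find_known("Please restart the billing module.", known))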

Girth answered 29/10, 2009 at 15:56 Comment(0)
