How to parse sentences based on lexical content (phrases) with Python-NLTK

Can Python-NLTK recognize an input string and parse it not only on whitespace but also on its content, so that, say, "computer system" becomes a single phrase? Can anyone provide sample code?


Input string: "A survey of user opinion of computer system response time"

Expected output: ["A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"]

Pure answered 1/12, 2014 at 17:56

Comment: You want a parse tree: nltk.org/book/ch08.html. For some predefined parsers that you don't need to define yourself, check out https://mcmap.net/q/265355/-english-grammar-for-parsing-in-nltk-closed/456188 – Mcginty
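
As a minimal illustration of the parse-tree route that comment points to, here is a toy, hand-written grammar (purely illustrative, not a general solution) in which "computer system" forms its own constituent:

import nltk

# A hand-written toy grammar for (a prefix of) the example sentence.
# "computer system" is derived as its own Nom constituent inside an NP.
grammar = nltk.CFG.fromstring("""
S -> NP
NP -> Det Nom | Nom | NP PP
Nom -> N N | N
PP -> P NP
Det -> 'A'
P -> 'of'
N -> 'survey' | 'user' | 'opinion' | 'computer' | 'system'
""")

parser = nltk.ChartParser(grammar)
tokens = "A survey of user opinion of computer system".split()
tree = next(parser.parse(tokens))  # the grammar is ambiguous; take one parse
tree.pretty_print()

This only works because the grammar was written for this exact sentence; the linked chapter explains how to build and use grammars more generally.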

The technique you're looking for goes by multiple names, drawn from multiple sub-fields (or sub-sub-fields) of linguistics and computing.


I'll give an example of the named-entity (NE) chunker in NLTK:

>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
... 
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')

With named entities:

>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
... 
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)

You can see it's pretty flawed; better something than nothing, I guess.


  • Multi-Word Expression (MWE) extraction
    • A hot topic in NLP; everyone wants to extract MWEs for one reason or another
    • The most notable work is by Ivan Sag (http://lingo.stanford.edu/pubs/WP-2001-03.pdf), alongside a miasma of extraction algorithms and applications in ACL papers
    • MWEs remain mysterious: we don't know how to classify or extract them automatically in a principled way, and there are no proper tools for the task (strangely, the output that MWE researchers want can often be obtained with keyphrase extraction or chunking; a minimal collocation-based sketch follows this list)
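
For a taste of what MWE-flavored extraction looks like in practice, here is a minimal sketch using NLTK's collocations module; the toy corpus is made up for this illustration, and real use needs far more text to produce stable statistics:

from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# A toy corpus (invented for this sketch); collocation statistics
# only become meaningful on much larger amounts of text.
corpus = """A survey of user opinion of computer system response time .
The computer system was slow , so we measured the response time again .
Users complained about the computer system and its response time ."""

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(corpus.lower()))
finder.apply_freq_filter(2)  # keep bigrams that occur at least twice

# Rank candidate two-word phrases by pointwise mutual information.
print(finder.nbest(bigram_measures.pmi, 5))
# ('computer', 'system') and ('response', 'time') should rank near the top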


Now back to OP's question.

Q: Can NLTK extract "computer system" as a phrase?

A: Not really

As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even then not all named entities are well recognized.
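
That said, if you already know which multi-word phrases you care about, NLTK does ship a tokenizer that merges them at tokenization time: nltk.tokenize.MWETokenizer. It doesn't discover phrases, you must supply the list yourself, but it produces exactly the output the OP asked for:

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

sent = "A survey of user opinion of computer system response time"

# The multi-word expressions must be supplied up front; nothing is learned.
tokenizer = MWETokenizer([('computer', 'system')], separator=' ')
print(tokenizer.tokenize(word_tokenize(sent)))
# ['A', 'survey', 'of', 'user', 'opinion', 'of', 'computer system', 'response', 'time']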

Possibly the OP could try a more radical idea: assume that a sequence of nouns always forms a phrase:

>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...             current_chunk.append((word, pos))
...     else:
...             if current_chunk:
...                     chunks.append(current_chunk)
...             current_chunk = []
... 
>>> if current_chunk:  # don't forget the trailing nouns at the end of the sentence
...     chunks.append(current_chunk)
... 
>>> for i in chunks:
...     print(i)
... 
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]

So even with that solution, it seems hard to get 'computer system' on its own. But if you think about it for a bit, 'computer system response time' seems like a more valid phrase than 'computer system'.
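
As an aside, the same noun-run heuristic can be expressed more compactly with NLTK's RegexpParser and a one-rule chunk grammar (a sketch of the same idea, not a smarter method):

from nltk import word_tokenize, pos_tag, RegexpParser

sent = "A survey of user opinion of computer system response time"

# One chunk rule: one or more consecutive noun tags (NN, NNS, NNP, ...) form an NP.
chunker = RegexpParser("NP: {<NN.*>+}")
tree = chunker.parse(pos_tag(word_tokenize(sent)))

for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
    print(" ".join(word for word, tag in subtree.leaves()))
# survey
# user opinion
# computer system response time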

Do note that all interpretations of 'computer system response time' seem valid:

  • [computer system response time]
  • [computer [system [response [time]]]]
  • [computer system] [response time]
  • [computer [system response time]]

And there are many, many more possible interpretations. So you've got to ask what you are using the extracted phrases for, and only then decide how to cut down long phrases like 'computer system response time'.

Schoenberg answered 2/12, 2014 at 0:50

The Python library Constituent-Treelib, which is based on NLTK among other libraries, can be used to extract arbitrary phrasal categories (e.g., noun or verb phrases) from a given sentence.

First, install it via pip install constituent-treelib

Next, perform the following steps:

from constituent_treelib import ConstituentTree, Language

# Specify the language for the sentence and underlying models (here English)
language = Language.English

# Set which spaCy model should be used (default: Medium)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium

# Create the pipeline (note, all models will be downloaded and installed automatically)
nlp = ConstituentTree.create_pipeline(language, spacy_model_size)

# Define the sentence
sentence = "A survey of user opinion of computer system response time"

# Create a ConstituentTree instance to access the functions that extract the phrases
tree = ConstituentTree(sentence, nlp)

Given tree, you can now extract all phrases from the sentence via:

tree.extract_all_phrases(min_words_in_phrases=1)

# Result: 
{'NP': ['A survey of user opinion of computer system response time',
        'A survey',
        'user opinion of computer system response time',
        'user opinion',
        'computer system response time'],
 'NML': ['computer system'],
 'PP': ['of computer system response time']}

In case you want to avoid nested phrases, simply use:

tree.extract_all_phrases(avoid_nested_phrases=True)

# Result:
{'NP': ['A survey of user opinion of computer system response time'],
 'NML': ['computer system'],
 'PP': ['of computer system response time']}
Adey answered 18/6 at 10:35
