You can achieve good results using nltk
, Textblob
, SpaCy
or any of the many other libraries out there. These libraries will all do the job but with different degrees of efficiency.
import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')
nlp1 = spacy.load('en_core_web_lg')
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
On my windows 10 2 cores, 4 processors, 8GB ram i5 hp laptop, in jupyter notebook, I ran some comparisons and here are the results.
For TextBlob:
%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 8.01 ms #average over 20 iterations
For nltk:
%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 7.09 ms #average over 20 iterations
For spacy:
%%time
print([ent.text for ent in nlp(txt) if ent.pos_ == 'NOUN'])
And the output is
>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 30.19 ms #average over 20 iterations
It seems nltk
and TextBlob
are reasonably faster and this is to be expected since store nothing else about the input text, txt
. Spacy is way slower. One more thing. SpaCy
missed the noun NLP
while nltk
and TextBlob
got it. I would shot for nltk
or TextBlob
unless there is something else I wish to extract from the input txt
.
Check out a quick start into spacy
here.
Check out some basics about TextBlob
here.
Check out nltk
HowTos here
if pos.startswith('NN'):
, also use aset
orcollections.Counter
, don't keep a list. And do some map/reduce instead of a list comprehension. Otherwise, tryshallow parsing
, akachunking
– Giraldo