The `nltk` package is specialised in handling text and has various functions you can use to 'tokenize' text into words. You can either use the `RegexpTokenizer`, or `word_tokenize` with a slight adaptation.

The easiest and simplest option is the `RegexpTokenizer`:
```python
import nltk

text = "[email protected] said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
```
Which returns:
```python
['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']
```
Or you can use the slightly smarter `word_tokenize`, which is able to split most contractions, such as `didn't` into `did` and `n't`:
```python
import re

import nltk

nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "[email protected] said: I've taken 2 reports to the boss. I didn't do the other things."
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]
```
Which returns:

```python
['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
```