First, let's count the tokens from both approaches and look at the most common words:
>>> import nltk
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> from nltk.corpus import webtext
>>> counts_from_wordtok = Counter(word_tokenize(webtext.raw('wine.txt')))
>>> counts_from_wordtok.most_common(10)
[(u'.', 2824), (u',', 1550), (u'a', 821), (u'and', 786), (u'the', 706), (u'***', 608), (u'-', 518), (u'of', 482), (u'but', 474), (u'I', 390)]
>>> counts_from_words = Counter(webtext.words('wine.txt'))
>>> counts_from_words.most_common(10)
[(u'.', 2772), (u',', 1536), (u'-', 832), (u'a', 821), (u'and', 787), (u'the', 706), (u'***', 498), (u'of', 482), (u'but', 474), (u'I', 392)]
>>> len(word_tokenize(webtext.raw('wine.txt')))
31140
>>> len(webtext.words('wine.txt'))
31350
Something smells fishy...
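To see where the two approaches disagree, we can subtract one Counter from the other (a quick check reusing the counts_from_wordtok and counts_from_words objects from above; the exact listings depend on your NLTK version, but from the most_common(10) outputs we can already see that '.', '-' and '***' get different counts):
>>> # tokens that word_tokenize() produces more often than webtext.words()
>>> (counts_from_wordtok - counts_from_words).most_common(5)
>>> # tokens that webtext.words() produces more often than word_tokenize()
>>> (counts_from_words - counts_from_wordtok).most_common(5)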
Let's take a closer look at how the webtext interface comes about. It uses the LazyCorpusLoader at https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L235 :
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
If we look at how PlaintextCorpusReader loads and tokenizes the corpus, at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L41 :
class PlaintextCorpusReader(CorpusReader):
    CorpusView = StreamBackedCorpusView

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):
Ah ha! It's using the WordPunctTokenizer instead of the modified TreebankWordTokenizer that word_tokenize() uses by default.
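A quick way to confirm this at the prompt is to peek at the reader's internal word tokenizer (this relies on LazyCorpusLoader delegating attribute access to the underlying PlaintextCorpusReader once the corpus is loaded):
>>> from nltk.corpus import webtext
>>> webtext._word_tokenizer.__class__.__name__
'WordPunctTokenizer'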
The WordPunctTokenizer is a simplistic regex-based tokenizer found at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L171
The word_tokenize() function is a modified TreebankWordTokenizer unique to NLTK: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L97
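The two tokenizers behave quite differently on the same string: word_tokenize() handles clitics and keeps hyphenated words together, while WordPunctTokenizer splits at every punctuation boundary (which already hints at why '-' is so much more frequent in webtext.words()). A small illustration:
>>> from nltk import word_tokenize
>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "I don't like barrel-aged wine."
>>> word_tokenize(s)
['I', 'do', "n't", 'like', 'barrel-aged', 'wine', '.']
>>> WordPunctTokenizer().tokenize(s)
['I', 'don', "'", 't', 'like', 'barrel', '-', 'aged', 'wine', '.']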
If we look at what webtext.words() is calling, we follow https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L81
def words(self, fileids=None):
    """
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
    """
    return concat([self.CorpusView(path, self._read_word_block, encoding=enc)
                   for (path, enc, fileid)
                   in self.abspaths(fileids, True, True)])
to reach _read_word_block()
at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119 :
def _read_word_block(self, stream):
    words = []
    for i in range(20):  # Read 20 lines at a time.
        words.extend(self._word_tokenizer.tokenize(stream.readline()))
    return words
It's reading the file line by line!
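Reading line by line is harmless for a purely regex-based tokenizer like WordPunctTokenizer: its pattern never matches across whitespace, so tokenizing each line separately yields exactly the same tokens as tokenizing the whole text. A small sketch to illustrate:
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> text = "Lovely delicate, fragrant Rhone wine.\nPolished leather and strawberries."
>>> [tok for line in text.split('\n') for tok in wpt.tokenize(line)] == wpt.tokenize(text)
True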
So if we load the webtext corpus and use the WordPunctTokenizer on the raw text, we get the same number:
>>> from nltk.corpus import webtext
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> len(wpt.tokenize(webtext.raw('wine.txt')))
31350
>>> len(webtext.words('wine.txt'))
31350
More mysteries
You can also create a new webtext corpus object by specifying the tokenizer object, e.g.:
>>> from nltk import word_tokenize
>>> from nltk.tokenize import _treebank_word_tokenizer
>>> from nltk.corpus import LazyCorpusLoader, PlaintextCorpusReader
>>> from nltk.corpus import webtext
# LazyCorpusLoader expects a tokenizer object,
# but word_tokenize() is a function, so we have to
# import the tokenizer object that word_tokenize() wraps around
>>> webtext2 = LazyCorpusLoader('webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)
>>> len(webtext2.words('wine.txt'))
28385
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> list(webtext2.words('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine.', u'Polished', u'leather', u'and', u'strawberries.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now.', u'***', u'Liquorice', u',', u'cherry', u'fruit.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish.', u'**', u'Thin', u'and', u'completely', u'uninspiring.', u'*', u'Rough.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great', u'complexity.', u'A', u'good', u'***', u'Charming', u',', u'violet-fragranced', u'nose.']
>>> word_tokenize(webtext2.raw('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine', u'.', u'Polished', u'leather', u'and', u'strawberries', u'.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now', u'.', u'***', u'Liquorice', u',', u'cherry', u'fruit', u'.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish', u'.', u'**', u'Thin', u'and', u'completely', u'uninspiring', u'.', u'*', u'Rough', u'.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch', u'.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great']
That's because word_tokenize() does a sent_tokenize() before actually tokenizing sentences into words: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L113
But PlaintextCorpusReader._read_word_block() doesn't do sent_tokenize() beforehand: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119
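You can see the effect on a single line of text: without a prior sentence split, the treebank tokenizer only detaches the period at the very end of the string, which is exactly why webtext2.words() keeps tokens like u'wine.' glued together:
>>> from nltk import sent_tokenize
>>> from nltk.tokenize import _treebank_word_tokenizer
>>> line = "Lovely delicate, fragrant Rhone wine. Polished leather and strawberries."
>>> _treebank_word_tokenizer.tokenize(line)  # no sentence split: only the final period is detached
['Lovely', 'delicate', ',', 'fragrant', 'Rhone', 'wine.', 'Polished', 'leather', 'and', 'strawberries', '.']
>>> [tok for sent in sent_tokenize(line) for tok in _treebank_word_tokenizer.tokenize(sent)]  # what word_tokenize() does
['Lovely', 'delicate', ',', 'fragrant', 'Rhone', 'wine', '.', 'Polished', 'leather', 'and', 'strawberries', '.']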
Let's do a recount with sentence tokenization first:
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> sum(len(tokenized_sent) for tokenized_sent in webtext2.sents('wine.txt'))
31140
Note: The default sent_tokenizer of PlaintextCorpusReader is nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), which loads the same Punkt model that the nltk.sent_tokenize() function uses.
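A quick sanity check that it really is the same model, by loading the pickle directly and comparing it against sent_tokenize() on the raw text:
>>> import nltk.data
>>> from nltk import sent_tokenize
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt.tokenize(webtext2.raw('wine.txt')) == sent_tokenize(webtext2.raw('wine.txt'))
True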
Voila!
Why doesn't words() do sentence tokenization first?
I think it's because it was originally built around the WordPunctTokenizer, which doesn't need the string to be sentence-tokenized first, whereas the TreebankWordTokenizer expects the string to be sentence-tokenized first.
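If you want a reader whose words() lines up with word_tokenize(), one option is to sentence-tokenize each line before word-tokenizing it. This is a sketch of my own (SentSplittingCorpusReader and webtext3 are hypothetical names, not part of NLTK), relying on the _sent_tokenizer and _word_tokenizer attributes that PlaintextCorpusReader sets in its __init__:
from nltk.corpus import LazyCorpusLoader
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import _treebank_word_tokenizer

class SentSplittingCorpusReader(PlaintextCorpusReader):
    """Sentence-tokenize each line before word-tokenizing it,
    mimicking what word_tokenize() does on the raw text."""
    def _read_word_block(self, stream):
        words = []
        for i in range(20):  # keep the parent's 20-lines-at-a-time convention
            for sent in self._sent_tokenizer.tokenize(stream.readline()):
                words.extend(self._word_tokenizer.tokenize(sent))
        return words

webtext3 = LazyCorpusLoader(
    'webtext', SentSplittingCorpusReader, r'(?!README|\.).*\.txt',
    encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)
This still won't match word_tokenize() exactly whenever a sentence runs across a line break, since each line is still read and tokenized in isolation, but it removes the "only the last period on each line gets detached" artefact.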
Why is it that in the age of "deep learning" and "machine learning" we are still using regex-based tokenizers, while everything else in NLP is largely built on top of these tokens?
I have no idea... But there are alternatives, e.g. http://gmb.let.rug.nl/elephant/about.php