First, let's count the tokens from both approaches and look at the most common words:
>>> import nltk
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> from nltk.corpus import webtext
>>> counts_from_wordtok = Counter(word_tokenize(webtext.raw('wine.txt')))
>>> counts_from_wordtok.most_common(10)
[(u'.', 2824), (u',', 1550), (u'a', 821), (u'and', 786), (u'the', 706), (u'***', 608), (u'-', 518), (u'of', 482), (u'but', 474), (u'I', 390)]
>>> counts_from_words = Counter(webtext.words('wine.txt'))
>>> counts_from_words.most_common(10)
[(u'.', 2772), (u',', 1536), (u'-', 832), (u'a', 821), (u'and', 787), (u'the', 706), (u'***', 498), (u'of', 482), (u'but', 474), (u'I', 392)]
>>> len(word_tokenize(webtext.raw('wine.txt')))
31140
>>> len(webtext.words('wine.txt'))
31350
Something smells fishy...
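To see where the two approaches disagree, we can subtract one Counter from the other (a quick check reusing the counts_from_wordtok and counts_from_words objects from above; the exact listings depend on your NLTK version, but from the most_common(10) outputs we can already see that '.', '-' and '***' get different counts):
>>> # tokens that word_tokenize() produces more often than webtext.words()
>>> (counts_from_wordtok - counts_from_words).most_common(5)
>>> # tokens that webtext.words() produces more often than word_tokenize()
>>> (counts_from_words - counts_from_wordtok).most_common(5)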
Let's take a closer look at how the webtext interface comes about. It uses the LazyCorpusLoader at https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L235 :
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
If we look at how PlaintextCorpusReader loads and tokenizes the corpus, at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L41 :
class PlaintextCorpusReader(CorpusReader):
    CorpusView = StreamBackedCorpusView

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):
Ah ha! It's using the WordPunctTokenizer instead of the modified TreebankWordTokenizer that word_tokenize() uses by default.
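A quick way to confirm this at the prompt is to peek at the reader's internal word tokenizer (this relies on LazyCorpusLoader delegating attribute access to the underlying PlaintextCorpusReader once the corpus is loaded):
>>> from nltk.corpus import webtext
>>> webtext._word_tokenizer.__class__.__name__
'WordPunctTokenizer'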
The WordPunctTokenizer is a simplistic regex-based tokenizer found at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L171
The word_tokenize() function is a modified TreebankWordTokenizer unique to NLTK: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L97
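The two tokenizers behave quite differently on the same string: word_tokenize() handles clitics and keeps hyphenated words together, while WordPunctTokenizer splits at every punctuation boundary (which already hints at why '-' is so much more frequent in webtext.words()). A small illustration:
>>> from nltk import word_tokenize
>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "I don't like barrel-aged wine."
>>> word_tokenize(s)
['I', 'do', "n't", 'like', 'barrel-aged', 'wine', '.']
>>> WordPunctTokenizer().tokenize(s)
['I', 'don', "'", 't', 'like', 'barrel', '-', 'aged', 'wine', '.']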
If we look at what webtext.words() is calling, we follow https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L81
def words(self, fileids=None):
    """
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
    """
    return concat([self.CorpusView(path, self._read_word_block, encoding=enc)
                   for (path, enc, fileid)
                   in self.abspaths(fileids, True, True)])
to reach _read_word_block()
at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119 :
def _read_word_block(self, stream):
    words = []
    for i in range(20):  # Read 20 lines at a time.
        words.extend(self._word_tokenizer.tokenize(stream.readline()))
    return words
It's reading the file line by line!
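Reading line by line is harmless for a purely regex-based tokenizer like WordPunctTokenizer: its pattern never matches across whitespace, so tokenizing each line separately yields exactly the same tokens as tokenizing the whole text. A small sketch to illustrate:
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> text = "Lovely delicate, fragrant Rhone wine.\nPolished leather and strawberries."
>>> [tok for line in text.split('\n') for tok in wpt.tokenize(line)] == wpt.tokenize(text)
True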
So if we load the webtext corpus and use the WordPunctTokenizer on the raw text, we get the same number:
>>> from nltk.corpus import webtext
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> len(wpt.tokenize(webtext.raw('wine.txt')))
31350
>>> len(webtext.words('wine.txt'))
31350
More mysteries
You can also create a new webtext corpus object by specifying the tokenizer object, e.g.:
>>> from nltk import word_tokenize
>>> from nltk.tokenize import _treebank_word_tokenizer
>>> from nltk.corpus import LazyCorpusLoader, PlaintextCorpusReader
>>> from nltk.corpus import webtext
# LazyCorpusLoader expects a tokenizer object,
# but word_tokenize() is a function, so we have to
# import the tokenizer object that word_tokenize() wraps around
>>> webtext2 = LazyCorpusLoader('webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)
>>> len(webtext2.words('wine.txt'))
28385
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> list(webtext2.words('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine.', u'Polished', u'leather', u'and', u'strawberries.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now.', u'***', u'Liquorice', u',', u'cherry', u'fruit.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish.', u'**', u'Thin', u'and', u'completely', u'uninspiring.', u'*', u'Rough.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great', u'complexity.', u'A', u'good', u'***', u'Charming', u',', u'violet-fragranced', u'nose.']
>>> word_tokenize(webtext2.raw('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine', u'.', u'Polished', u'leather', u'and', u'strawberries', u'.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now', u'.', u'***', u'Liquorice', u',', u'cherry', u'fruit', u'.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish', u'.', u'**', u'Thin', u'and', u'completely', u'uninspiring', u'.', u'*', u'Rough', u'.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch', u'.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great']
That's because word_tokenize() does a sent_tokenize() before actually tokenizing sentences into words: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L113
But PlaintextCorpusReader._read_word_block() doesn't do sent_tokenize() beforehand: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119
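You can see the effect on a single line of text: without a prior sentence split, the treebank tokenizer only detaches the period at the very end of the string, which is exactly why webtext2.words() keeps tokens like u'wine.' glued together:
>>> from nltk import sent_tokenize
>>> from nltk.tokenize import _treebank_word_tokenizer
>>> line = "Lovely delicate, fragrant Rhone wine. Polished leather and strawberries."
>>> _treebank_word_tokenizer.tokenize(line)  # no sentence split: only the final period is detached
['Lovely', 'delicate', ',', 'fragrant', 'Rhone', 'wine.', 'Polished', 'leather', 'and', 'strawberries', '.']
>>> [tok for sent in sent_tokenize(line) for tok in _treebank_word_tokenizer.tokenize(sent)]  # what word_tokenize() does
['Lovely', 'delicate', ',', 'fragrant', 'Rhone', 'wine', '.', 'Polished', 'leather', 'and', 'strawberries', '.']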
Let's do a recount with sentence tokenization first:
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> sum(len(tokenized_sent) for tokenized_sent in webtext2.sents('wine.txt'))
31140
Note: The default sent_tokenizer of PlaintextCorpusReader is nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), which loads the same Punkt model that the nltk.sent_tokenize() function uses.
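A quick sanity check that it really is the same model, by loading the pickle directly and comparing it against sent_tokenize() on the raw text:
>>> import nltk.data
>>> from nltk import sent_tokenize
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt.tokenize(webtext2.raw('wine.txt')) == sent_tokenize(webtext2.raw('wine.txt'))
True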
Voila!
Why doesn't words() do sentence tokenization first?
I think it's because it was originally built around the WordPunctTokenizer, which doesn't need the string to be sentence-tokenized first, whereas the TreebankWordTokenizer expects the string to be sentence-tokenized first.
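If you want a reader whose words() lines up with word_tokenize(), one option is to sentence-tokenize each line before word-tokenizing it. This is a sketch of my own (SentSplittingCorpusReader and webtext3 are hypothetical names, not part of NLTK), relying on the _sent_tokenizer and _word_tokenizer attributes that PlaintextCorpusReader sets in its __init__:
from nltk.corpus import LazyCorpusLoader
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import _treebank_word_tokenizer

class SentSplittingCorpusReader(PlaintextCorpusReader):
    """Sentence-tokenize each line before word-tokenizing it,
    mimicking what word_tokenize() does on the raw text."""
    def _read_word_block(self, stream):
        words = []
        for i in range(20):  # keep the parent's 20-lines-at-a-time convention
            for sent in self._sent_tokenizer.tokenize(stream.readline()):
                words.extend(self._word_tokenizer.tokenize(sent))
        return words

webtext3 = LazyCorpusLoader(
    'webtext', SentSplittingCorpusReader, r'(?!README|\.).*\.txt',
    encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)
This still won't match word_tokenize() exactly whenever a sentence runs across a line break, since each line is still read and tokenized in isolation, but it removes the "only the last period on each line gets detached" artefact.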
Why is it that in the age of "deep learning" and "machine learning" we are still using regex-based tokenizers, while everything else in NLP is largely built on top of these tokens?
I have no idea... But there are alternatives, e.g. http://gmb.let.rug.nl/elephant/about.php