NLTK words vs word_tokenize

I'm exploring some of NLTK's corpora and came across the following behaviour: word_tokenize() and words() produce different sets of words.

Here is an example using webtext:

from nltk import word_tokenize
from nltk.corpus import webtext

When I run the following,

len(set(word_tokenize(webtext.raw('wine.txt'))))

I get: 3488

When I run the following,

len(set(webtext.words('wine.txt')))

I get: 3414

All I can find in the documentation is that word_tokenize() returns a list of words and punctuation, but it says the same thing about words(). I'm wondering what's going on here. Why are they different?

I've already tried looking at the set differences.

U = set(word_tokenize(webtext.raw('wine.txt')))
V = set(webtext.words('wine.txt'))

tok_not_in_words = U.difference(V) # in tokenize but not in words
words_not_in_tok = V.difference(U) # in words but not in tokenize

All I can see is that word_tokenize keeps hyphenated words together, while words splits the hyphenated words apart.
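
A small follow-up sketch (reusing the variables from the snippet above) makes that hyphen effect easier to eyeball; it is illustrative only, not an exhaustive comparison:

# Hyphenated tokens that appear only on the word_tokenize side:
print(sorted(t for t in tok_not_in_words if '-' in t)[:10])
# Tokens that appear only on the words() side:
print(sorted(words_not_in_tok)[:10])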

Any help is appreciated. Thank you!

Superbomb answered 27/10, 2017 at 0:5 Comment(1)
Very good question!! I'll get to the answer when I'm free later if no one else has given an answer. – Tien

First, let's count the tokens from both approaches and look at the most common words:

>>> import nltk
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> from nltk.corpus import webtext

>>> counts_from_wordtok = Counter(word_tokenize(webtext.raw('wine.txt')))
>>> counts_from_wordtok.most_common(10)
[(u'.', 2824), (u',', 1550), (u'a', 821), (u'and', 786), (u'the', 706), (u'***', 608), (u'-', 518), (u'of', 482), (u'but', 474), (u'I', 390)]

>>> counts_from_words = Counter(webtext.words('wine.txt'))
>>> counts_from_words.most_common(10)
[(u'.', 2772), (u',', 1536), (u'-', 832), (u'a', 821), (u'and', 787), (u'the', 706), (u'***', 498), (u'of', 482), (u'but', 474), (u'I', 392)]


>>> len(word_tokenize(webtext.raw('wine.txt')))
31140
>>> len(webtext.words('wine.txt'))
31350
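
The Counters already hint at where the divergence is: the '-' token is far more frequent on the words() side (832 vs 518), while '.' and '***' are more frequent on the word_tokenize() side. Counter subtraction keeps only the positive differences, so a quick (illustrative) diagnostic is:

# Tokens that words() produces more often than word_tokenize():
print((counts_from_words - counts_from_wordtok).most_common(5))
# ... and the other way around:
print((counts_from_wordtok - counts_from_words).most_common(5))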

Something smells fishy...

Let's take a closer look at how the webtext interface comes about. It uses LazyCorpusLoader, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L235:

webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')

If we look at how PlaintextCorpusReader loads and tokenizes the corpus, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L41:

class PlaintextCorpusReader(CorpusReader):
    CorpusView = StreamBackedCorpusView

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):

Ah ha! It's using the WordPunctTokenizer instead of the modified TreebankWordTokenizer that word_tokenize() uses by default.

The WordPunctTokenizer is a simplistic regex-based tokenizer, found at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L171
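
A quick comparison on a made-up string (not from the corpus) shows how differently the two behave on contractions and hyphens; this is a sketch, and it assumes the punkt models that word_tokenize() relies on are installed:

from nltk import word_tokenize
from nltk.tokenize import WordPunctTokenizer

text = "Don't split class-action"   # made-up example
print(WordPunctTokenizer().tokenize(text))
# ['Don', "'", 't', 'split', 'class', '-', 'action']
print(word_tokenize(text))
# ['Do', "n't", 'split', 'class-action']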

The word_tokenize() function wraps a modified TreebankWordTokenizer that is unique to NLTK: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L97

If we look at what webtext.words() calls, we follow https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L81

def words(self, fileids=None):
    """
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
    """
    return concat([self.CorpusView(path, self._read_word_block, encoding=enc)
                   for (path, enc, fileid)
                   in self.abspaths(fileids, True, True)])

to reach _read_word_block() at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119 :

def _read_word_block(self, stream):
    words = []
    for i in range(20): # Read 20 lines at a time.
        words.extend(self._word_tokenizer.tokenize(stream.readline()))
    return words

It's reading the file line by line!

So if we load the webtext corpus and use the WordPunctTokenizer we get the same number:

>>> from nltk.corpus import webtext
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> len(wpt.tokenize(webtext.raw('wine.txt')))
31350
>>> len(webtext.words('wine.txt'))
31350
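
To convince ourselves that the line-by-line reading in _read_word_block() is harmless here, we can replicate it by hand (a sketch reusing wpt from the session above). Because WordPunctTokenizer is purely regex-based, the line boundaries don't change the result:

# Tokenize the raw text one line at a time, the way the corpus reader streams it.
raw = webtext.raw('wine.txt')
line_by_line = [tok for line in raw.splitlines()
                for tok in wpt.tokenize(line)]
print(len(line_by_line))   # expected to match len(webtext.words('wine.txt')), i.e. 31350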

More mysteries

You can also create a new webtext corpus object by specifying the tokenizer object, e.g.:

>>> from nltk.tokenize import _treebank_word_tokenizer
>>> from nltk.corpus import LazyCorpusLoader, PlaintextCorpusReader
>>> from nltk.corpus import webtext

# LazyCorpusLoader expects a tokenizer object,
# but word_tokenize() is a function, so we have to 
# import the tokenizer object that word_tokenize wraps around
>>> webtext2 = LazyCorpusLoader('webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)

>>> len(webtext2.words('wine.txt'))
28385

>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140


>>> list(webtext2.words('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine.', u'Polished', u'leather', u'and', u'strawberries.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now.', u'***', u'Liquorice', u',', u'cherry', u'fruit.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish.', u'**', u'Thin', u'and', u'completely', u'uninspiring.', u'*', u'Rough.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great', u'complexity.', u'A', u'good', u'***', u'Charming', u',', u'violet-fragranced', u'nose.']

>>> word_tokenize(webtext2.raw('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine', u'.', u'Polished', u'leather', u'and', u'strawberries', u'.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now', u'.', u'***', u'Liquorice', u',', u'cherry', u'fruit', u'.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish', u'.', u'**', u'Thin', u'and', u'completely', u'uninspiring', u'.', u'*', u'Rough', u'.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch', u'.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great']

That's because word_tokenize() does a sent_tokenize before actually tokenizing sentences into words: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L113
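
In other words, in the NLTK version the answer links to, word_tokenize() is roughly sent_tokenize() followed by the Treebank tokenizer on each sentence. A sketch of that decomposition (reusing the internal _treebank_word_tokenizer object imported above, which is version-dependent):

from nltk import sent_tokenize, word_tokenize
from nltk.corpus import webtext
from nltk.tokenize import _treebank_word_tokenizer   # internal object, version-dependent

raw = webtext.raw('wine.txt')
manual = [tok for sent in sent_tokenize(raw)
          for tok in _treebank_word_tokenizer.tokenize(sent)]
print(len(manual))              # should reproduce the word_tokenize() count
print(len(word_tokenize(raw)))  # 31140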

But PlaintextCorpusReader._read_word_block() doesn't do sent_tokenize beforehand: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119

Let's do a recount with sentence tokenization first:

>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140

>>> sum(len(tokenized_sent) for tokenized_sent in webtext2.sents('wine.txt'))
31140

Note: the sent_tokenizer of PlaintextCorpusReader is nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), which is the same object that the nltk.sent_tokenize() function uses.

Voila!

Why is it that words() doesn't do sentence tokenization first?

I think it's because it originally used the WordPunctTokenizer, which doesn't need the string to be sentence-tokenized first, whereas the TreebankWordTokenizer expects its input to have been split into sentences already.
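
A small illustration of that point, on a made-up two-sentence string modeled on the excerpts above (a sketch; sent_tokenize assumes the punkt model is available):

from nltk import sent_tokenize
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

text = "Polished leather and strawberries. Perhaps a bit dilute."
tbt, wpt = TreebankWordTokenizer(), WordPunctTokenizer()

# The Treebank tokenizer only splits off the period at the very end of its input,
# so the mid-string "strawberries." stays fused unless we sentence-split first:
print(tbt.tokenize(text))
print([tok for sent in sent_tokenize(text) for tok in tbt.tokenize(sent)])

# The regex-based tokenizer gives the same tokens either way:
assert wpt.tokenize(text) == [tok for sent in sent_tokenize(text)
                              for tok in wpt.tokenize(sent)]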

Why is it that in the age of "deep learning" and "machine learning" we are still using regex-based tokenizers, with everything else in NLP largely built on top of these tokens?

I have no idea... But there are alternatives, e.g. http://gmb.let.rug.nl/elephant/about.php

Tien answered 15/11, 2017 at 8:47 Comment(0)
