TL;DR
It seems to be a bug/feature of `TweetTokenizer()`, and we're unsure what motivates it. Read on to find out where the bug/feature occurs...
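First, the symptom, using the example string examined throughout this answer: out of the box, long runs of digits get chopped into 10-digit blocks:

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> tknzr.tokenize('the 231358523423423421162 of 3151942776...')
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']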
In Long
Looking at the `tokenize()` function in `TweetTokenizer`, before the actual tokenizing, the tokenizer does some preprocessing:

- First, it removes HTML entities from the text by converting them to their corresponding unicode characters through the `_replace_html_entities()` function.
- Optionally, it removes username handles using the `remove_handles()` function.
- Optionally, it normalizes word lengthening through the `reduce_lengthening()` function.
- Then, it shortens problematic sequences of characters using the `HANG_RE` regex.
- Lastly, the actual tokenization takes place through the `WORD_RE` regex.

After the `WORD_RE` regex, it optionally lowercases the tokenized output while preserving the case of emoticons (so that `:D` doesn't become `:d`).
In code:
def tokenize(self, text):
    """
    :param text: str
    :rtype: list(str)
    :return: a tokenized list of strings; concatenating this list returns\
    the original string if `preserve_case=False`
    """
    # Fix HTML character entities:
    text = _replace_html_entities(text)
    # Remove username handles
    if self.strip_handles:
        text = remove_handles(text)
    # Normalize word lengthening
    if self.reduce_len:
        text = reduce_lengthening(text)
    # Shorten problematic sequences of characters
    safe_text = HANG_RE.sub(r'\1\1\1', text)
    # Tokenize:
    words = WORD_RE.findall(safe_text)
    # Possibly alter the case, but avoid changing emoticons like :D into :d:
    if not self.preserve_case:
        words = list(map((lambda x: x if EMOTICON_RE.search(x) else
                          x.lower()), words))
    return words
By default, the handle stripping and length reduction don't kick in unless specified by the user:
class TweetTokenizer:
    r"""
    Tokenizer for tweets.

        >>> from nltk.tokenize import TweetTokenizer
        >>> tknzr = TweetTokenizer()
        >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
        >>> tknzr.tokenize(s0)
        ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

    Examples using the `strip_handles` and `reduce_len` parameters:

        >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
        >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
        >>> tknzr.tokenize(s1)
        [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
    """

    def __init__(self, preserve_case=True, reduce_len=False, strip_handles=False):
        self.preserve_case = preserve_case
        self.reduce_len = reduce_len
        self.strip_handles = strip_handles
Let's go through the steps and regexes:
>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'
Checked: `_replace_html_entities()` isn't the culprit.
By default, `remove_handles()` and `reduce_lengthening()` are skipped, but for sanity's sake, let's check:
>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'
>>> from nltk.tokenize.casual import remove_handles, reduce_lengthening
>>> remove_handles(_replace_html_entities(s))
u'the 231358523423423421162 of 3151942776...'
>>> reduce_lengthening(remove_handles(_replace_html_entities(s)))
u'the 231358523423423421162 of 3151942776...'
Checked too: neither of the optional functions is misbehaving.
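(When they do fire, they behave as documented; e.g., checking `reduce_lengthening` against the docstring example from earlier:)

>>> from nltk.tokenize.casual import reduce_lengthening
>>> reduce_lengthening('waaaaayyyy')  # the docstring example above
'waaayyy'

Next suspect: the `HANG_RE` regex: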
>>> import re
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> HANG_RE = re.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', s)
'the 231358523423423421162 of 3151942776...'
All clear! The `HANG_RE`'s name is cleared too.
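(For the record, `HANG_RE` only fires on a non-alphanumeric character repeated four or more times, so a pure digit run passes through untouched. A toy example of my own:)

>>> import re
>>> HANG_RE = re.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', 'Hello!!!!!!')  # six '!' shortened to three
'Hello!!!'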
>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> WORD_RE = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> WORD_RE.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']
Aha! That's where the splits appear!
Now let's look deeper into `WORD_RE`: it's built by joining `REGEXPS`, a tuple of regexes. The first is a massive URL pattern regex from https://gist.github.com/winzig/8894715
Let's go through them one by one:
>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> patt.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:1]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
[]
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']
Ah ha! It seems like the 2nd regex in `REGEXPS` is causing the problem!!
If we look at https://github.com/alvations/nltk/blob/develop/nltk/tokenize/casual.py#L122:
# The components of the tokenizer:
REGEXPS = (
    URLS,
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )""",
    # ASCII Emoticons
    EMOTICONS,
    # HTML tags:
    r"""<[^>\s]+>""",
    # ASCII Arrows
    r"""[\-]+>|<[\-]+""",
    # Twitter username:
    r"""(?:@[\w_]+)""",
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)""",
    # email addresses
    r"""[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]""",
    # Remaining word types:
    r"""
    (?:[^\W\d_](?:[^\W\d_]|['\-_])+[^\W\d_]) # Words with apostrophes or dashes.
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)            # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                               # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})                      # Ellipsis dots.
    |
    (?:\S)                                   # Everything else that isn't whitespace.
    """,
)
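Note that Python's regex alternation is first-match, not longest-match: at each position the alternatives are tried left to right, so the phone-number pattern (position 1 in the tuple) gets a crack at a digit run before the generic `[\w_]+` fallback at the end ever sees it. A toy illustration of mine:

>>> import re
>>> re.findall(r'(?:\d{10})|(?:\w+)', '231358523423423421162')
['2313585234', '2342342116', '2']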
The second regex in `REGEXPS` tries to parse numbers as phone numbers:
# Phone numbers:
r"""
(?:
  (?:            # (international)
    \+?[01]
    [\-\s.]*
  )?
  (?:            # (area code)
    [\(]?
    \d{3}
    [\-\s.\)]*
  )?
  \d{3}          # exchange
  [\-\s.]*
  \d{4}          # base
)"""
The pattern tries to recognize:

- optionally, an opening `+`/`0`/`1` matched as the international code,
- then 3 digits as the area code,
- optionally followed by a separator (dash, space, or dot),
- then 3 more digits, which are the (telecom) exchange code,
- another optional separator,
- and lastly, the 4-digit base number.
See https://regex101.com/r/BQpnsg/1 for a detailed explanation.
That's why it's trying to split contiguous digits up into 10-digit blocks!!
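To make the carving concrete, here's a copy of the phone-number pattern with capture groups added by me (the original uses only non-capturing groups), showing how a bare 10-digit window is read as area code + exchange + base:

>>> import re
>>> phone = re.compile(r"""
...     (?:\+?[01][\-\s.]*)?           # (international)
...     (?:[\(]?(\d{3})[\-\s.\)]*)?    # (area code)
...     (\d{3})                        # exchange
...     [\-\s.]*
...     (\d{4})                        # base
... """, re.VERBOSE)
>>> phone.match('2313585234').groups()
('231', '358', '5234')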
But note the quirks: since the phone number regex is hard-coded, it can catch real phone numbers written as `\d{3}-\d{3}-\d{4}` or `\d{10}`, but if the dashes fall in other positions, it won't work:
>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> s = '231-358-523423423421162'
>>> patt.findall(s)
['231-358-5234', '2342342116']
>>> s = '2313-58-523423423421162'
>>> patt.findall(s)
['5234234234']
Can we fix it?
See https://github.com/nltk/nltk/issues/1799
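In the meantime, a possible workaround (my own sketch, not an official fix) is to rebuild the word regex without the phone-number component, i.e. dropping `REGEXPS[1]`:

>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> no_phone = REGEXPS[:1] + REGEXPS[2:]  # drop the phone-number pattern
>>> patt = re.compile(r"""(%s)""" % "|".join(no_phone), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall('the 231358523423423421162 of 3151942776...')
['the', '231358523423423421162', 'of', '3151942776', '...']

One could patch `nltk.tokenize.casual.WORD_RE` with this recompiled pattern in the same spirit, though monkey-patching module internals is brittle across NLTK versions.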