NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space after the period (.)

I have 2 sentences in my dataset:

w1 = I am Pusheen the cat.I am so cute. # no space after period
w2 = I am Pusheen the cat. I am so cute. # with space after period

When I use the NLTK tokenizers (both word and sentence), NLTK cannot split cat.I into separate tokens.

Here is the word tokenizer output:

>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and the sentence tokenizer output:

>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']

How can I fix that? That is, how can I make NLTK tokenize w1 the same way as w2, given that words and punctuation are sometimes stuck together in my dataset?

Update: I tried Stanford CoreNLP 3.7.0, and it also cannot split 'cat.I' into 'cat', '.', 'I':

meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.
Ozzy answered 1/7, 2017 at 8:4 Comment(1)
Disregarding everything starting with '#', w1 is one sentence containing a word with a period in it (as, e.g., an email address would). So this is not a case of being unable to distinguish two sentences, as there is only one. - Wince

It's implemented this way on purpose: a period with no space after it usually doesn't signify the end of a sentence (think about the periods in phrases such as "version 4.3", "i.e.", "A.M.", etc.). If you have a corpus in which sentences commonly end with no space after the full stop, you'll have to preprocess the text with a regular expression or some such before sending it to NLTK.

A good rule of thumb might be that a lowercase letter followed by a period followed by an uppercase letter usually signifies the end of a sentence. To insert a space after the period in such cases, you could use a regular expression, e.g.

import re
w1 = re.sub(r'([a-z])\.([A-Z])', r'\1. \2', w1)

where

  • [a-z] matches any lowercase character
  • \. matches the full stop (the backslash escapes the dot, which would otherwise match any character)
  • [A-Z] matches any uppercase character
  • \1 is a reference to the first group in (parentheses)
  • \2 is a reference to the second group in (parentheses)
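
As a minimal sketch of that preprocessing step, the substitution can be wrapped in a small helper and checked against the two sentences from the question. The helper name add_space_after_period is made up for illustration; the pattern itself is the one shown above. It uses only the standard re module, so it runs without NLTK installed.

```python
import re

# Pattern from the answer: lowercase letter, literal period, uppercase letter.
# The helper name below is hypothetical, chosen only for this example.
_MISSING_SPACE = re.compile(r'([a-z])\.([A-Z])')


def add_space_after_period(text):
    """Insert a space after a period squeezed between a lowercase and an uppercase letter."""
    return _MISSING_SPACE.sub(r'\1. \2', text)


w1 = "I am Pusheen the cat.I am so cute."   # no space after period
w2 = "I am Pusheen the cat. I am so cute."  # with space after period

fixed = add_space_after_period(w1)
print(fixed == w2)

# Periods in cases like these are not touched by the pattern.
unchanged = "version 4.3, i.e. and A.M. stay as they are"
print(add_space_after_period(unchanged) == unchanged)
```

The cleaned string can then be passed to nltk.sent_tokenize or nltk.word_tokenize as usual. Note that this is only a heuristic: it would also insert a space into tokens such as "object.Method" or a mixed-case domain name, so it is worth spot-checking it against your own data.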
Midstream answered 1/7, 2017 at 11:2 Comment(0)
