I have 2 sentences in my dataset:
w1 = I am Pusheen the cat.I am so cute. # no space after period
w2 = I am Pusheen the cat. I am so cute. # with space after period
When I use the NLTK tokenizers (both word and sentence), NLTK cannot split 'cat.I' into 'cat', '.', 'I'.
Here is the word tokenization:
>>> nltk.word_tokenize(w1, 'english')
['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
>>> nltk.word_tokenize(w2, 'english')
['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']
and the sentence tokenization:
>>> nltk.sent_tokenize(w1, 'english')
['I am Pusheen the cat.I am so cute']
>>> nltk.sent_tokenize(w2, 'english')
['I am Pusheen the cat.', 'I am so cute']
How can I fix this, i.e. make NLTK tokenize w1 the same way as w2, given that in my dataset words and punctuation are sometimes stuck together?
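For context, a naive workaround would be to pre-process the text myself and insert a space after any period that is immediately followed by a letter, before calling the tokenizers (just a sketch with my own regex; it would also wrongly split abbreviations like 'U.S.A' or domain names that contain periods):
>>> import re
>>> fixed = re.sub(r'\.(?=[A-Za-z])', '. ', w1)  # add a space after '.' when a letter follows
>>> fixed
'I am Pusheen the cat. I am so cute.'
Since the pre-processed string is identical to w2, word_tokenize and sent_tokenize then behave as shown above for w2. But I would prefer the tokenizer itself to handle this, hence the question.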
Update: I tried Stanford CoreNLP 3.7.0; it also cannot split 'cat.I' into 'cat', '.', 'I' (sample.txt contains w1):
meow@meow-server:~/projects/stanfordcorenlp$ java edu.stanford.nlp.process.PTBTokenizer sample.txt
I
am
Pusheen
the
cat.I
am
so
cute
.
PTBTokenizer tokenized 9 tokens at 111.21 tokens per second.