I am using Python's NLTK library to tokenize my sentences.
If my code is
import nltk
text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))
I get this as my output
['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
The symbols ; , . and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + isn't a delimiter and thus C++ appears as a single token?
I want my output to be
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
I want C# to be considered as one token.
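One possible workaround, sketched here under assumptions rather than as a change to NLTK's delimiter set, is to leave the tokenizer alone and merge the split pieces afterwards. The helper name `merge_pairs` and the `pairs` argument are hypothetical, and the token list is hard-coded from the output shown above so the snippet runs without NLTK.

```python
def merge_pairs(tokens, pairs):
    """Join adjacent tokens whose (left, right) pair appears in `pairs`."""
    merged = []
    i = 0
    while i < len(tokens):
        # Look ahead one token and merge if the pair is registered.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Token list copied from the word_tokenize output in the question.
tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't",
          'own', 'an', 'ounce', 'C++']

# Hypothetical pair set: only 'C' directly followed by '#' is joined.
result = merge_pairs(tokens, {('C', '#')})
print(result)
```

Each merge shortens the list by one token, so any code that indexes into the original token list would need to account for that. NLTK also ships `nltk.tokenize.MWETokenizer`, which can re-join listed token sequences in a similar way; whether it suits this case better is untested here.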
#, the length of tokens reduces because we join 2 tokens. – Margie