Modify Python nltk.word_tokenize to exclude "#" as a delimiter

I am using Python's NLTK library to tokenize my sentences.

If my code is

text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the same way + isn't a delimiter and C++ therefore appears as a single token?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be considered as one token.

Margie answered 27/2, 2016 at 19:3 Comment(0)

Since this is a multi-word tokenization problem, another way would be to retokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer:

import nltk

tokens = nltk.word_tokenize("C# billion dollars; we don't own an ounce C++")
mwtokenizer = nltk.MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))  # matching is case-sensitive, so 'C' rather than 'c'
mwtokenizer.tokenize(tokens)
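With the tokens from the question's example sentence, this should return (expected output, assuming the case-sensitive MWE above):

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']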
Bora answered 31/12, 2018 at 20:24 Comment(1)
This is exactly what I was looking for. Thank you – Wareing

Another idea: instead of altering how text is tokenized, just loop over the tokens and join every '#' with the preceding one.

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset
    if t == '#' and i > 0:
        left = tokens[:i-1]
        joined = [tokens[i - 1] + t]
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
Zoltai answered 29/2, 2016 at 20:57 Comment(1)
A slight modification was needed: every time we encounter # we join two tokens, so the length of the token list shrinks. – Margie
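A simpler way to express the same merge, offered here only as an illustrative sketch (not part of the original answer), is to build a new list in one pass so no index bookkeeping is needed:

from nltk import word_tokenize

def merge_hash(tokens):
    # Join every '#' token onto the token that precedes it.
    merged = []
    for t in tokens:
        if t == '#' and merged:
            merged[-1] += t
        else:
            merged.append(t)
    return merged

print(merge_hash(word_tokenize("C# billion dollars; we don't own an ounce C++")))
# ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']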

NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.

Here's an example where the text is split on any whitespace character (tab, newline, etc.) and a couple of other symbols, just as an illustration:

>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
Zoltai answered 29/2, 2016 at 11:54 Comment(5)
Note that the sentence is not correctly tokenized; if you want it to be, you'll need a more complex expression. You can look at the source code for examples. – Zoltai
Thanks! Where can I find the source code for the function word_tokenize? I want to perform the same function as word_tokenize, but with # removed as one of the delimiters. – Margie
This is the folder where everything concerning tokenizers lives, and that function in particular is at line 94 of __init__.py. Unfortunately it does not use a regexp, but give me a moment and I'll find something better and edit this comment. – Zoltai
Check out the other (better) idea :) – Zoltai
Based on what the questioner wants to achieve, the "gaps=True" should be removed. – Divider
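For comparison, a token-matching pattern (without gaps=True) can get closer to the desired output. The pattern below is only an illustrative sketch, not from the original answer, and it does not split contractions the way word_tokenize does:

from nltk.tokenize import regexp_tokenize

txt = "C# billion dollars; we don't own an ounce C++"
# Match a word optionally followed by '#' or '++', or any single punctuation character.
pattern = r"\w+(?:#|\+\+)?|[^\w\s]"
print(regexp_tokenize(txt, pattern=pattern))
# ['C#', 'billion', 'dollars', ';', 'we', 'don', "'", 't', 'own', 'an', 'ounce', 'C++']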
