get indices of original text from nltk word_tokenize

I am tokenizing a text using nltk.word_tokenize and I would also like to get the index in the original raw text of the first character of every token, i.e.

import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
# tokens == ['hello', 'world']

How can I also get the array [0, 6] corresponding to the raw indices at which each token starts?

Gismo asked 28/7, 2015 at 6:05

I think what you are looking for is the span_tokenize() method. Apparently it is not supported by the default tokenizer. Here is a code example with another tokenizer.

from nltk.tokenize import WhitespaceTokenizer

s = "Good muffins cost $3.88\nin New York."
# span_tokenize yields a (start, end) character offset pair per token.
spans = list(WhitespaceTokenizer().span_tokenize(s))
print(spans)

Which gives:

[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]

To get just the start offsets:

offsets = [span[0] for span in spans]
print(offsets)
# [0, 5, 13, 18, 24, 27, 31]

For further information on the different tokenizers available, see the tokenize API docs.
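
As an aside, newer NLTK releases also implement span_tokenize on TreebankWordTokenizer, the Treebank-style tokenizer that word_tokenize has historically used; a minimal sketch, assuming a sufficiently recent NLTK version:

from nltk.tokenize import TreebankWordTokenizer

s = "Good muffins cost $3.88\nin New York."
# On recent NLTK versions TreebankWordTokenizer provides span_tokenize,
# so (start, end) offsets come from the same tokenizer family
# that word_tokenize is built on.
spans = list(TreebankWordTokenizer().span_tokenize(s))
print(spans)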

Gallonage answered 28/7, 2015 at 9:55

Comment: I added a span_tokenize method to the TreebankWordTokenizer here: gist.github.com/ckoppelman/c93e4192d9f189fba590e095258f8f33. Any help or advice is appreciated. — Skyeskyhigh

You can also do this:

import nltk

def spans(txt):
    tokens = nltk.word_tokenize(txt)
    offset = 0
    for token in tokens:
        # Find each token in the raw text, starting the search
        # where the previous token ended.
        offset = txt.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)


s = "And now for something completely different."
for token in spans(s):
    print(token)
    assert token[0] == s[token[1]:token[2]]

Which prints:

('And', 0, 3)
('now', 4, 7)
('for', 8, 11)
('something', 12, 21)
('completely', 22, 32)
('different', 33, 42)
('.', 42, 43)
Takamatsu answered 20/11, 2016 at 4:47
Comment: That won't work as written. The word_tokenize function may replace a token's text with something else, for example " (a double quote) with `` (two backticks), so the call txt.find(token, offset) returns -1. — Hyperbaric
Comment: Replace nltk.word_tokenize(txt) with a customized word_tokenize(txt) and it should work: def word_tokenize(tokens): return [token.replace("''", '"').replace("``", '"') for token in nltk.word_tokenize(tokens)] — Hiroshige
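
A runnable sketch of that workaround, for reference; it assumes the only rewrites nltk.word_tokenize performs are the Treebank quote conversions, which may not hold for every input:

import nltk

def word_tokenize(text):
    # Undo the Treebank quote substitutions (" becomes `` or '')
    # so that txt.find(token, offset) can locate each token verbatim.
    return [token.replace("''", '"').replace("``", '"')
            for token in nltk.word_tokenize(text)]

Calling this wrapper instead of nltk.word_tokenize inside spans() above keeps the offset search from returning -1 on quoted text.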

The pytokenizations package has a useful function, get_original_spans, for getting the spans:

# $ pip install pytokenizations
import tokenizations

tokens = ["hello", "world"]
text = "Hello world"
print(tokenizations.get_original_spans(tokens, text))
# [(0, 5), (6, 11)]

This function can handle noisy texts:

tokens = ["a", "bc"]
original_text = "å\n \tBC"
tokenizations.get_original_spans(tokens, original_text)
>>> [(0,1), (4,6)]

See the documentation for other useful functions.
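
Tying this back to the question, a short sketch (again assuming pytokenizations is installed) that aligns nltk.word_tokenize output against the raw text:

import nltk
import tokenizations

text = 'hello world'
tokens = nltk.word_tokenize(text)
spans = tokenizations.get_original_spans(tokens, text)
print([start for start, end in spans])
# [0, 6]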

Treble answered 28/5, 2020 at 6:34
