How to know where to join by space in spaCy NLP output

I am using spaCy's NLP model to work out the POS of input data so that my Markov chains can be a bit more grammatically correct, as in the example from the Python markovify library. However, the way that spaCy splits tokens makes it difficult to reconstruct them, because certain grammatical elements are also split up. For example, "don't" becomes ["do", "n't"]. This means that you can no longer rejoin generated Markov chains simply by space; you need to know whether adjacent tokens make up one word.

I assumed that the is_left_punct and is_right_punct properties of tokens might relate to this, but they don't seem to. My current code simply accounts for PUNCT tokens, but the "do" / "n't" problem persists.

Is there a property of the tokens that I can use to tell the method that joins sentences together when to omit spaces, or is there some other way to know this?

Amoretto answered 3/4, 2019 at 16:55 Comment(0)

spaCy tokens have a whitespace_ attribute, which is always set.

It holds the whitespace that followed the token in the original text, or the empty string when there was none, so you can always rely on it when rejoining.

The empty string occurs in cases like the one you mentioned, where tokenisation splits a continuous string.

So for "don't", the whitespace_ of the "do" token will be the empty string.

For example,

[bool(token.whitespace_) for token in nlp("don't")]

should produce

[False, False]
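
A minimal sketch of how this could be used when rejoining, assuming each token is stored together with its trailing whitespace. The helper names pairs_from_doc and join_pairs are made up for illustration, and the pairs list is written by hand to stand in for what spaCy would give for "I don't know." so that the snippet runs without a model installed:

```python
def pairs_from_doc(doc):
    """Turn a spaCy Doc into (text, trailing_whitespace) pairs.

    Uses only token.text and token.whitespace_; not called below,
    so the sketch runs without spaCy.
    """
    return [(token.text, token.whitespace_) for token in doc]


def join_pairs(pairs):
    """Rebuild a string by appending each token's own trailing whitespace."""
    return "".join(text + ws for text, ws in pairs).strip()


# Hand-written stand-in for pairs_from_doc(nlp("I don't know."))
pairs = [("I", " "), ("do", ""), ("n't", " "), ("know", ""), (".", "")]

print(join_pairs(pairs))  # I don't know.
```

spaCy also exposes token.text_with_ws, which is the token text plus its trailing whitespace, so joining those directly gives the same result for an unmodified Doc.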
Foolery answered 3/4, 2019 at 17:47 Comment(4)
Thanks very much. Just a note for whoever uses this: it's more useful to know whether a space goes before the word rather than after, so after I run nlp on a sentence I shift all the whitespace_ attributes to the left for generating Markov chains. Thanks again. - Amoretto
@Amoretto, I'm struggling with implementing your method. Would you mind posting a simple example (assuming spaCy/markovify)? - Cullet
I'll try to help. Can you explain exactly what you want? - Foolery
I'm doing this for training: str(bool(word.whitespace_)) and this for generation: if whitespace == "True": sentence += f"{word} ". Does this make sense? Does the whitespace property need to be converted to a string to train the Markov model? @NathanMcCoy - Cullet
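
The shift described in the comments could be sketched roughly as follows. This is an illustration under assumptions, not the commenter's actual code: shift_whitespace and join_shifted are hypothetical names, and the input pairs are written by hand to stand in for (token.text, token.whitespace_) taken from spaCy for "I don't know.":

```python
def shift_whitespace(pairs):
    """Convert (text, trailing_ws) pairs into (text, space_before) pairs."""
    shifted = []
    space_before = False  # nothing precedes the first token
    for text, trailing_ws in pairs:
        shifted.append((text, space_before))
        # This token's trailing whitespace becomes the next token's flag.
        space_before = bool(trailing_ws)
    return shifted


def join_shifted(states):
    """Join generated (text, space_before) states back into a string."""
    parts = []
    for text, space_before in states:
        if space_before and parts:  # never emit a leading space
            parts.append(" ")
        parts.append(text)
    return "".join(parts)


# Hand-written stand-in for [(t.text, t.whitespace_) for t in nlp("I don't know.")]
pairs = [("I", " "), ("do", ""), ("n't", " "), ("know", ""), (".", "")]
states = shift_whitespace(pairs)

print(join_shifted(states))  # I don't know.
```

In this sketch the flag stays a bool inside a tuple, so no string conversion is involved. Whether a given Markov chain library accepts tuple states, or needs them encoded as strings, is something to check against that library's documentation.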
