I am using spaCy's NLP model to work out the POS of input data so that my Markov chains can be a bit more grammatically correct, as with the example in the Python markovify library found here. However, the way that spaCy splits tokens makes reconstruction difficult, because certain grammatical elements are also split up: for example, "don't" becomes `["do", "n't"]`. This means that you can no longer rejoin generated Markov chains simply with spaces; you need to know whether adjacent tokens make up one word.
I assumed that the `is_left_punct` and `is_right_punct` properties of tokens might relate to this, but they don't seem to. My current code simply accounts for `PUNCT` tokens, but the `do n't` problem persists.
Is there a property of the tokens that I can use to tell the sentence-joining method when to omit spaces, or some other way to know this?
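For reference, spaCy tokens do carry their trailing whitespace. A minimal sketch of the round trip, assuming spaCy is installed (a blank English pipeline is used here so no model download is needed; the tokenizer exceptions that split contractions still apply):

```python
import spacy

# A blank English pipeline still splits contractions like "don't"
# into ["do", "n't"] via the tokenizer exceptions.
nlp = spacy.blank("en")
doc = nlp("I don't know.")

print([t.text for t in doc])         # ['I', 'do', "n't", 'know', '.']
print([t.whitespace_ for t in doc])  # trailing whitespace per token ('' or ' ')

# token.text_with_ws is token.text + token.whitespace_, so joining
# these reconstructs the original string exactly, with no extra spaces.
rebuilt = "".join(t.text_with_ws for t in doc)
print(rebuilt)                       # I don't know.
```

Because `whitespace_` is empty between `do` and `n't`, joining on `text_with_ws` omits the space exactly where the tokenizer split a single word.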
After running `nlp` on a sentence, I shift all the `whitespace_` attributes to the left for generating markov chains. Thanks again. – Amoretto