Handling the \u200b (zero width space) character in text preprocessing for an NLP task

Asked 5/12, 2017 at 8:46 Answered 9/12, 2019 at 1:38

Solved python nlp removing-whitespace spacy

I'm preprocessing some text for a NER model I'm training, and I keep encountering the \u200b (zero width space) character. It is not removed by strip():

>>> 'Hello world!\u200b'.strip()
'Hello world!\u200b'

It is not considered whitespace by regular expressions:

>>> import re
>>> re.sub(r'\s+', ' ', "hello\u200bworld!")
'hello\u200bworld!'

and spaCy's tokenizer does not split tokens upon it:

>>> [t.text for t in nlp("hello\u200bworld!")]
['hello\u200bworld', '!']

So, how should I handle it? I could simply replace it, but I'd rather not special-case this one character. I'd like to replace all characters with similar characteristics.

Thanks.

Hygroscopic answered 5/12, 2017 at 8:46 Comment(2)

The character's definition says it's specifically not a space or whitespace character. fileformat.info/info/unicode/char/200B/index.htm If people are using it incorrectly, it's not really well-defined exactly what to do with it. That's NLP for you ... – Chara 5/12, 2017 at 9:0

That is correct, but some of the text I'm preprocessing is extracted from a PDF using Apache Tika. – Hygroscopic 5/12, 2017 at 9:4

As you mentioned, characters like \u200b (zero-width space) and \u200c (zero-width non-joiner) are not considered whitespace characters, so you cannot remove them with techniques meant for whitespace. The only way, as you may have noticed, is to treat such characters as a special case.
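
One way to make that special case cover a whole class of characters, rather than a single code point, might be to key on the Unicode general category. Python's unicodedata module reports both \u200b and \u200c as category Cf ("format"). The sketch below replaces every Cf character. The function name replace_format_chars and the default replacement are assumptions made for illustration, and whether a space or an empty string is the right replacement depends on the corpus.

```python
import unicodedata


def replace_format_chars(text, replacement=" "):
    """Replace every character whose Unicode general category is Cf.

    Hypothetical helper: pass replacement="" to delete the characters
    instead of turning them into spaces.
    """
    return "".join(
        replacement if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


print(unicodedata.category("\u200b"))  # Cf
cleaned = replace_format_chars("hello\u200bworld!\u200c")
# Collapse the resulting runs of whitespace and trim the ends.
print(" ".join(cleaned.split()))  # hello world!
```

Note that Cf also contains the zero-width joiner used inside emoji sequences and the bidirectional control marks, and that the zero-width non-joiner is meaningful in some scripts, so a blanket replacement may be too aggressive for multilingual text.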

Signalment answered 21/7, 2019 at 12:34 Comment(0)

How about simply doing a string replace before running the NLP pipeline?

'Hello world!\u200b'.replace('\u200b', ' ').strip()
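
If more than one invisible character turns up, the same idea can be extended with a translation table, so that each new character is one more table entry instead of another chained replace() call. This is only a sketch: the set of characters in ZERO_WIDTH is hand-picked and hypothetical, and the choice of which ones become a space and which are dropped is an assumption, not something the question specifies.

```python
import re

# Hypothetical, hand-picked mapping; extend it for your own data.
ZERO_WIDTH = {
    "\u200b": " ",  # zero width space -> ordinary space
    "\u200c": "",   # zero width non-joiner -> removed
    "\u200d": "",   # zero width joiner -> removed
    "\u2060": "",   # word joiner -> removed
    "\ufeff": "",   # zero width no-break space / BOM -> removed
}
ZERO_WIDTH_TABLE = str.maketrans(ZERO_WIDTH)


def clean(text):
    """Apply the table, then collapse whitespace runs and trim the ends."""
    text = text.translate(ZERO_WIDTH_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(clean("Hello\u200bworld!\u200b"))  # Hello world!
```

Running this before the text reaches the tokenizer means that "hello\u200bworld!" arrives as "hello world!" and should split on the space like any other input.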

Angevin answered 9/12, 2019 at 1:38 Comment(1)

As I wrote in the question: I can simply replace it, however, I don't want to make a special case for this character, but rather replace all characters with similar characteristics – Hygroscopic 24/12, 2019 at 13:13
