Remove zero width space unicode character from Python string
D

6

31

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the

\u200c

part from the string?

Doityourself answered 11/9, 2017 at 11:24 Comment(3)
s.encode('utf-8') - Bolding
@Vinny the returned string is \xe2\x80\x8cHealth & Fitness - Doityourself
My bad, the encoding should be ascii, as Arount answered below. - Bolding
M
54

You can encode it into ascii and ignore errors:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

Output:

'Health & Fitness'
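The output above matches Python 2. In Python 3 the same call returns a bytes object rather than text, so a decode step is needed to get a str back. A minimal sketch, assuming Python 3 (the variable names are illustrative, not from the answer):

```python
original = '\u200cHealth & Fitness'

# encode() returns bytes in Python 3; characters outside ASCII are dropped.
as_bytes = original.encode('ascii', 'ignore')

# Decode with the same codec to get a str again.
cleaned = as_bytes.decode('ascii')

print(repr(as_bytes))  # b'Health & Fitness'
print(repr(cleaned))   # 'Health & Fitness'
```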
Monolingual answered 11/9, 2017 at 11:29 Comment(1)
This works in the example above, but it forces the string into ASCII and loses every other Unicode character, so it is not a solution that works for everyone. - Iluminadailwain
U
33

If you have a string that contains a Unicode character, such as

s = "Airports Council International \u2013 North America"

then you can try:

newString = (s.encode('ascii', 'ignore')).decode("utf-8")

and the output will be (with two spaces left where the en dash was):

Airports Council International  North America
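One thing to keep in mind is that this approach drops every non-ASCII character, not only the unwanted one. A small sketch, assuming Python 3 and a made-up sample string, comparing it with a targeted replace:

```python
s = "Café \u2013 Zürich"  # hypothetical sample with accents and an en dash

# Encoding to ASCII with errors='ignore' drops every non-ASCII character,
# including the accented letters.
ascii_only = s.encode('ascii', 'ignore').decode('ascii')

# Removing only the unwanted character leaves the rest untouched.
targeted = s.replace('\u2013', '')

print(repr(ascii_only))  # 'Caf  Zrich'
print(repr(targeted))    # 'Café  Zürich'
```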

Upvote if helps :)

Unreflective answered 21/2, 2018 at 7:47 Comment(2)
Shouldn't we decode with 'ascii' after encoding to ascii? - Gibrian
If you have a list of strings, you can adapt this as a list comprehension: list_text_fixed = [(s.encode('ascii', 'ignore')).decode("utf-8") for s in list_text] - Trager
K
26

I just use replace to drop the character, because I don't need it:

varstring.replace('\u200c', '')

Or in your case:

u'\u200cHealth & Fitness'.replace('\u200c', '')
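If several invisible characters can show up, the same idea extends to str.translate. This is only a sketch, assuming Python 3; the set of code points and the helper name strip_zero_width are illustrative choices, not something from the answer:

```python
# Hypothetical selection of zero-width code points; adjust to your data.
ZERO_WIDTH = {
    '\u200b': None,  # ZERO WIDTH SPACE
    '\u200c': None,  # ZERO WIDTH NON-JOINER
    '\u200d': None,  # ZERO WIDTH JOINER
    '\ufeff': None,  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}
# Mapping a character to None tells translate() to delete it.
TABLE = str.maketrans(ZERO_WIDTH)


def strip_zero_width(text):
    """Return text with the listed zero-width characters removed."""
    return text.translate(TABLE)


print(repr(strip_zero_width('\u200cHealth\u200b & Fitness')))  # 'Health & Fitness'
```

Other Unicode characters are left alone. Note that the joiner and non-joiner carry meaning in some scripts and in emoji sequences, so removing them is not always safe.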
Kearns answered 28/3, 2019 at 15:6 Comment(3)
This is actually better than the accepted answer for most strings. \u200c is a zero-width non-joiner, an invisible format character that strip() does not remove. In most cases with Unicode strings you do not want encode('ascii', 'ignore'). - Marc
This is a more general solution, since encoding to ASCII may remove other Unicode characters as well. - Karakalpak
Appreciate this! - Diclinous
S
5

For me, the following worked:

mystring.encode('ascii', 'ignore').decode('unicode_escape')
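One possible reason this worked, offered as a guess rather than a confirmed explanation: if the input also held literal backslash escapes (the six characters \u00e9 rather than the character itself), the 'unicode_escape' codec would turn them into real characters after the ASCII step removed the actual zero-width character. A sketch with a hypothetical input, assuming Python 3:

```python
# Hypothetical input: a real ZWNJ followed by a literal backslash escape.
mystring = '\u200ccaf\\u00e9'

# Step 1: drop non-ASCII characters (the real ZWNJ). The literal
# backslash sequence is plain ASCII, so it survives.
ascii_bytes = mystring.encode('ascii', 'ignore')

# Step 2: 'unicode_escape' interprets backslash sequences in the bytes.
result = ascii_bytes.decode('unicode_escape')

print(repr(ascii_bytes))  # b'caf\\u00e9'
print(repr(result))       # 'café'
```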
Stocktonontees answered 11/12, 2018 at 10:41 Comment(2)
You could improve your answer by explaining why this code works, and what you're doing here. That way, others can be educated. - Eastman
To be honest, that was a 'Frankenstein' version of all the answers I had previously found but which didn't work. I can't really explain why this one worked over the rest of the solutions in my case. - Stocktonontees
S
2

In the specific case in the question, where the string is prefixed with a single u'\u200c' character, the solution is as simple as taking a slice that does not include the first character.

original = u'\u200cHealth & Fitness'
fixed = original[1:]

If the leading character may or may not be present, str.lstrip may be used:

original = u'\u200cHealth & Fitness'
fixed = original.lstrip(u'\u200c')

The same solutions will work in Python 3. From Python 3.9, str.removeprefix is also available:

original = u'\u200cHealth & Fitness'
fixed = original.removeprefix(u'\u200c')
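The three options behave differently when the prefix is repeated or missing. A short comparison, assuming Python 3.9 or later and made-up inputs:

```python
original = '\u200c\u200cHealth & Fitness'  # two leading ZWNJ characters
plain = 'Health & Fitness'                 # no prefix at all

# Slicing drops exactly one character, whatever it is.
sliced = original[1:]
sliced_plain = plain[1:]

# lstrip drops every leading occurrence of the given characters.
stripped = original.lstrip('\u200c')

# removeprefix (Python 3.9+) drops at most one occurrence of the prefix.
unprefixed = original.removeprefix('\u200c')

print(repr(sliced_plain))  # 'ealth & Fitness'
print(repr(stripped))      # 'Health & Fitness'
```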
Siouan answered 12/1, 2021 at 17:50 Comment(0)
A
0

If the text is English only, do it this way:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

But if it contains other scripts, such as Arabic or Persian, do it this way:

s = s.replace('\\', '').replace('u200c', '')
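Note that this chain of replace calls targets the literal text (a backslash followed by u200c), not the invisible character itself, so which call helps depends on what the string actually contains. A sketch of the difference, assuming Python 3 and illustrative sample strings:

```python
real = 'a\u200cb'      # 3 characters: 'a', the actual ZWNJ, 'b'
literal = 'a\\u200cb'  # 8 characters: 'a', backslash, 'u200c', 'b'

# The chained replace only affects the literal escape text.
chained_on_literal = literal.replace('\\', '').replace('u200c', '')
chained_on_real = real.replace('\\', '').replace('u200c', '')

# Replacing the character itself only affects the real ZWNJ.
char_on_real = real.replace('\u200c', '')

print(len(real), len(literal))   # 3 8
print(repr(chained_on_literal))  # 'ab'
print(repr(char_on_real))        # 'ab'
```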

If you're going to write a text file:

import codecs

with codecs.open('text_file.txt', 'w', encoding='utf-8') as text_file:
    for line in array_string:
        # Prepend a zero-width non-joiner to each line before writing it.
        text_file.write('\u200c' + line + '\n')
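If the goal is instead to strip the character while writing, a variant along these lines may be closer to what the question asks. It is a sketch, assuming Python 3, where the built-in open accepts an encoding argument; the sample data and the temporary path are placeholders:

```python
import os
import tempfile

array_string = ['\u200cfirst line', 'second line']  # hypothetical sample data

# Write into a throwaway directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'text_file.txt')

with open(path, 'w', encoding='utf-8') as text_file:
    for line in array_string:
        # Remove the ZWNJ from each line before writing it out.
        text_file.write(line.replace('\u200c', '') + '\n')

# Read the file back to check what was written.
with open(path, encoding='utf-8') as text_file:
    written = text_file.read()

print(repr(written))  # 'first line\nsecond line\n'
```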
Amphictyony answered 17/6 at 9:7 Comment(3)
Wouldn't it be wrong to remove it from Persian, where the orthography requires it? - Apparel
@Andj, I checked, and it kept the structure well: ‌ﺍﻭﻟﻮﯾﺖ\u200cﻫﺎﯼ ﭼﺎﭖ to ﺍﻭﻟﻮﯾﺖﻫﺎﯼ ﭼﺎﭖ , as you can see, the half-space is kept! - Amphictyony
Except that your comment is using presentation forms. - Apparel
