Remove zero width space unicode character from Python string

6

31

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the \u200c part from the string?

Doityourself answered 11/9, 2017 at 11:24 Comment(3)

s.encode('utf-8') – Bolding 11/9, 2017 at 11:26

@Vinny the return string is \xe2\x80\x8cHealth & Fitness – Doityourself 11/9, 2017 at 11:27

my bad, the encoding should be ascii as Arount answered below – Bolding 11/9, 2017 at 11:48

54

You can encode it into ascii and ignore errors:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

Output (Python 2; in Python 3 the result is the bytes object b'Health & Fitness'):

'Health & Fitness'

Monolingual answered 11/9, 2017 at 11:29 Comment(1)

This works in the above example, but you are forcing the string into ASCII and losing all non-ASCII characters, so it is not a solution that works in every case – Iluminadailwain 28/7, 2019 at 14:5
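To make the caveat in the comment above concrete, here is a small sketch (Python 3). The helper name ascii_only and the second sample string are made up for illustration. It shows that encoding to ASCII with errors ignored drops every non-ASCII character, not only the zero-width one:

```python
def ascii_only(text):
    # encode() returns bytes in Python 3, so decode back to str afterwards
    return text.encode('ascii', 'ignore').decode('ascii')

# The zero-width non-joiner is removed, as intended
print(repr(ascii_only('\u200cHealth & Fitness')))      # 'Health & Fitness'

# ...but accented letters are silently removed too
print(repr(ascii_only('Caf\u00e9 \u200cM\u00fcnchen')))  # 'Caf Mnchen'
```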

33

If you have a string that contains a non-ASCII Unicode character, like

s = "Airports Council International \u2013 North America"

then you can try:

newString = (s.encode('ascii', 'ignore')).decode("utf-8")

and the output will be (two spaces remain where the dash was):

Airports Council International  North America

Upvote if helps :)

Unreflective answered 21/2, 2018 at 7:47 Comment(2)

Shouldn't we decode with 'ascii' after encoding to ASCII? – Gibrian 5/12, 2018 at 5:37

If you have a list of strings, you can adapt this as a list comprehension: list_text_fixed = [(s.encode('ascii', 'ignore')).decode("utf-8") for s in list_text] – Trager 10/9, 2019 at 4:48
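On the question in the comments about which codec to decode with: once the non-ASCII characters have been dropped, the remaining bytes are pure ASCII, and ASCII is a subset of UTF-8, so both decodes give the same string. A quick check (Python 3; the variable names are only for this sketch), which also shows the whitespace left behind by the removal:

```python
s = "Airports Council International \u2013 North America"

raw = s.encode('ascii', 'ignore')   # bytes with the en dash dropped
via_utf8 = raw.decode('utf-8')
via_ascii = raw.decode('ascii')

print(via_utf8 == via_ascii)        # True
print(repr(via_ascii))              # the space on each side of the dash remains

# Optional: collapse runs of whitespace left behind by the removal
collapsed = ' '.join(via_ascii.split())
print(collapsed)
```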

26

I just use replace, because I don't need the character:

varstring.replace('\u200c', '')

Or in your case:

u'\u200cHealth & Fitness'.replace('\u200c', '')

Kearns answered 28/3, 2019 at 15:6 Comment(3)

This is actually better than the accepted answer in most strings. The \u200c is a zero width non joiner, which is an unusual whitespace-type character that strip() ignores. In most cases with unicode strs you do not want to encode(ascii, ignore). – Marc 28/3, 2019 at 15:41

This is general solution since ascii may remove some other Unicode characters as well. – Karakalpak 3/12, 2019 at 14:31

appreciate this! – Diclinous 26/8, 2023 at 2:13
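If several invisible characters need removing at once, the same idea extends from replace to str.translate. The sketch below is Python 3. The set of characters in ZERO_WIDTH and the helper name strip_zero_width are choices made for this illustration, not something from the answer, so adjust them to your data (and keep U+200C where the script needs it, for example in Persian):

```python
# Characters to delete; this particular selection is an assumption.
ZERO_WIDTH = {
    '\u200b',  # ZERO WIDTH SPACE
    '\u200c',  # ZERO WIDTH NON-JOINER
    '\u200d',  # ZERO WIDTH JOINER
    '\ufeff',  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}

# Mapping a code point to None tells translate() to delete it.
_DELETE_TABLE = {ord(ch): None for ch in ZERO_WIDTH}

def strip_zero_width(text):
    """Return text with the characters in ZERO_WIDTH removed."""
    return text.translate(_DELETE_TABLE)

print(strip_zero_width('\u200cHealth & Fitness'))   # Health & Fitness
print(strip_zero_width('Caf\u00e9\u200b'))          # Café (accent preserved)
```

Unlike the ASCII round trip, this leaves all other Unicode characters untouched.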

5

For me, the following worked:

mystring.encode('ascii', 'ignore').decode('unicode_escape')

Stocktonontees answered 11/12, 2018 at 10:41 Comment(2)

You could improve your answer by explaining why this code works, and what you're doing here. That way, others can be educated. – Eastman 11/12, 2018 at 13:44

tbh, that was a 'Frankenstein' version of all answers that I had previously found but which didn't work. I can't really explain why this one worked over the rest of solutions in my case.. – Stocktonontees 23/10, 2019 at 11:19
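One possible explanation, offered as a guess rather than a diagnosis of the answerer's data: after encode('ascii', 'ignore') the bytes are plain ASCII, and the unicode_escape codec leaves plain ASCII unchanged unless it contains backslashes, which it interprets as escape sequences. The sketch below (Python 3, with a hypothetical helper name and made-up sample strings) shows both behaviours:

```python
def encode_then_unescape(text):
    # The combination from the answer, wrapped in a function for testing
    return text.encode('ascii', 'ignore').decode('unicode_escape')

# A real zero-width non-joiner is dropped by the encode step.
print(repr(encode_then_unescape('\u200cHealth & Fitness')))

# A literal backslash followed by 'n' (two characters) becomes a newline.
print(repr(encode_then_unescape('line1\\nline2')))

# The literal text \u2013 (six characters) is turned INTO a real en dash.
print(repr(encode_then_unescape('A \\u2013 B')))
```

So the second step only matters when the input contains backslashes, and in that case it can change more than intended.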

2

In the specific case in the question, where the string is prefixed with a single u'\u200c' character, the solution is as simple as taking a slice that does not include the first character.

original = u'\u200cHealth & Fitness'
fixed = original[1:]

If the leading character may or may not be present, str.lstrip may be used (note that it removes every leading occurrence, not just one):

original = u'\u200cHealth & Fitness'
fixed = original.lstrip(u'\u200c')

The same solutions work in Python 3. From Python 3.9, str.removeprefix is also available:

original = u'\u200cHealth & Fitness'
fixed = original.removeprefix(u'\u200c')

Siouan answered 12/1, 2021 at 17:50 Comment(0)
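The three approaches differ when the prefix is absent or repeated. A short comparison (Python 3; the helper remove_one_prefix is a hypothetical stand-in for str.removeprefix on versions before 3.9):

```python
ZWNJ = '\u200c'

def remove_one_prefix(text, prefix=ZWNJ):
    # Same idea as str.removeprefix: drop the prefix once, if present.
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

clean = 'Health & Fitness'
once = ZWNJ + clean
twice = ZWNJ + ZWNJ + clean

print(clean[1:])                           # slicing eats a real character
print(twice.lstrip(ZWNJ) == clean)         # True: lstrip removes every leading ZWNJ
print(remove_one_prefix(twice) == once)    # True: only one prefix removed
print(remove_one_prefix(clean) == clean)   # True: no-op when the prefix is absent
```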

0

If the text is just English, do it this way:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

But for text in languages such as Arabic or Persian, do it this way:

s = s.replace('\\u200c', '')  # removes the literal text \u200c without stripping other backslashes

If you're going to write a text file:

import codecs

# array_string is assumed to be an existing list of str lines
with codecs.open('text_file.txt', 'w', encoding='utf-8') as text_file:
    for line in array_string:
        text_file.write('\u200c' + line + '\n')

Amphictyony answered 17/6 at 9:7 Comment(3)

Wouldn't it be wrong to remove it from Persian where the orthography requires it? – Apparel 17/6 at 10:24

@Andj, I checked, it kept the structure well: ‌ﺍﻭﻟﻮﯾﺖ\u200cﻫﺎﯼ ﭼﺎﭖ to ﺍﻭﻟﻮﯾﺖﻫﺎﯼ ﭼﺎﭖ , as you can see this half-space is keeping! – Amphictyony 17/6 at 10:39

except your comment is using presentation forms – Apparel 17/6 at 11:12
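It may help to distinguish a real U+200C character from the six-character text \u200c (a backslash followed by u200c), which can show up when data has been escaped somewhere upstream. A sketch in Python 3, with sample strings invented for illustration:

```python
real = '\u200cHealth'       # one invisible character + 6 letters
literal = '\\u200cHealth'   # backslash, 'u200c', then 6 letters

print(len(real), len(literal))          # 7 12

# Remove the real character
from_real = real.replace('\u200c', '')
print(from_real)                        # Health

# Remove the literal escape text without touching other backslashes
from_literal = literal.replace('\\u200c', '')
print(from_literal)                     # Health
```

As the comments point out, U+200C is meaningful in Persian orthography, so whether it should be removed at all depends on the text.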
