I need to replace all non-ASCII (\x00-\x7F) characters with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:
def remove_non_ascii_1(text):
return ''.join(i for i in text if ord(i)<128)
And this one replaces non-ASCII characters with the amount of spaces as per the amount of bytes in the character code point (i.e. the –
character is replaced with 3 spaces):
def remove_non_ascii_2(text):
return re.sub(r'[^\x00-\x7F]',' ', text)
How can I replace all non-ASCII characters with a single space?
Of the myriad of similar SO questions, none address character replacement as opposed to stripping, and additionally address all non-ascii characters not a specific character.
–
. It's this guy. – OnesidedUnicodeEncodeError
weren't needed, so your code just replaced with something more readable and feasible. Thanks – Selfpronouncingstring.encode()
. – Sacring?
instead of spaces, use something likeprint s.encode('ascii', 'replace')
=>ABRA?O JOS?
forABRAÃO JOSÉ
– Sacringsed
,awk
, orperl
? – Jackysed
,awk
, andperl
answers would be interesting even if they are OT. But I would recommend putting them all in a single "X/Y answer", not separate answers. Usually ased
,awk
, orperl
answer could replace a Python answer if the code is running from e.g. a bash CLI where all four are generally available, not where actual Python scripts are running. – Onesided