In Python 3, how do you remove all non-UTF8 characters from a string?

I'm using Python 3.7. How do I remove all non-UTF-8 characters from a string? I tried using "lambda x: x.decode('utf-8','ignore').encode("utf-8")" in the code below

coop_types = map(
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
    filter(None, set(d['type'] for d in input_file))
)

but this results in the following error:

Traceback (most recent call last):
  File "scripts/parse_coop_csv.py", line 30, in <module>
    for coop_type in coop_types:
  File "scripts/parse_coop_csv.py", line 25, in <lambda>
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
AttributeError: 'str' object has no attribute 'decode'

If you have a generic way to remove all non-UTF8 chars from a string, that's all I'm looking for.

Jenicejeniece asked 28/1, 2020 at 16:17 Comment(2)
You first encode x, then decode it. str.encode takes a Unicode string and produces a UTF-8 encoding of it as a bytes object. bytes.decode takes a bytes object and attempts to interpret it as encoded text to produce a str object.Che
Can you give an example of what would be a non-UTF-8 character in an instance of str? Do you mean surrogate code points?Amylase

You're starting with a string. You can't decode a str (it's already decoded text; you can only encode it back to binary data). UTF-8 can encode almost any valid Unicode text (which is what a str stores), so this shouldn't come up much, but if you're encountering surrogate characters in your input, you can just reverse the directions, changing:

x.decode('utf-8','ignore').encode("utf-8")

to:

x.encode('utf-8','ignore').decode("utf-8")

where you encode everything UTF-8 can represent, discarding what it can't, then decode the now-clean UTF-8 bytes back into a str.
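
Applied to the snippet from the question, a minimal sketch would look like this (it reuses the input_file and d['type'] names from the question; errors='ignore' simply drops anything UTF-8 can't represent):

# encode drops code points UTF-8 can't represent (e.g. lone surrogates);
# decode turns the cleaned-up bytes back into a str
coop_types = map(
    lambda x: x.encode('utf-8', 'ignore').decode('utf-8'),
    filter(None, set(d['type'] for d in input_file))
)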

Uranian answered 28/1, 2020 at 16:20 Comment(4)
Side-note: If the problem is surrogates, you may not want to discard them; you may just need to accept them properly (e.g. via json.loads or the like) in the first place, so you never actually see them, you just see the single Unicode character they represent.Uranian
so long as you're familiar with your input data and the consequences of losing chars beyond byte 127, then this is a great choice - perhaps one of the simplest I've found on this topic. good job, @UranianAldrin
@NathanBenton: To be clear, this doesn't lose all characters beyond byte 127 (if you used 'ascii' as the encoding instead of 'utf-8' it would). UTF-8 handles all normal Unicode ordinals, just not high-low surrogates (a UTF-16 thing that doesn't apply to UTF-8); see the sketch after these comments.Uranian
got it - thank you for the feedback and correction, @UranianAldrin
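
A minimal illustration of that last point (the string below is made up for demonstration: '\u00e9' is an ordinary non-ASCII character, '\ud83d' is an unpaired surrogate):

s = 'caf\u00e9 \ud83d!'                                # ordinary non-ASCII text plus a lone surrogate
print(s.encode('utf-8', 'ignore').decode('utf-8'))     # -> 'café !'  (only the surrogate is dropped)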
