In Python 3, how do you remove all non-UTF8 characters from a string?

I'm using Python 3.7. How do I remove all non-UTF-8 characters from a string? I tried using "lambda x: x.decode('utf-8','ignore').encode("utf-8")" in the code below

coop_types = map(
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
    filter(None, set(d['type'] for d in input_file))
)

but this results in the following error:

Traceback (most recent call last):
  File "scripts/parse_coop_csv.py", line 30, in <module>
    for coop_type in coop_types:
  File "scripts/parse_coop_csv.py", line 25, in <lambda>
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
AttributeError: 'str' object has no attribute 'decode'

If you have a generic way to remove all non-UTF8 chars from a string, that's all I'm looking for.

Jenicejeniece asked 28/1, 2020 at 16:17 Comment(2)
You first encode x, then decode it. str.encode takes a Unicode string and produces a UTF-8 encoding of it as a bytes object. bytes.decode takes a bytes object and attempts to interpret it as encoded text to produce a str object.Che
Can you give an example of what would be a non-UTF-8 character in an instance of str? Do you mean surrogate code points?Amylase

You're starting with a string. You can't decode a str (it's already decoded text; you can only encode it back to binary data). UTF-8 can encode almost any valid Unicode text (which is what a str stores), so this shouldn't come up much, but if you're encountering surrogate characters in your input, you can just reverse the directions, changing:

x.decode('utf-8','ignore').encode("utf-8")

to:

x.encode('utf-8','ignore').decode("utf-8")

where you encode everything UTF-8 can represent, discarding what it can't, then decode the now-clean UTF-8 bytes back into a str.
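
Applied to the snippet from the question, a minimal sketch would look like this (it reuses the input_file and d['type'] names from the question; errors='ignore' simply drops anything UTF-8 can't represent):

# encode drops code points UTF-8 can't represent (e.g. lone surrogates);
# decode turns the cleaned-up bytes back into a str
coop_types = map(
    lambda x: x.encode('utf-8', 'ignore').decode('utf-8'),
    filter(None, set(d['type'] for d in input_file))
)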

Uranian answered 28/1, 2020 at 16:20 Comment(4)
Side-note: If the problem is surrogates, you may not want to discard them; you may just need to accept them properly (e.g. via json.loads or the like) in the first place, so you never actually see them, you just see the single Unicode character they represent.Uranian
so long as you're familiar with your input data and the consequences of losing chars beyond byte 127, then this is a great choice - perhaps one of the simplest I've found on this topic. good job, @UranianAldrin
@NathanBenton: To be clear, this doesn't lose all characters beyond byte 127 (if you used 'ascii' as the encoding instead of 'utf-8' it would). UTF-8 handles all normal Unicode ordinals, just not high-low surrogates (a UTF-16 thing that doesn't apply to UTF-8); see the sketch after these comments.Uranian
got it - thank you for the feedback and correction, @UranianAldrin
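
A minimal illustration of that last point (the string below is made up for demonstration: '\u00e9' is an ordinary non-ASCII character, '\ud83d' is an unpaired surrogate):

s = 'caf\u00e9 \ud83d!'                                # ordinary non-ASCII text plus a lone surrogate
print(s.encode('utf-8', 'ignore').decode('utf-8'))     # -> 'café !'  (only the surrogate is dropped)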
