Python Convert punycode back to unicode

I'm trying to add contacts to SendGrid from a database that occasionally stores the user's email address in Punycode, e.g. [email protected], which translates to example-email@yahóo.com in Unicode.

If I try to add the ASCII version, I get an error because SendGrid doesn't accept it; it does, however, accept the Unicode version.

So is there a way to convert these addresses in Python?

Long story short: is there a way to decode Punycode to Unicode?

Edit

As suggested in the comments, I tried 'example-email@yahóo.com'.encode('punycode').decode(), which returns [email protected]. That is not the form used outside of Python, so it is not a valid solution.

Thanks in advance.

Coo answered 19/4, 2022 at 8:47 Comment(4)
What's wrong with the punycode encoding (see the reference at Python Specific Encodings)? 'example-email@yahóo.com'.encode('punycode').decode() returns [email protected] and vice versa: '[email protected]'.encode().decode('punycode') -> example-email@yahóo.com. - Flatulent
That does work, thanks. However, it's a completely different format, probably because it's a Python-specific encoding, so it still leaves the problem of how to get the correct format. - Coo
xn--yah-sqa.com isn't a valid Punycode/IDN string. You can verify this with a Punycode converter: yahóo.com translates to Punycode as xn--yaho-sqa.com. - Flatulent
Yes, forgive the typo, but this reinforces my point that the Python punycode codec gives a completely different result: 'email@yahóo.com'.encode('punycode').decode() returns [email protected], when it's really xn--yaho-sqa.com I'm looking for. I will also edit the original question. - Coo

There is the xn-- ACE prefix in your encoded e-mail address:

The ACE prefix for IDNA is "xn--" or any capitalization thereof.

So apply the idna encoding (see Python Specific Encodings):

Codec idna: Implement RFC 3490, see also encodings.idna. Only errors='strict' is supported.

Result:

'yahóo.com'.encode('idna').decode()
# 'xn--yaho-sqa.com'

and vice versa:

'xn--yaho-sqa.com'.encode().decode('idna')
# 'yahóo.com'
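
Since the question is about whole e-mail addresses rather than bare domains, here is a minimal sketch of how the codec could be applied to the domain part only. The function names email_to_unicode and email_to_ascii are hypothetical helpers made up for illustration, and the sketch assumes the local part (before the @) should be left untouched.

```python
def email_to_unicode(address: str) -> str:
    """Decode the IDNA (xn--) domain of an e-mail address to Unicode."""
    # Split on the last "@" so the local part is never altered.
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an e-mail address: {address!r}")
    # The built-in 'idna' codec decodes each dot-separated label.
    return f"{local}@{domain.encode('ascii').decode('idna')}"


def email_to_ascii(address: str) -> str:
    """Encode the Unicode domain of an e-mail address to its xn-- form."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an e-mail address: {address!r}")
    return f"{local}@{domain.encode('idna').decode('ascii')}"


print(email_to_unicode("example-email@xn--yaho-sqa.com"))
# example-email@yahóo.com
print(email_to_ascii("example-email@yahóo.com"))
# example-email@xn--yaho-sqa.com
```

Addresses whose domain is already plain ASCII without an xn-- label pass through unchanged, so the helper can be applied to every row from the database.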

You could use the idna library instead:

Support for the Internationalised Domain Names in Applications (IDNA) protocol as specified in RFC 5891. This is the latest version of the protocol and is sometimes referred to as “IDNA 2008”.

This library also provides support for Unicode Technical Standard 46, Unicode IDNA Compatibility Processing.

This acts as a suitable replacement for the “encodings.idna” module that comes with the Python standard library, but which only supports the older superseded IDNA specification (RFC 3490).
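
A short sketch of the same round trip with that library, assuming the third-party idna package is installed (pip install idna). The wrapper names domain_to_unicode and domain_to_ascii are hypothetical, and the fallback to the built-in codec is only there so the snippet still runs without the package.

```python
try:
    import idna  # third-party package, assumed installed via "pip install idna"

    def domain_to_unicode(domain: str) -> str:
        # idna.decode accepts the ASCII (xn--) form and returns a str.
        return idna.decode(domain)

    def domain_to_ascii(domain: str) -> str:
        # idna.encode returns bytes, so decode them to get a str.
        return idna.encode(domain).decode("ascii")

except ImportError:
    # Fallback: the standard library codec (older RFC 3490 rules).
    def domain_to_unicode(domain: str) -> str:
        return domain.encode("ascii").decode("idna")

    def domain_to_ascii(domain: str) -> str:
        return domain.encode("idna").decode("ascii")


print(domain_to_unicode("xn--yaho-sqa.com"))  # yahóo.com
print(domain_to_ascii("yahóo.com"))           # xn--yaho-sqa.com
```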

Flatulent answered 19/4, 2022 at 9:15 Comment(4)
encode('punycode') works even better and supports decoding. - Avoid
@Avoid yes, punycode works, however it gives slightly different results: ['yahóo.com'.encode( 'idna').decode(), 'yahóo.com'.encode( 'punycode').decode()] returns the following: ['xn--yaho-sqa.com', 'yaho.com-x3a'] - Flatulent
Yes, you need to prepend "xn--", or remove the trailing "-" if old_text + "-" == new_text, because in that case there were no symbols to encode. - Avoid
At least the built-in "idna" doesn't work as well. - Avoid
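
To illustrate the relationship discussed in these comments: the raw punycode codec works on a single string, while the xn-- form is built label by label. The following is only a rough sketch with hypothetical helper names; it skips the normalization (nameprep) and validation that the idna codec performs, so it is not a replacement for it.

```python
ACE_PREFIX = "xn--"


def label_to_ace(label: str) -> str:
    # Pure-ASCII labels are kept as they are; the punycode codec would
    # otherwise append a trailing "-" to them.
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def label_from_ace(label: str) -> str:
    # Only labels carrying the ACE prefix hold punycode data.
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def domain_to_ace(domain: str) -> str:
    return ".".join(label_to_ace(part) for part in domain.split("."))


def domain_from_ace(domain: str) -> str:
    return ".".join(label_from_ace(part) for part in domain.split("."))


print("com".encode("punycode"))             # b'com-'
print(domain_to_ace("yahóo.com"))           # xn--yaho-sqa.com
print(domain_from_ace("xn--yaho-sqa.com"))  # yahóo.com
```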
