Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?
Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:

In [5]: print urllib.unquote(urllib.quote(u'Cataño'))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

/home/kkinder/<ipython console> in <module>()

/usr/lib/python2.6/urllib.pyc in quote(s, safe)
   1222             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
   1223         _safemaps[cachekey] = safe_map
-> 1224     res = map(safe_map.__getitem__, s)
   1225     return ''.join(res)
   1226 

KeyError: u'\xc3'

Encoding the value to UTF8 also does not work:

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño

It's recognized as a bug and there is a fix, but not for my version of Python.

What I'd like is something similar to urllib.quote/urllib.unquote that handles Unicode values correctly, so that this code would work:

decode_url(encode_url(u'Cataño')) == u'Cataño'

Any recommendations?

Kettle answered 5/4, 2011 at 20:8 Comment(3)
Luckily, it seems the OP has somehow got confused: as the traceback shows, this is really 2.6. - Windstorm
I don't know what's happening on your end, but I pasted your quote/unquote example verbatim into my interpreter python2.6, and it correctly printed Cataño. - Imbed
Ah, nm, bobince already answered that below. - Imbed
47

"Python's urllib.quote and urllib.unquote do not handle Unicode correctly"

urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.
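One way to wrap that manual step is a pair of small helpers. The sketch below is not part of urllib: encode_url and decode_url are hypothetical names borrowed from the question, UTF-8 is an assumed encoding, and the try/except import is only there so the same code runs on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: percent-encode Unicode text by going through UTF-8 bytes.
# encode_url/decode_url are hypothetical helper names, not urllib functions.
try:
    # Python 2: quote/unquote operate on byte strings.
    from urllib import quote, unquote
except ImportError:
    # Python 3: use the bytes-oriented variants so both branches behave alike.
    from urllib.parse import quote_from_bytes as quote
    from urllib.parse import unquote_to_bytes as unquote


def encode_url(text):
    """Encode Unicode text as UTF-8 bytes, then percent-encode them."""
    return quote(text.encode('utf-8'))


def decode_url(quoted):
    """Undo percent-encoding, then decode the UTF-8 bytes back to text."""
    # str() coerces an ASCII-only unicode object to a byte string on Python 2.
    return unquote(str(quoted)).decode('utf-8')


if __name__ == '__main__':
    original = u'Cata\u00f1o'
    print(repr(encode_url(original)))
    print(decode_url(encode_url(original)) == original)
```

The key point is that the encode and decode steps have to be paired: whatever encoding is applied before quote must be reversed after unquote.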

IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.

"Encoding the value to UTF8 also does not work:"

In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño

Ah, well now you're typing Unicode into a console, and doing print-Unicode to the console. This is generally unreliable, especially in Windows and in your case with the IPython console.

Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:

>>> u'Cata\u00F1o'.encode('utf-8')
'Cata\xC3\xB1o'
>>> urllib.quote(_)
'Cata%C3%B1o'

>>> urllib.unquote(_)
'Cata\xC3\xB1o'
>>> _.decode('utf-8')
u'Cata\xF1o'
Mera answered 9/4, 2011 at 13:37 Comment(1)
Actually the problem was that I never decoded the URL when I was testing with UTF8. Simple mistake. - Kettle
5

"""Encoding the value to UTF8 also does not work""" ... the result of your code is a str object which at a guess appears to be the input encoded in UTF-8. You need to decode it or define "does not work" -- what do you expect?

Note: So that we don't need to guess the encoding of your terminal and the type of your data, use print repr(whatever) instead of print whatever.

>>> # Python 2.6.6
... from urllib import quote, unquote
>>> s = u"Cata\xf1o"
>>> q = quote(s.encode('utf8'))
>>> u = unquote(q).decode('utf8')
>>> for x in (s, q, u):
...     print repr(x)
...
u'Cata\xf1o'
'Cata%C3%B1o'
u'Cata\xf1o'
>>>

For comparison:

>>> # Python 3.2
... from urllib.parse import quote, unquote
>>> s = "Cata\xf1o"
>>> q = quote(s)
>>> u = unquote(q)
>>> for x in (s, q, u):
...     print(ascii(x))
...
'Cata\xf1o'
'Cata%C3%B1o'
'Cata\xf1o'
>>>
Madness answered 5/4, 2011 at 20:41 Comment(3)
Very simply, I expect the result from unquote to be what I sent to quote(). I figured out that urllib is basically expecting a latin1 encoding. - Kettle
@Ken: I would expect that latin1 is accidental rather than an expectation. In any case, latin1 won't handle your problem in general. You should also expect that the result of quote() will give the "right" answer -- hence my comparison with Python 3.2. Python 2.6.6 quote using latin1 instead of utf8 produces 'Cata%F1o'. - Madness
This totally solved my problem: q = quote(s.encode('utf8')) - Molten
2

I encountered the same problem and used a helper function so that urllib.urlencode (which calls quote_plus internally) can handle non-ASCII values:

import logging
import urllib

def utf8_urlencode(params):
    # urllib.urlencode() is not Unicode-safe in Python 2, so UTF-8 encode
    # every key and value first. A new dict is built instead of mutating
    # params, so unencoded unicode keys never reach urlencode().
    encoded = {}
    for k, v in params.items():
        try:
            if not isinstance(v, (int, long, float)):
                v = v.encode('utf-8')  # numbers pass through unchanged
            encoded[k.encode('utf-8')] = v
        except Exception as e:
            # Pairs that cannot be encoded are logged and skipped.
            logging.warning('**ERROR utf8_urlencode ERROR** %s' % e)
    # The encoded query string is pure ASCII; return it as unicode.
    return urllib.urlencode(encoded.items()).decode('utf-8')

Adapted from Unicode URL encode / decode with Python
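For the decode direction, here is a rough sketch of the same idea as a round trip. The names encode_params, decode_params, to_utf8 and to_text are made up for illustration, UTF-8 is assumed throughout, and the import fallback only exists so the sketch runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch of a UTF-8 round trip for query strings. All helper names here are
# hypothetical; only urlencode and parse_qsl come from the standard library.
try:  # Python 2
    from urllib import urlencode
    from urlparse import parse_qsl
    text_type = unicode
except ImportError:  # Python 3
    from urllib.parse import urlencode, parse_qsl
    text_type = str


def to_utf8(value):
    """UTF-8 encode text; leave numbers and byte strings untouched."""
    return value.encode('utf-8') if isinstance(value, text_type) else value


def to_text(value):
    """Decode UTF-8 bytes to text; leave text untouched."""
    return value.decode('utf-8') if isinstance(value, bytes) else value


def encode_params(params):
    """Build a query string from a dict with Unicode keys and values."""
    return urlencode([(to_utf8(k), to_utf8(v)) for k, v in params.items()])


def decode_params(query):
    """Parse a query string back into a dict of Unicode text."""
    # str() turns an ASCII-only unicode query into a byte string on Python 2,
    # so parse_qsl yields UTF-8 bytes that to_text can decode.
    return dict((to_text(k), to_text(v)) for k, v in parse_qsl(str(query)))


if __name__ == '__main__':
    query = encode_params({u'city': u'Cata\u00f1o'})
    print(query)
    print(decode_params(query) == {u'city': u'Cata\u00f1o'})
```

Note that numeric values come back as text after decoding, since a query string carries no type information.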

Boston answered 4/6, 2015 at 15:49 Comment(0)
1

So I had the same problem: I wanted to put query parameters in a URL, but some of them contained weird characters (diacritics).

Dealing with encoding gave a messy URL and was fragile.

My solution was to replace every accented/weird Unicode character with its ASCII equivalent. It's straightforward thanks to unidecode (see "What is the best way to remove accents in a Python unicode string?"):

pip install unidecode

then

from unidecode import unidecode
print unidecode(u"éèê")
# prints eee

That gives me a clean URL. It also works for Chinese, etc.

Babylonia answered 4/9, 2015 at 17:28 Comment(0)
