How do I post unicode characters using httplib?

I am trying to POST Unicode data with the request method of httplib.HTTPSConnection:

s = u"עברית"
data = """
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
""" % s

con = httplib.HTTPSConnection("www.google.com")
con.request("POST", "/tbproxy/spell?lang=he", data)
response = con.getresponse().read()

However, I get this error:

Traceback (most recent call last):
  File "C:\Scripts\iQuality\test.py", line 47, in <module>
    print spellFix(u"╫á╫נ╫¿╫ץ╫ר╫ץ")
  File "C:\Scripts\iQuality\test.py", line 26, in spellFix
    con.request("POST", "/tbproxy/spell?lang=%s" % lang, data)
  File "C:\Python27\lib\httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\lib\httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 815, in _send_output
    self.send(message_body)
  File "C:\Python27\lib\httplib.py", line 787, in send
    self.sock.sendall(data)
  File "C:\Python27\lib\ssl.py", line 220, in sendall
    v = self.send(data[count:])
  File "C:\Python27\lib\ssl.py", line 189, in send
    v = self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 97-102: ordinal not in range(128)

What am I doing wrong?

Trigeminal answered 14/4, 2012 at 0:28 Comment(0)

HTTP is not defined in terms of a particular character encoding; it transmits octets. You need to encode your data yourself, and then tell the server which encoding you used. Let's use UTF-8, since it's usually the best choice.

This data looks a bit like XML, but you are skipping the XML declaration. Some services may accept that, but you shouldn't rely on it. The encoding actually belongs in that declaration, so make sure you include it. The declaration looks like <?xml version="1.0" encoding="encoding"?>:

s = u"עברית"
data_unicode = u"""<?xml version="1.0" encoding="UTF-8"?>
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
""" % s

data_octets = data_unicode.encode('utf-8')
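To illustrate why the explicit encode matters, here is a minimal sketch. The variable names (text, octets, ascii_failed) are illustrative and not from the question. It assumes that the traceback comes from the unicode body being implicitly encoded as ASCII when it is written to the socket, which is what the error message suggests.

```python
# -*- coding: utf-8 -*-
# Illustrative snippet; runs on Python 2.7 and Python 3.
text = u"<text>עברית</text>"

# Explicit UTF-8 encoding turns the text into octets that can go on the wire.
octets = text.encode("utf-8")

# Encoding the same text as ASCII fails, just like the implicit
# conversion in the traceback.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Each Hebrew letter takes two bytes in UTF-8, so the byte length
# is larger than the character count.
char_count = len(text)
byte_count = len(octets)
```

Note that the body length the server sees is the byte count, not the character count, which is another reason to encode before handing the data to the connection.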

As a matter of courtesy, you should also tell the server the format and encoding with the Content-Type header:

con = httplib.HTTPSConnection("www.google.com")
con.request("POST",
            "/tbproxy/spell?lang=he", 
            data_octets, {'content-type': 'text/xml; charset=utf-8'})
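For readers on Python 3, where httplib became http.client, cgi.escape was removed, and urlencode moved to urllib.parse, the same pieces can be assembled roughly as follows. This is only a sketch: build_spell_request and TEMPLATE are hypothetical names introduced here, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Same request document as in the answer, including the XML declaration.
TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<spellrequest textalreadyclipped="0" ignoredups="1" '
    'ignoredigits="1" ignoreallcaps="0">\n'
    "<text>{}</text>\n"
    "</spellrequest>\n"
)


def build_spell_request(word, lang="en"):
    """Return (path, body, headers) for a POST; hypothetical helper."""
    # Escape &, < and > so the word cannot break the XML, then encode to bytes.
    body = TEMPLATE.format(escape(word)).encode("utf-8")
    # urlencode percent-encodes the query string value if needed.
    path = "/tbproxy/spell?" + urlencode({"lang": lang})
    # Declare both the format and the charset used for the body.
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    return path, body, headers


path, body, headers = build_spell_request("עברית & more", lang="he")
```

The three values can then be passed to http.client.HTTPSConnection("www.google.com").request("POST", path, body, headers). Whether that endpoint still responds is not something this sketch assumes.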

EDIT: It works fine on my machine. Are you sure you're not skipping something? Here is a full example:

>>> from cgi import escape
>>> from urllib import urlencode
>>> import httplib
>>> 
>>> template = u"""<?xml version="1.0" encoding="UTF-8"?>
... <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
... <text>%s</text>
... </spellrequest>
... """
>>> 
>>> def chkspell(word, lang='en'):
...     data_octets = (template % escape(word)).encode('utf-8')
...     con = httplib.HTTPSConnection("www.google.com")
...     con.request("POST",
...         "/tbproxy/spell?" + urlencode({'lang': lang}),
...         data_octets,
...         {'content-type': 'text/xml; charset=utf-8'})
...     req = con.getresponse()
...     return req.read()
... 
>>> chkspell('baseball')
'<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="8"></spellresult>'
>>> chkspell(corpus, 'he')
'<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="5"></spellresult>'

I did notice that when I pasted your example, it appears in the opposite order on my terminal from how it shows in my browser. Not too surprising considering Hebrew is a right-to-left language.

>>> corpus = u"עברית"
>>> print corpus[0]
ע
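The display order can be separated from the stored order by looking at the code points directly. A small sketch using the standard unicodedata module (the variable names are illustrative):

```python
# -*- coding: utf-8 -*-
import unicodedata

corpus = u"עברית"

# The string is stored in logical (reading) order; index 0 is the
# first letter of the word regardless of how a terminal renders it.
first = corpus[0]
name = unicodedata.name(first)
bidi = unicodedata.bidirectional(first)  # "R" marks a right-to-left letter

# Code points are unambiguous even when the terminal reorders glyphs.
code_points = ["U+%04X" % ord(ch) for ch in corpus]
```

If the code points match on both sides, the difference between the terminal and the browser is only a rendering matter and does not affect what is sent to the server.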
Liberal answered 14/4, 2012 at 0:58 Comment(4)
Omitting the XML declaration is fine. You only need it when you want a non-UTF encoding or XML 1.1. - Claude
Google actually returns an error if you send the XML declaration. - Trigeminal
@iTayb: What does the error look like? It worked fine on my machine. - Liberal
@TokenMacGuy This is the response I get back: <?xml version="1.0" encoding="UTF-8"?><spellresult error="1"/> - Trigeminal
