How do I post non-ASCII characters using httplib when content-type is "application/xml"
I've implemented a Pivotal Tracker API module in Python 2.7. The Pivotal Tracker API expects POST data to be an XML document and "application/xml" to be the content type.

My code uses urllib2/httplib to post the document as shown:

    request = urllib2.Request(self.url, xml_request.toxml('utf-8') if xml_request else None, self.headers)
    obj = parse_xml(self.opener.open(request))

This yields an exception when the XML text contains non-ASCII characters:

    File "/usr/lib/python2.7/httplib.py", line 951, in endheaders
      self._send_output(message_body)
    File "/usr/lib/python2.7/httplib.py", line 809, in _send_output
      msg += message_body
    exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 89: ordinal not in range(128)

As near as I can see, httplib._send_output is creating an ASCII string for the message payload, presumably because it expects the data to be URL encoded (application/x-www-form-urlencoded). It works fine with application/xml as long as only ASCII characters are used.

Is there a straightforward way to post application/xml data containing non-ASCII characters, or am I going to have to jump through hoops (e.g. using Twisted and a custom producer for the POST payload)?

Taurine answered 3/11, 2011 at 10:15 Comment(0)
P
8

You're mixing Unicode and bytestrings.

    >>> msg = u'abc' # Unicode string
    >>> message_body = b'\xc5' # bytestring
    >>> msg += message_body
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
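
Python 2 performs that decode implicitly when a bytestring is concatenated with a Unicode string. As a rough sketch, the same step can be written out by hand, which also makes it reproducible on Python 3 (where the concatenation itself is a TypeError instead). The byte values here are an illustrative assumption; only the leading 0xc5 comes from the traceback.

```python
# What Python 2 does behind the scenes for `u'abc' + b'\xc5\x81'`:
# it decodes the bytestring with the 'ascii' codec, and that decode fails.
message_body = b'\xc5\x81'  # starts with 0xc5, like the byte in the traceback

try:
    message_body.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failing codec and byte offset match the shape of the error above.
print(type(error).__name__, error.encoding, error.start)
```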

To fix it, make sure that the contents of self.headers are properly encoded, i.e., all keys and values in the headers should be bytestrings:

    self.headers = dict((k.encode('ascii') if isinstance(k, unicode) else k,
                         v.encode('ascii') if isinstance(v, unicode) else v)
                        for k,v in self.headers.items())

Note: the character encoding of the headers has nothing to do with the character encoding of the body, i.e., the XML text can be encoded independently (it is just an octet stream from the HTTP message's point of view).

The same goes for self.url: if it has the unicode type, convert it to a bytestring (using the 'ascii' character encoding).
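
A minimal sketch of applying both conversions in one place, assuming Python 2.7 semantics where a bytestring is a plain str. The helper names to_bytes and normalize_request_parts, and the URL, are hypothetical and not part of the original code.

```python
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_bytes(value, encoding='ascii'):
    """Return value as a bytestring; encode it only if it is a text string."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


def normalize_request_parts(url, headers):
    """Encode the URL and every header name and value to ASCII bytestrings."""
    clean_headers = dict((to_bytes(k), to_bytes(v)) for k, v in headers.items())
    return to_bytes(url), clean_headers


url, headers = normalize_request_parts(
    u'https://tracker.example/api',          # hypothetical URL
    {u'Content-Type': u'application/xml'})
```

A non-ASCII character in the URL or a header raises UnicodeEncodeError at this point, which is easier to diagnose than the UnicodeDecodeError that surfaces later inside httplib.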

An HTTP message consists of a start-line, headers, an empty line, and possibly a message body. self.headers supplies the headers; self.url is used for the start-line (where the HTTP method also goes) and probably for the Host header (if the client speaks HTTP/1.1); the XML text goes into the message body as a binary blob.

It is always safe to use ASCII encoding for self.url: non-ASCII domain names can be converted with IDNA, and the result is also ASCII.
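
As an illustration of the IDNA point, Python's built-in 'idna' codec turns a non-ASCII host name into an ASCII bytestring. The host name and path below are made up for the example.

```python
host = u'b\xfccher.example'         # hypothetical non-ASCII host name
ascii_host = host.encode('idna')    # built-in IDNA codec, returns ASCII bytes

# The rest of the URL is plain ASCII here, so the result is a bytestring.
url = b'http://' + ascii_host + b'/api'   # hypothetical path
print(url)
```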

Here's what RFC 7230 says about the character encoding of HTTP headers:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

To convert XML to a bytestring, see the application/xml encoding considerations:

The use of UTF-8, without a BOM, is RECOMMENDED for all XML MIME entities.
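
A small sketch of producing such a body, assuming xml_request in the question is an xml.dom.minidom document (the toxml('utf-8') call suggests this, but the question does not say so). The element names and text are hypothetical.

```python
from xml.dom import minidom

doc = minidom.Document()
story = doc.createElement('story')    # hypothetical element names
name = doc.createElement('name')
name.appendChild(doc.createTextNode(u'\u0141\xf3d\u017a'))  # non-ASCII text
story.appendChild(name)
doc.appendChild(story)

# Passing an encoding makes toxml() return an encoded bytestring whose XML
# declaration names that encoding; no byte order mark is written.
body = doc.toxml('utf-8')
print(repr(body))
```

The resulting bytes can be passed as the data argument of urllib2.Request as long as the URL and headers are bytestrings too.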

Porker answered 3/11, 2011 at 10:31 Comment(7)
Perhaps you could change the content-type of the headers, but how does that fix the issue? The msg gets constructed in the Python libraries and is a byte string. - Voidable
@jro: It has nothing to do with HTTP. Look at the complete example above. - Porker
I get that this causes the issue, but my point was that he has no control over the msg variable. I agree with your point, but my question is more along the lines of: how can this fact help him solve it when, in the libs, msg is created as msg = "\r\n".join(self._buffer)? - Voidable
@jro: Look at the urllib2.Request(.. line in the question. There is self.headers. I've added code to the answer that ensures that it doesn't have Unicode strings. - Porker
OK, I guess I'm missing something then. The question states that it "yields an exception when the XML text contains non-ASCII characters"... so the data is the issue, not the headers, I'd say. Looking through the sources, I didn't see any connection to the headers solving this issue. But let's not make this a slow chat: I'll wait to see if this solves his issue, and then dive into the sources/docs to find out why. Thanks for taking the time to elaborate, though :). Appreciated. - Voidable
Actually it turns out that the problem was self.url, which in certain circumstances was Unicode. Thanks for the tip! - Taurine
Just read the rest of the comments. To clarify, the message is constructed in httplib from the method, URL, headers, etc. If any of these is Unicode, the whole string gets converted to Unicode (I presume this is normal Python behavior). Then if you try to append a UTF-8 string, you get the error I described in the original question. - Taurine
W
2

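Check whether self.url is unicode. If it is, httplib builds the request message as a unicode string, and appending the UTF-8 encoded body then fails with the implicit ASCII decode shown in the question.

You could encode self.url to a bytestring (for example, self.url.encode('ascii')) so that httplib keeps the whole message as a bytestring.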

Warwick answered 9/6, 2013 at 6:30 Comment(0)
M
1

Same as JF Sebastian's answer, but I'm adding a new one so the code formatting works (and is more googleable).

Here's what happens if you're trying to tag on to the end of a mechanize form request:

    br = mechanize.Browser()
    br.select_form(nr=0)
    br['form_thingy'] = u"Wonderful"
    headers = dict((k.encode('ascii') if isinstance(k, unicode) else k,
                    v.encode('ascii') if isinstance(v, unicode) else v)
                   for k,v in br.request.headers.items())
    br.addheaders = headers
    req = br.submit()
Meganmeganthropus answered 16/4, 2016 at 15:33 Comment(0)
M
0

There are three things to cover here:

  • When a non-Unicode string is concatenated with a Unicode string, the result is automatically converted to a Unicode string.
  • Python 2.7's httplib simply uses + to join the headers with the body. I don't think that is good practice, since we should not rely on automatic type conversion. Python 2.6's httplib behaves differently.
  • The HTTP standard suggests ISO-8859-1 encoding for headers; if you want to include characters outside ISO-8859-1, you have to encode them as described in RFC 2047.

The simple solution is to strictly encode both the headers and the body to UTF-8 before sending.
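
For the RFC 2047 point in the list above, the standard library's email.header module can produce an ASCII-only encoded word. This is only a sketch of the encoding itself, with a made-up value; whether a given server decodes RFC 2047 words in HTTP headers is a separate assumption.

```python
from email.header import Header, decode_header

value = u'\u0141\xf3d\u017a'                # hypothetical non-ASCII header value
encoded = Header(value, 'utf-8').encode()   # RFC 2047 encoded word, ASCII only

# Decoding it again recovers the original text.
raw, charset = decode_header(encoded)[0]
print(encoded)
```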

Magnetomotive answered 4/7, 2015 at 10:20 Comment(0)
