I'm trying to download a binary file using XMLHttpRequest (in a recent WebKit) and base64-encode its contents using these simple functions:
function getBinary(file) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", file, false);
    xhr.overrideMimeType("text/plain; charset=x-user-defined");
    xhr.send(null);
    return xhr.responseText;
}

function base64encode(binary) {
    return btoa(unescape(encodeURIComponent(binary)));
}

var binary = getBinary('http://some.tld/sample.pdf');
var base64encoded = base64encode(binary);
As a side note, everything above is standard JavaScript, including btoa() and encodeURIComponent(): https://developer.mozilla.org/en/DOM/window.btoa
This works pretty smoothly, and I can even decode the base64 contents using JavaScript:
function base64decode(base64) {
    return decodeURIComponent(escape(atob(base64)));
}

var decodedBinary = base64decode(base64encoded);
decodedBinary === binary // true
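For comparison, here is a rough Python 3 sketch of what the JavaScript helpers above appear to do, assuming that btoa(unescape(encodeURIComponent(s))) amounts to base64 over the UTF-8 bytes of the string s. The names js_base64encode, js_base64decode and x_user_defined_to_bytes are hypothetical and only for illustration. The last helper also assumes that the x-user-defined charset maps each byte 0x80-0xFF to a code point in U+F780-U+F7FF, so that the low 8 bits of each character are the original byte.

```python
import base64


def js_base64encode(text):
    """Mirror btoa(unescape(encodeURIComponent(text))): base64 of the UTF-8 bytes."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def js_base64decode(encoded):
    """Mirror decodeURIComponent(escape(atob(encoded))): UTF-8 text from base64."""
    return base64.b64decode(encoded).decode("utf-8")


def x_user_defined_to_bytes(text):
    """Keep only the low 8 bits of each character (assumed x-user-defined mapping)."""
    return bytes(ord(ch) & 0xFF for ch in text)


# Hypothetical responseText for the bytes 0x89 'P' 'N' 'G' read as x-user-defined.
response_text = "\uf789PNG"
encoded = js_base64encode(response_text)
round_trip = js_base64decode(encoded)
raw_bytes = x_user_defined_to_bytes(round_trip)
```

Under these assumptions there is no percent-decoding step on the Python side, since unescape() has already undone the percent-escapes in JavaScript; what remains is base64 over UTF-8 text, not over the raw file bytes.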
Now, I want to decode the base64-encoded contents using Python, which consumes a JSON string to get the base64-encoded string value. Naively, this is what I do:
import urllib
import base64

# ... retrieving of base64 encoded string through JSON
# (named `encoded` so that it does not shadow the base64 module)
encoded = "77+9UE5HDQ……………oaCgA="
source_contents = urllib.unquote(base64.b64decode(encoded))
destination_file = open(destination, 'wb')
destination_file.write(source_contents)
destination_file.close()
But the resulting file is invalid. It looks like the operation messed something up with UTF-8 or the encoding, but exactly what is still unclear to me.
If I try to decode UTF-8 contents before putting them in the destination file, an error is raised:
import urllib
import base64

# ... retrieving of base64 encoded string through JSON
# (named `encoded` so that it does not shadow the base64 module)
encoded = "77+9UE5HDQ……………oaCgA="
source_contents = urllib.unquote(base64.b64decode(encoded)).decode('utf-8')
destination_file = open(destination, 'wb')
destination_file.write(source_contents)
destination_file.close()
$ python test.py
// ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
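One way to see what the payload actually contains is to decode only the visible prefix of the string above. This is a small Python 3 sketch; the `==` padding is added here only so that the truncated prefix decodes on its own, and the variable names are illustrative.

```python
import base64

# Visible prefix of the base64 string from the question, padded with "=="
# (the padding is an assumption so the 10-character prefix decodes cleanly).
prefix = "77+9UE5HDQ" + "=="

raw = base64.b64decode(prefix)   # bytes as stored in the payload
text = raw.decode("utf-8")       # the same bytes read as UTF-8 text

print(raw)
print([hex(ord(ch)) for ch in text])  # code points after UTF-8 decoding
```

If the first code point turns out to be U+FFFD (the replacement character named in the traceback above), the leading byte of the file was probably already replaced before the base64 step, in which case no amount of decoding on the Python side could bring it back.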
As a side note, here's a screenshot of two textual representations of the same file: on the left, the original; on the right, the one created from the base64-decoded string: http://cl.ly/0U3G34110z3c132O2e2x
Is there a known trick to circumvent these encoding problems when attempting to recreate the file? How would you achieve this yourself?
Any help or hint much appreciated :)
… codecs module for writing the destination file using the 'utf-8' codec, with no luck as well, but I might have messed up something somewhere. – Polyandrist
… base64encode() function I'm using is unable to convert some characters… The strange thing is that the reverse operation works perfectly in JavaScript… – Polyandrist
… btoa(), encodeURIComponent() and unescape()) are standard. Same for the Python part, nothing other than stdlib stuff used… I'll investigate the strange byte values, but this looks to be a real pain :( – Polyandrist