I have a Python script that loads a web page using urllib2.urlopen, does some various magic, and spits out the results using print. We then run the program on Windows like so:
python program.py > output.htm
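For reference, the script boils down to something like this sketch (the URL and the processing step are simplified placeholders, not the real program):

# Simplified sketch of the script: fetch a page, do the "magic", print it.
# The URL and the processing step here are placeholders.
import urllib2

response = urllib2.urlopen("http://example.com/page")  # placeholder URL
html = response.read()   # raw bytes from the server (UTF-8, per IIS)

# ... various magic happens to `html` here ...

print html               # goes to stdout, which gets redirected to output.htm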
Here's the problem:
The urlopen call reads data from an IIS web server which outputs UTF-8. The script spits out this same data to the output, but certain characters (such as the long hyphen that Word always inserts for you against your will because it's smarter than you) get garbled and end up like – instead.
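That three-character sequence is exactly what you get when the UTF-8 bytes of a long dash are re-read as a Western single-byte encoding; a quick Python 2 illustration:

# Illustration: the UTF-8 bytes of a long dash, re-decoded as cp1252
# (Windows "Western"), become three characters instead of one.
dash = u"\u2013"                          # en dash
utf8_bytes = dash.encode("utf-8")         # '\xe2\x80\x93'
print repr(utf8_bytes.decode("cp1252"))   # three characters of mojibake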
Upon further investigation, I noticed that even though the web server spits out UTF-8 data, the output.htm file is encoded with the ISO-8859-1 character set.
My questions:
- When you redirect a Python program to an output file on Windows, does it always use this character set?
- If so, is there any way to change that behavior?
- If not, is there a workaround? I suppose I could just pass in output.htm as a command line parameter and write to that file instead of the screen (see the sketch below), but I'd have to redo a whole bunch of logic in my program.
Thanks for any help!
UPDATE:
At the top of output.htm I added:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.
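For what it's worth, a quick way to see what bytes actually ended up in the file (rather than trusting the browser's guess) would be something like this (path is just an example):

# Dump the raw bytes around the first suspicious byte so we can see whether
# the file really contains UTF-8 (\xe2\x80\x93 for the dash) or something else.
data = open("output.htm", "rb").read()
pos = data.find("\xe2")    # first byte of a UTF-8 dash, if present
if pos >= 0:
    print repr(data[max(0, pos - 10):pos + 10])
else:
    print "no \\xe2 byte found -- the dash was re-encoded or dropped"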
Comments:
- It's print which is doing the encoding. The pipe or redirect is handled outside Python in Windows. – Neutrinofile
- You could use the file command to check the encoding of the created file (if you have access to a Unix machine). – Odessa
- Is the <!DOCTYPE ...> header in the HTML correct? However, I'm curious why Firefox even thinks it's Western ISO when it's clearly not. No BOM I'm thinking? – Abhorrent
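Regarding the first comment, a minimal way to check what encoding Python thinks stdout has when the output is redirected (a sketch, assuming Python 2):

import sys

# When stdout is redirected to a file, Python 2 typically reports
# sys.stdout.encoding as None; at an interactive Windows console it
# reports the console code page instead.
sys.stderr.write("isatty: %r, encoding: %r\n"
                 % (sys.stdout.isatty(), sys.stdout.encoding))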