Peter Piper piped a Python program - and lost all his unicode characters

3

16

I have a Python script that loads a web page using urllib2.urlopen, does some various magic, and spits out the results using print. We then run the program on Windows like so:

python program.py > output.htm

Here's the problem:

urlopen reads data from an IIS web server which outputs UTF-8. The script writes this same data to the output; however, certain characters (such as the long hyphen that Word always inserts for you against your will because it's smarter than you) get garbled and end up like â€“ instead.

Upon further investigation, I noticed even though the web server spits out UTF8 data, the output.htm file is encoded with the ISO-8859-1 character set.

My questions:

  1. When you redirect a Python program to an output file on Windows, does it always use this character set?
  2. If so, is there any way to change that behavior?
  3. If not, is there a workaround? I suppose I could just pass in output.htm as a command line parameter and write to that file instead of the screen, but I'd have to redo a whole bunch of logic in my program.

Thanks for any help!

UPDATE:

At the top of output.htm I added:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.

Abhorrent answered 6/1, 2012 at 16:48 Comment(10)
It's not a pipe. It's redirection. And it's print which is doing the encoding. The pipe or redirect is handled outside Python in Windows. - Neutrino
If it ends up "garbled" like you say, then the output is UTF-8; whatever you're viewing the file in is interpreting it as ISO-8859-1. That is to say, does the resulting HTML file have an XML prolog stating the encoding, or a meta tag specifying the Content-Type? - Jeannettajeannette
Well that's not very alliterative.. - Abhorrent
I've tried viewing the file in both Firefox and IE, both garble the character. I'm checking if the output file has any sort of BOM. - Abhorrent
@MikeChristensen: In Firefox, you can manually set the encoding to UTF-8. Does that make the characters appear correctly? You can also use the Linux file command to check the encoding of the created file (if you have access to a Unix machine). - Odessa
@NiklasBaumstark - Yes, in Firefox if I manually switch to UTF8 then it displays correctly. So, I could fix this with a <!DOCTYPE ...> header in the HTML correct? However, I'm curious why Firefox even thinks it's Western ISO when it's clearly not. No BOM I'm thinking? - Abhorrent
@Mike Christensen: Exactly. No BOM, no HTTP header (because you don't GET it from a webserver) and no character set in the HTML file. This means that Firefox just uses its default encoding (which seems to be latin1 in your case). I added this as an answer. - Odessa
First step would be to hexdump the file to see if it actually is utf8 or not. (most of the posters here are implying that the file is Ok, but is being misinterpreted by the reader) - Tickler
@Tickler - Yea, looking at the raw hex would be key. I have a feeling the file has UTF8 bytes but is not marked as such. This seems to be default Windows behavior from what I can tell, thus the two solutions would be to override the IIS content-type header or just add the meta tag to the HTML. I have chosen the latter. - Abhorrent
IMNSVHO, BOM is mostly bullshit and only there to confuse people. Tools can lie to you, or misinterpret data without saying. hexdump does not lie. - Tickler
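
The hexdump check suggested in the comments can also be done from Python itself. What follows is only a minimal sketch, written to run on Python 3 (the script in the question is Python 2). The helper names hexdump and is_valid_utf8 are made up for illustration. It also reproduces the garbling: the three UTF-8 bytes of an en dash, misread as Windows-1252, turn into three separate characters.

```python
def hexdump(data):
    """Return the bytes as space-separated hex pairs."""
    return " ".join("{:02x}".format(b) for b in bytearray(data))


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+2013 EN DASH, the "long hyphen" from the question, as UTF-8 bytes.
en_dash_utf8 = u"\u2013".encode("utf-8")
print(hexdump(en_dash_utf8))        # e2 80 93
print(is_valid_utf8(en_dash_utf8))  # True

# The same three bytes misread as Windows-1252 give three characters:
# U+00E2, U+20AC and U+201C.
garbled = en_dash_utf8.decode("cp1252")
print(len(garbled))                 # 3
```

If the bytes in output.htm pass a check like this, the file itself is fine and only the reader is guessing the wrong encoding.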
O
8

From your comments and question update it seems that the data is correctly encoded in UTF-8. This means you just need to tell your browser it's UTF-8, either by using a BOM, or better, by adding encoding information to your HTML document:

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

You really shouldn't use an XML declaration if the document is not valid XML.

The best and most reliable way would be to serve the file via HTTP and set the Content-Type: header appropriately.
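
If the HTML is generated by the script, the meta tag can be added programmatically. This is a rough sketch, not a robust HTML rewriter: the function name add_charset_meta and the constant META are hypothetical, and it assumes the page is already held as decoded text and contains at most one head element.

```python
import re

# The tag recommended in the answer above.
META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'


def add_charset_meta(html):
    """Insert the charset meta tag after <head> unless a charset is declared."""
    if re.search(r"charset\s*=", html, re.IGNORECASE):
        return html  # some charset is already declared; leave the page alone
    match = re.search(r"<head(\s[^>]*)?>", html, re.IGNORECASE)
    if match is None:
        # No <head> element found: fall back to putting the tag at the top.
        return META + "\n" + html
    return html[:match.end()] + "\n  " + META + html[match.end():]


page = "<html><head><title>t</title></head><body></body></html>"
print(add_charset_meta(page))
```

Calling the function twice is harmless, because the second call sees the existing charset declaration and returns the text unchanged.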

Odessa answered 6/1, 2012 at 17:15 Comment(4)
That did it, thanks! So, I'm guessing when I redirect the Python script on Windows, it uses the default Windows codepage (Western ISO) even though Python is spitting out UTF8 bytes. Thus, the file has a Western ISO BOM with UTF8 bytes. If I load the file in IE or FF, it sees the BOM and uses that since no meta tag overrode it. If I serve up the file in IIS, IIS probably detects this as well and sets the content-type header to Western ISO. I think using the meta tag is the best fix in this situation, so +1 for your answer. - Abhorrent
@Mike: Not quite, latin1/ISO-8859-1 is an extension to ASCII and has no BOM. Your script works perfectly fine and the redirection is also okay. What went wrong is that either your web server served the wrong Content-Type, because your document had no UTF-8 BOM, or that the web server specified no encoding at all and the browser just used its default encoding, because it wasn't told any better. By the way, this is not a Windows-specific issue and could have gone wrong in a similar way on Linux. - Odessa
Oh in that case I bet the file has no BOM at all. Python probably doesn't just output BOMs to the screen, and Windows probably doesn't add them to files created with > - Abhorrent
@Mike: I can only repeat: Windows has nothing to do with it. It just opens the file and passes the file handle on to your Python script. Also, using a BOM for UTF-8 is neither necessary nor recommended (it breaks ASCII compatibility), so Python is right not to output it. - Odessa
S
5

When you pipe a Python program to an output file on Windows, does it always use this character set?

The default encoding is used when output goes to a pipe or a redirected file. On my machine:

In [5]: sys.getdefaultencoding()
Out[5]: 'ascii'

If not, is there a workaround?

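import sys

# Python 2 only: site.py removes sys.setdefaultencoding at startup,
# and reload(sys) brings it back. Python 3 has neither.
try:
    sys.setappdefaultencoding('utf-8')
except AttributeError:
    # Fall back when sys.setappdefaultencoding is not available.
    sys = reload(sys)
    sys.setdefaultencoding('utf-8')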

Now all output is encoded to 'utf-8'.

I think the correct way to handle this situation without having to "redo a whole bunch of logic" is to decode all data from your internet source from the server or page encoding to unicode, and then to use the workaround shown above to set the default encoding to utf-8.
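
For comparison, here is a sketch of a different technique from the one in this answer: instead of changing the interpreter-wide default, it decodes the input once and encodes explicitly at the point of output. It is written for Python 3, where sys.setdefaultencoding does not exist. The function names transcode_to_utf8 and write_bytes are invented for this example, and the source encoding is assumed to be known.

```python
import io
import sys


def transcode_to_utf8(raw, source_encoding):
    """Decode bytes from the source encoding, then re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> unicode text
    return text.encode("utf-8")         # unicode text -> UTF-8 bytes


def write_bytes(data, stream=None):
    """Write raw bytes to a binary stream (stdout's byte buffer by default)."""
    if stream is None:
        # Python 3 exposes the byte-level stream as sys.stdout.buffer.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(data)


# Usage with an in-memory stream standing in for the redirected output file.
buf = io.BytesIO()
write_bytes(transcode_to_utf8(b"caf\xe9", "latin-1"), buf)
print(buf.getvalue() == b"caf\xc3\xa9")  # True
```

Writing bytes directly means the result does not depend on whatever encoding the interpreter picks for a redirected stdout.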

Sela answered 6/1, 2012 at 17:1 Comment(0)
M
2

Most programs under Windows will assume that you're using the default Windows encoding, which will be Windows-1252 (a close relative of ISO-8859-1) for an English installation. The command window also defaults to a legacy code page rather than UTF-8. There's no way to set the default encoding to UTF-8 unfortunately - there's a code page defined for it (65001), but it's not well supported.

Some editors will recognize any BOM characters at the start of the file and switch to UTF-8, but that's not guaranteed.

If you're generating HTML you should include the proper charset tag; then the browser will interpret it properly.
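
The BOM recognition that some editors perform can be approximated in a few lines. This is only an illustrative sketch of the general idea, not how any particular editor or browser does it. The function name sniff_bom is hypothetical, and the returned strings are Python codec names.

```python
import codecs


def sniff_bom(data):
    """Return the codec implied by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"<html>"))  # utf-8-sig
print(sniff_bom(b"<html>"))                    # None
```

A file written by a plain print and a shell redirect has no BOM, so a check like this returns None and the reader has to fall back on the charset tag or its own default.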

Malfeasance answered 6/1, 2012 at 17:10 Comment(0)
