Python's os.path choking on Hebrew filenames

I'm writing a script that has to move some files around, but unfortunately os.path doesn't seem to play well with internationalization. When I have files named in Hebrew, there are problems. Here's a screenshot of the contents of a directory:

[Screenshot of the directory listing; source: thegreenplace.net]

Now consider this code that goes over the files in this directory:

import os

files = os.listdir('test_source')  # byte-string path, so byte-string names come back

for f in files:
    pf = os.path.join('test_source', f)
    print pf, os.path.exists(pf)

The output is:

test_source\ex True
test_source\joe True
test_source\mie.txt True
test_source\__()'''.txt True
test_source\????.txt False

Notice how os.path.exists thinks that the Hebrew-named file doesn't even exist? How can I fix this?

ActivePython 2.5.2 on Windows XP Home SP2

Choanocyte answered 30/1, 2009 at 21:3 Comment(0)
C
17

Hmm, after some digging it appears that when supplying os.listdir a unicode string, this kinda works:

files = os.listdir(u'test_source')

for f in files:
    pf = os.path.join(u'test_source', f)
    print pf.encode('ascii', 'replace'), os.path.exists(pf)

===>

test_source\ex True
test_source\joe True
test_source\mie.txt True
test_source\__()'''.txt True
test_source\????.txt True

Some important observations here:

  • Windows XP (like all NT derivatives) stores all filenames in Unicode
  • os.listdir (and similar functions, like os.walk) should be passed a unicode string in order to work correctly with Unicode paths. Here's a quote from the Python Unicode HOWTO:

os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames.

  • And lastly, print wants an ascii string, not unicode, so the path has to be encoded to ascii.
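
To make the quoted behaviour concrete, here is a minimal sketch rather than a definitive recipe. It is written for Python 3 so that it runs today, where the roles are spelled `str` (text) and `bytes` (8-bit) instead of `unicode` and `str` as on the 2.5 setup in the question. The helper name `list_both_ways` and the temporary demo directory are made up for illustration, and the sketch assumes the filesystem can store a Hebrew name.

```python
import os
import tempfile


def list_both_ways(directory):
    """Return (text_names, byte_names) for the same directory."""
    # A text path gives text names back.
    text_names = sorted(os.listdir(directory))
    # A bytes path gives the encoded, 8-bit names back.
    byte_names = sorted(os.listdir(os.fsencode(directory)))
    return text_names, byte_names


# Hypothetical demo directory holding a single Hebrew-named file.
demo_dir = tempfile.mkdtemp()
hebrew_name = u'\u05e2\u05d1\u05e8\u05d9\u05ea.txt'
open(os.path.join(demo_dir, hebrew_name), 'w').close()

text_names, byte_names = list_both_ways(demo_dir)
print(type(text_names[0]).__name__, type(byte_names[0]).__name__)

# Joining the text path with the text name yields a path that exists.
found = os.path.exists(os.path.join(demo_dir, text_names[0]))
print(found)
```

The point is the same as in the quote: the type of the path you pass in decides the type of the names you get back, so passing text in keeps the Hebrew name intact.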
Choanocyte answered 30/1, 2009 at 21:40 Comment(3)
print doesn't seem to be picky about ASCII in all environments, though. See my answer. - Coburn
print has no problem printing Unicode: the problem may be in the stdout encoding. If the console is Unicode there is no problem; otherwise an explicit encode is required. - Isomerous
That's excellent. That should mean you can report sane file names on Windows too if you print to a file handle with the right encoding set. The 'replace' error handler just signals defeat to me. =) - Coburn
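
The idea from the last comment, writing names to a file handle with an explicit encoding instead of relying on the console, could look roughly like this. It is only a sketch: `write_report`, the report file name, and the sample names are hypothetical, and UTF-8 is an assumed choice of encoding. It uses `io.open`, which is not available on Python 2.5; there, `codecs.open` plays a similar role.

```python
import io
import os
import tempfile


def write_report(names, report_path, encoding='utf-8'):
    """Write one file name per line to report_path in the given encoding."""
    with io.open(report_path, 'w', encoding=encoding) as handle:
        for name in names:
            handle.write(name + u'\n')


# Hypothetical names, one plain ASCII and one Hebrew.
names = [u'mie.txt', u'\u05e2\u05d1\u05e8\u05d9\u05ea.txt']
report_path = os.path.join(tempfile.mkdtemp(), 'report.txt')
write_report(names, report_path)

# Read the report back with the same encoding to check nothing was lost.
with io.open(report_path, 'r', encoding='utf-8') as handle:
    round_tripped = handle.read().splitlines()
```

Because the encoding is chosen by the script rather than by the console, no 'replace' handler is needed and the names survive the round trip unchanged.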
L
3

It looks like a Unicode vs ASCII issue - os.listdir is returning a list of ASCII strings.

Edit: I tried it on Python 3.0, also on XP SP2, and os.listdir simply omitted the Hebrew filenames instead of listing them.

According to the docs, this means it was unable to decode it:

Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.

Lublin answered 30/1, 2009 at 21:25 Comment(1)
I guess I could try, but it won't help me as I can't move to 3.0 at the moment. I'm sure there should be a solution for 2.5. - Choanocyte
C
1

It works like a charm using Python 2.5.1 on OS X:

subdir/bar.txt True
subdir/foo.txt True
subdir/עִבְרִית.txt True

Maybe that means that this has to do with Windows XP somehow?

EDIT: I also tried with unicode strings, to try to mimic the Windows behaviour better:

for f in os.listdir(u'subdir'):
  pf = os.path.join(u'subdir', f)
  print pf, os.path.exists(pf)

subdir/bar.txt True
subdir/foo.txt True
subdir/עִבְרִית.txt True

That is in the Terminal (the stock OS X command prompt app). Using IDLE it still worked but didn't print the filename correctly. To make sure it really is Unicode there, I checked:

>>> os.listdir(u'listdir')[2]
u'\u05e2\u05b4\u05d1\u05b0\u05e8\u05b4\u05d9\u05ea.txt'
Coburn answered 30/1, 2009 at 21:38 Comment(3)
I think it has to do with the fact that Windows stores all filenames in Unicode. See my own partial answer. - Choanocyte
Curious. If I just pass pf to print, it throws an encoding exception. It must be expecting ASCII. - Choanocyte
I think piro has it nailed down (in the comment to another answer on this question). It's the encoding of stdout. - Coburn
R
0

A question mark is the more or less universal symbol displayed when a Unicode character can't be represented in a specific encoding. Your terminal or interactive session under Windows is probably using ASCII, ISO-8859-1, or something similar. So the actual string is Unicode, but it gets translated to ???? when printed to the terminal. That's why it works for PEZ, using OS X.
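
As a small illustration of where the question marks come from, here is a sketch that pushes a path through a narrow encoding with the 'replace' error handler. The helper `as_console_sees_it` and the four-letter Hebrew name are made up for the example, and ASCII stands in for whatever encoding the console actually uses.

```python
def as_console_sees_it(text, console_encoding='ascii'):
    """Round-trip text through console_encoding, replacing what does not fit."""
    return text.encode(console_encoding, 'replace').decode(console_encoding)


# Hypothetical path with a four-letter Hebrew file name.
hebrew_path = u'test_source\\\u05e9\u05dc\u05d5\u05dd.txt'

shown = as_console_sees_it(hebrew_path)
print(shown)  # test_source\????.txt

# With an encoding that covers Hebrew, nothing is lost.
unchanged = as_console_sees_it(hebrew_path, 'utf-8')
```

Each character the encoding cannot represent becomes one `?`, while the underlying string is untouched.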

Rohr answered 30/1, 2009 at 22:27 Comment(1)
Can I make the windoze terminal display Unicode? What does OS X do to show it nicely? - Choanocyte
