Python3 Qt unicode file name problems

Similar to

QDir and QDirIterator ignore files with non-ASCII filenames

and

UnicodeEncodeError: 'latin-1' codec can't encode character

With regard to the second link above, I added test0() below. My understanding was that UTF-8 was the solution I was looking for, but alas, trying to encode the file name fails.

def test0():
    print("test0...using unicode literal")
    name = u"123c\udcb4.wav"
    test("test0b",  name)

    n = name.encode('utf-8') 
    print(n)
    n = QtCore.QFile.decodeName(n)
    print(n)

# From http://docs.python.org/release/3.0.1/howto/unicode.html
# This will indeed overwrite the correct file!
#    f = open(name, 'w')
#    f.write('blah\n')
#    f.close()

Test0 results...

test0...using unicode literal
test0b QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test0b QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test0b os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True
test0b os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True

Traceback (most recent call last):
  File "unicode.py", line 157, in <module>
    test0()
  File "unicode.py", line 42, in test0
    n = name.encode('utf-8') 
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed

EDIT

Further reading from https://www.rfc-editor.org/rfc/rfc3629 tells me that "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF". So if UTF-8 doesn't allow these characters, how are you supposed to deal with a file that is so named? Python can create such files and test for their existence. So this points me at an issue with my Qt API usage or the Qt API itself?!
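The strict-codec failure can be reproduced without touching the filesystem. The following is only a minimal sketch: `has_surrogate_escapes` and `strict_utf8_ok` are hypothetical helper names, not part of Python or Qt. The first flags names containing the lone surrogates U+DC80 to U+DCFF that Python uses as stand-ins for undecodable bytes; the second checks whether the strict UTF-8 codec accepts a name.

```python
def has_surrogate_escapes(name):
    """Return True if name contains a lone surrogate in U+DC80..U+DCFF.

    Python uses these code points as placeholders for bytes it could not
    decode, so such a name is not a well-formed Unicode string.
    """
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def strict_utf8_ok(name):
    """Return True if name can be encoded with the strict UTF-8 codec."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

A check like this could be used to decide up front which names are safe to hand to an API that expects well-formed Unicode.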

I am struggling to wrap my head around proper handling of unicode file names in Python3. Ultimately, I'm working on a Phonon based music player. I've tried to isolate the problem(s) from that as much as possible. From the code below you will see that I've tried as many alternatives as I can find. My initial reaction is that there are bugs here...maybe mine...maybe in one or more libraries. Any help would be much appreciated!

I have a directory with 3 unicode file names 123[abc]U.wav. The first 2 files are handled properly...mostly...the third one 123c is just wrong.

from PyQt4 import QtGui,  QtCore
import sys,  os

def test(_name,  _file):
#    print(_name,  repr(_file))
    f = QtCore.QFile(_file)
#    f = QtCore.QFile(QtCore.QFile.decodeName(test))
    exists = f.exists()
    try:
        print(_name,  "QFile.exists",  f.fileName(),  exists)
    except UnicodeEncodeError as e:
        print(e,  repr(_file),  exists)
    fileInfo = QtCore.QFileInfo(_file)
    exists = fileInfo.exists()
    try:
        print(_name,  "QFileInfo.exists",  fileInfo.fileName(),  exists)
    except UnicodeEncodeError as e:
        print(e,  repr(_file),  exists)
    exists = os.path.exists(_file)
    try:
        print(_name,  "os.path.exists",  _file,  exists)
    except UnicodeEncodeError as e:
        print(e,  repr(_file),  exists)
    exists = os.path.isfile(_file)
    try:
        print(_name,  "os.path.isfile",  _file,  exists)
    except UnicodeEncodeError as e:
        print(e,  repr(_file),  exists)
    print()

def test1():
    args = QtGui.QApplication.arguments()
    print("test1...using QtGui.QApplication.arguments()")
    test("test1",  args[1])

def test2():
    print("test2...using sys.argv")
    test("test2",  sys.argv[1])

def test3():
    print("test3...QtGui.QFileDialog.getOpenFileName()")
    name = QtGui.QFileDialog.getOpenFileName()
    test("test3",  name)

def test4():
    print("test4...QtCore.QDir().entryInfoList()")
    p = os.path.abspath(__file__)
    p,  _ = os.path.split(p)
    d = QtCore.QDir(p)
    for inf in d.entryInfoList(QtCore.QDir.AllEntries|QtCore.QDir.NoDotAndDotDot|QtCore.QDir.System):
        print("test4",  inf.fileName())
#        if str(inf.fileName()).startswith("123c"):
        if u"123c\ufffd.wav" == inf.fileName():
#        if u"123c\udcb4.wav" == inf.fileName(): # This check fails..even tho that is what is reported in error messages for test2
            test("test4a",  inf.fileName())
            test("test4b",  inf.absoluteFilePath())

def test5():
    print("test5...os.listdir()")
    p = os.path.abspath(__file__)
    p,  _ = os.path.split(p)
    dirList = os.listdir(p)
    for file in dirList:
        fullfile = os.path.join(p, file)
        try:
            print("test5",  file)
        except UnicodeEncodeError as e:
            print(e)
        print("test5",  repr(fullfile))
#        if u"123c\ufffd.wav" == file: # This check fails..even tho it worked in test4
        if u"123c\udcb4.wav" == file:
            test("test5a",  file)
            test("test5b",  fullfile)
        print()

def test6():
    print("test6...Phonon and QtGui.QFileDialog.getOpenFileName()")
    from PyQt4.phonon import Phonon

    class Window(QtGui.QDialog):
        def __init__(self):
            QtGui.QDialog.__init__(self, None)
            self.mediaObject = Phonon.MediaObject(self)
            self.audioOutput = Phonon.AudioOutput(Phonon.MusicCategory, self)
            Phonon.createPath(self.mediaObject, self.audioOutput)
            self.mediaObject.stateChanged.connect(self.handleStateChanged)

            name = QtGui.QFileDialog.getOpenFileName()# works with python3..not for 123c
#            name = QtGui.QApplication.arguments()[1] # works with python2..but not python3...not for 123c
#            name = sys.argv[1] # works with python3..but not python2...not for 123c

#            p = os.path.abspath(__file__)
#            p,  _ = os.path.split(p)
#            print(p)
#            name = os.path.join(p, str(name))

            self.mediaObject.setCurrentSource(Phonon.MediaSource(name))
            
            self.mediaObject.play()

        def handleStateChanged(self, newstate, oldstate):
            if newstate == Phonon.PlayingState:
                source = self.mediaObject.currentSource().fileName()
                print('test6 playing: :', source)
            elif newstate == Phonon.StoppedState:
                source = self.mediaObject.currentSource().fileName()
                print('test6 stopped: :', source)
            elif newstate == Phonon.ErrorState:
                source = self.mediaObject.currentSource().fileName()
                print('test6 ERROR: could not play:', source)
    win = Window()
    win.resize(200, 100)
#    win.show()
    win.exec_()

def timerTick():
    QtGui.QApplication.exit()
    
if __name__ == '__main__':

    app = QtGui.QApplication(sys.argv)
    app.setApplicationName('unicode_test')

    test1()
    test2()
    test3()
    test4()
    test5()
    test6()
    timer = QtCore.QTimer()
    timer.timeout.connect(timerTick)
    timer.start(1)
    sys.exit(app.exec_())

Test results with 123a...

python3 unicode.py 123a�.wav 
test1...using QtGui.QApplication.arguments()
test1 QFile.exists unknown False
test1 QFileInfo.exists unknown False
test1 os.path.exists unknown False
test1 os.path.isfile unknown False

test2...using sys.argv
test2 QFile.exists 123a�.wav True
test2 QFileInfo.exists 123a�.wav True
test2 os.path.exists 123a�.wav True
test2 os.path.isfile 123a�.wav True

test3...QtGui.QFileDialog.getOpenFileName()
test3 QFile.exists /home/mememe/Desktop/test/unicode/123a�.wav True
test3 QFileInfo.exists 123a�.wav True
test3 os.path.exists /home/mememe/Desktop/test/unicode/123a�.wav True
test3 os.path.isfile /home/mememe/Desktop/test/unicode/123a�.wav True

test4...QtCore.QDir().entryInfoList()
test4 123a�.wav
test4 123bÆ.wav
test4 123c�.wav
test4a QFile.exists 123c�.wav False
test4a QFileInfo.exists 123c�.wav False
test4a os.path.exists 123c�.wav False
test4a os.path.isfile 123c�.wav False

test4b QFile.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test4b QFileInfo.exists 123c�.wav False
test4b os.path.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test4b os.path.isfile /home/mememe/Desktop/test/unicode/123c�.wav False

test4 unicode.py
test5...os.listdir()
test5 unicode.py
test5 '/home/mememe/Desktop/test/unicode/unicode.py'

test5 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed
test5 '/home/mememe/Desktop/test/unicode/123c\udcb4.wav'
test5a QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test5a QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test5a os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True
test5a os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True

test5b QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' False
test5b QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' False
test5b os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' True
test5b os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' True


test5 123bÆ.wav
test5 '/home/mememe/Desktop/test/unicode/123bÆ.wav'

test5 123a�.wav
test5 '/home/mememe/Desktop/test/unicode/123a�.wav'

test6...Phonon and QtGui.QFileDialog.getOpenFileName()
test6 stopped: : /home/mememe/Desktop/test/unicode/123a�.wav
test6 playing: : /home/mememe/Desktop/test/unicode/123a�.wav
test6 stopped: : /home/mememe/Desktop/test/unicode/123a�.wav

Test results with 123b...

python3 unicode.py 123bÆ.wav 
test1...using QtGui.QApplication.arguments()
test1 QFile.exists 123b.wav False
test1 QFileInfo.exists 123b.wav False
test1 os.path.exists 123b.wav False
test1 os.path.isfile 123b.wav False

test2...using sys.argv
test2 QFile.exists 123bÆ.wav True
test2 QFileInfo.exists 123bÆ.wav True
test2 os.path.exists 123bÆ.wav True
test2 os.path.isfile 123bÆ.wav True

test3...QtGui.QFileDialog.getOpenFileName()
test3 QFile.exists /home/mememe/Desktop/test/unicode/123bÆ.wav True
test3 QFileInfo.exists 123bÆ.wav True
test3 os.path.exists /home/mememe/Desktop/test/unicode/123bÆ.wav True
test3 os.path.isfile /home/mememe/Desktop/test/unicode/123bÆ.wav True

test4...QtCore.QDir().entryInfoList()
test4 123a�.wav
test4 123bÆ.wav
test4 123c�.wav
test4a QFile.exists 123c�.wav False
test4a QFileInfo.exists 123c�.wav False
test4a os.path.exists 123c�.wav False
test4a os.path.isfile 123c�.wav False

test4b QFile.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test4b QFileInfo.exists 123c�.wav False
test4b os.path.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test4b os.path.isfile /home/mememe/Desktop/test/unicode/123c�.wav False

test4 unicode.py
test5...os.listdir()
test5 unicode.py
test5 '/home/mememe/Desktop/test/unicode/unicode.py'

test5 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed
test5 '/home/mememe/Desktop/test/unicode/123c\udcb4.wav'
test5a QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test5a QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test5a os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True
test5a os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True

test5b QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' False
test5b QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' False
test5b os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' True
test5b os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' True


test5 123bÆ.wav
test5 '/home/mememe/Desktop/test/unicode/123bÆ.wav'

test5 123a�.wav
test5 '/home/mememe/Desktop/test/unicode/123a�.wav'

test6...Phonon and QtGui.QFileDialog.getOpenFileName()
test6 stopped: : /home/mememe/Desktop/test/unicode/123bÆ.wav
test6 playing: : /home/mememe/Desktop/test/unicode/123bÆ.wav
test6 stopped: : /home/mememe/Desktop/test/unicode/123bÆ.wav

Test results with 123c...

python3 unicode.py 123c�.wav 
test1...using QtGui.QApplication.arguments()
test1 QFile.exists unknown False
test1 QFileInfo.exists unknown False
test1 os.path.exists unknown False
test1 os.path.isfile unknown False

test2...using sys.argv
test2 QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test2 QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test2 os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True
test2 os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True

test3...QtGui.QFileDialog.getOpenFileName()
test3 QFile.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test3 QFileInfo.exists 123c�.wav False
test3 os.path.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test3 os.path.isfile /home/mememe/Desktop/test/unicode/123c�.wav False

test4...QtCore.QDir().entryInfoList()
test4 123a�.wav
test4 123bÆ.wav
test4 123c�.wav
test4a QFile.exists 123c�.wav False
test4a QFileInfo.exists 123c�.wav False
test4a os.path.exists 123c�.wav False
test4a os.path.isfile 123c�.wav False

test4b QFile.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test4b QFileInfo.exists 123c�.wav False
test4b os.path.exists /home/mememe/Desktop/test/unicode/123c�.wav False
test4b os.path.isfile /home/mememe/Desktop/test/unicode/123c�.wav False

test4 unicode.py
test5...os.listdir()
test5 unicode.py
test5 '/home/mememe/Desktop/test/unicode/unicode.py'

test5 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed
test5 '/home/mememe/Desktop/test/unicode/123c\udcb4.wav'
test5a QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test5a QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' False
test5a os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True
test5a os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '123c\udcb4.wav' True

test5b QFile.exists 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' False
test5b QFileInfo.exists 'utf-8' codec can't encode character '\udcb4' in position 4: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' False
test5b os.path.exists 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' True
test5b os.path.isfile 'utf-8' codec can't encode character '\udcb4' in position 38: surrogates not allowed '/home/mememe/Desktop/test/unicode/123c\udcb4.wav' True


test5 123bÆ.wav
test5 '/home/mememe/Desktop/test/unicode/123bÆ.wav'

test5 123a�.wav
test5 '/home/mememe/Desktop/test/unicode/123a�.wav'

test6...Phonon and QtGui.QFileDialog.getOpenFileName()
test6 stopped: : /home/mememe/Desktop/test/unicode/123c�.wav

Interesting things to note about the test results...

  • Test1 failed for all 3 files.
  • Test2 passed for all 3 files...except for the QFile and QFileInfo tests for 123c
  • Test3 passed for 123a and 123b but failed for 123c
  • Test4 ...QDir found all 4 files in the directory
  • Test4a and Test4b failed for all files
  • Test5 ...os.listdir found all 4 files in the directory
  • NOTE: The Test5a and test5b checks had to use a different unicode check?!
  • Test5a and Test5b failed the QFile and QFileInfo tests, but passed the os.path checks.
  • Test6 passed for 123a and 123b, but failed for 123c...the Phonon player reported only a stopped message, versus the stopped, playing, stopped sequence that the 123a and 123b files got.

I know that is a lot of information...I was trying to be thorough.

So, if there is one final question, it is this: what is the right way to deal with unicode file names in Python3?

Bickerstaff answered 9/10, 2013 at 22:49 Comment(0)

You're right, 123c is just wrong. The evidence shows that the filename on disk contains an invalid Unicode codepoint U+DCB4. When Python tries to print that character, it rightly complains that it can't. When Qt processes the character in test4 it can't handle it either, but instead of throwing an error it converts it to the Unicode REPLACEMENT CHARACTER U+FFFD. Obviously the new filename no longer matches what's on disk.

Python can also use the replacement character in a string instead of throwing an error if you do the conversion yourself and specify the proper error handling. I don't have Python 3 on hand to test this but I think it will work:

filename = filename.encode('utf-8').decode('utf-8', 'replace')
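On Python 3 the strict `encode('utf-8')` step itself raises for a name like 123c\udcb4.wav (the traceback in the question shows this), so a variant that reaches the 'replace' step has to let the surrogate through first. The sketch below is one way that might be done, assuming the name came from Python's own file APIs; `display_name` is a hypothetical helper, and its result is only suitable for display, not for opening the file.

```python
def display_name(filename):
    """Return a printable version of filename.

    Bytes that Python smuggled in as lone surrogates are restored with
    the 'surrogateescape' handler, then decoded again with 'replace' so
    each undecodable byte becomes U+FFFD (the replacement character).
    """
    raw = filename.encode("utf-8", "surrogateescape")
    return raw.decode("utf-8", "replace")
```

For the file in question this yields 123c\ufffd.wav, the same string Qt reports in test4, which is why it no longer matches the name on disk.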
Dyad answered 10/10, 2013 at 15:0 Comment(7)
Python3 will happily allow... name = u"123c\udcb4.wav"; test("test0b", name); f = open(name, 'w'); f.write('blah\n'); f.close() ...which will create an invalid utf-8 named file. Should it? - Bickerstaff
@shao.lo, I don't know all the details but I suspect Python doesn't care about invalid characters until it tries to encode or decode them. Whether an encode or decode takes place for opening a file will depend on the OS and Python internals. It might work on Windows but fail on Linux or vice versa. You might even see a difference between Python 3.2 and 3.3. - Dyad
So the moral of the story is that not all unicode strings can be represented in utf-8. Whatever oddball encoding that \udcb4 character came from, while valid for that encoding, it cannot be converted to utf-8. Any API that relies on utf-8 will not handle such file names. This apparently includes Qt. - Bickerstaff
All unicode strings can be represented in utf-8. But a file name can contain character sequences that cannot be converted to unicode. Therefore there is no string that can produce this filename by encoding to utf-8. You can't create a file with such a name using software that works with unicode. But it's technically possible if you use non-unicode programs. A file with such a name becomes inaccessible for most unicode-based programs. - Limburger
@shao.lo, your conclusion is a little off. Any encoding can be converted to utf-8, but first it must be converted from whatever encoding it originated in. Unicode wasn't designed to completely cover every bit pattern you might feed it. - Dyad
So does the Python literal u"123c\udcb4.wav" not represent a valid unicode string? If so, how would you know what encoding it originated in? - Bickerstaff
@shao.lo, right, it's not a valid Unicode string. To get the original encoding you can make an educated guess based on where the file originated or write the string to a file then look here: #91338 - Dyad

Codes like "\udcb4" come from the surrogateescape error handler. It is Python's way of preserving bytes that cannot be interpreted as valid UTF-8: each undecodable byte in the range 0x80 to 0xFF is decoded to a lone surrogate in the range U+DC80 to U+DCFF, so the byte 0xB4 becomes "\udcb4". Encoding with the same error handler reverses the mapping, so "\udcb4" becomes 0xB4 again. Surrogate escape makes it possible to deal with arbitrary byte sequences in file names, but you need to pass errors="surrogateescape" yourself whenever you encode or decode such names, as documented in the Unicode HOWTO https://docs.python.org/3/howto/unicode.html
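The round trip can be sketched as follows. The byte string is the on-disk name reported for 123c elsewhere in this thread; `to_text` and `to_bytes` are hypothetical helper names, and the Latin-1 decoding at the end is only an educated guess at the file's original encoding.

```python
# On-disk bytes of the problem file name, as reported in this thread.
RAW_NAME = b"123c\xb4.wav"


def to_text(raw):
    """Decode file-name bytes, keeping undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(text):
    """Encode a file name, turning lone surrogates back into the original bytes."""
    return text.encode("utf-8", "surrogateescape")


# Assumption: if the name was originally written as Latin-1, 0xB4 is U+00B4.
LATIN1_GUESS = RAW_NAME.decode("latin-1")
```

The standard library functions os.fsdecode() and os.fsencode() apply the same error handler using the filesystem encoding, so on a UTF-8 system they perform this conversion for you.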

Kneedeep answered 9/1, 2015 at 16:46 Comment(0)

Python2 vs Python3

python
Python 2.7.4 (default, Sep 26 2013, 03:20:56) 
>>> import os
>>> os.listdir('.')
['unicode.py', '123c\xb4.wav', '123b\xc3\x86.wav', '123a\xef\xbf\xbd.wav']
>>> os.path.exists(u'123c\xb4.wav')
False
>>> os.path.exists('123c\xb4.wav')
True

>>> n ='123c\xb4.wav'
>>> print(n)
123c�.wav
>>> n =u'123c\xb4.wav'
>>> print(n)
123c´.wav

That acute accent (´, U+00B4) on the last line above is what I've been looking for, versus that �.

The same directory listed with Python3 shows a different set of filenames

python3
Python 3.3.1 (default, Sep 25 2013, 19:30:50) 
>>> import os
>>> os.listdir('.')
['unicode.py', '123c\udcb4.wav', '123bÆ.wav', '123a�.wav']
>>> os.path.exists('123c\udcb4.wav')
True

Is this a bug in Python3?
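The '\udcb4' form appears to be the documented surrogate-escape behaviour rather than a bug, and Python 3 can still return the same raw bytes that Python 2 showed if the directory is passed as bytes. The sketch below assumes a POSIX filesystem that accepts arbitrary bytes in file names (as on typical Linux systems); `list_names_both_ways` and `demo` are hypothetical helpers written for illustration.

```python
import os
import tempfile


def list_names_both_ways(directory):
    """Return (bytes_name, str_name) pairs for each entry in directory."""
    # Passing a bytes path makes os.listdir() return undecoded bytes names.
    raw_names = sorted(os.listdir(os.fsencode(directory)))
    return [(raw, os.fsdecode(raw)) for raw in raw_names]


def demo():
    """Create a scratch directory with one non-UTF-8 file name and list it."""
    scratch = tempfile.mkdtemp()
    raw_path = os.path.join(os.fsencode(scratch), b"123c\xb4.wav")
    with open(raw_path, "wb"):
        pass  # an empty file is enough for the listing
    return list_names_both_ways(scratch)
```

The bytes name is what is actually stored on disk; the str name is Python's reversible view of it, and os.fsencode() turns it back into the identical bytes.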

Bickerstaff answered 10/10, 2013 at 22:34 Comment(0)
