Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

"草鷗外.gif" which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif
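
For readers who want to reproduce that breakdown, here is a minimal sketch. The class and method names are made up for illustration, and the string is built from code points so the source file's own encoding does not matter. It assumes the missing ideograph is U+26FF6, which is what the \uD85B\uDFF6 pair decodes to.

```java
final class SurrogatePairSketch {

    // Renders every UTF-16 code unit of s as a backslash-u escape (hypothetical helper).
    static String utf16Escapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The filename stem: U+8349, U+26FF6 (outside the BMP), U+9DD7, U+5916.
        String stem = new StringBuilder()
                .appendCodePoint(0x8349)
                .appendCodePoint(0x26FF6)
                .appendCodePoint(0x9DD7)
                .appendCodePoint(0x5916)
                .toString();

        System.out.println(utf16Escapes(stem));
        System.out.println("chars: " + stem.length());
        System.out.println("code points: " + stem.codePointCount(0, stem.length()));
    }
}
```

The string holds four code points but five Java chars, because U+26FF6 takes a high and a low surrogate.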

If I create a File from this filename, I can't open it because I get a FileNotFoundException. Even using this on the folder containing the file will fail:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草鷗外.gif"));
// operations follow

Is there some way I can address this problem, either escaping the filenames or opening files differently?

Houchens answered 9/10, 2009 at 19:21 Comment(4)
What's the value of Charset.defaultCharset() in your environment? – Paraffinic
(Unfortunately, StackOverflow also has a problem with surrogates, and has stripped the U+26FF6 ideograph from the question) – Carvajal
Can you provide what System.getProperty("file.encoding") returns? Try changing your encoding with java -Dfile.encoding=ENCODING_GOES_HERE; if that does not work, change your system locale. If that also does not work, we will wait for an expert to solve it. – Fanfaronade
The charset and file encoding are both UTF-8. – Houchens

I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. Problem is if you round-trip that through a real UTF-8 encoder you've got a different string, the four-byte one above. Try to access the file with that name and boom! fail.
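
The two byte sequences above can be reproduced from Java itself. This is only a sketch with made-up class and helper names. It relies on DataOutputStream.writeUTF, which is documented to write modified UTF-8 after a two-byte length prefix. It says nothing about which encoding the Mac JVM actually uses for filenames.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Sketch {

    // Standard UTF-8: one four-byte sequence per supplementary character.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: each surrogate
    // is encoded separately as three bytes. The 2-byte length prefix is dropped.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated uppercase hex (hypothetical helper).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String ideograph = new String(Character.toChars(0x26FF6));
        System.out.println(hex(standardUtf8(ideograph))); // F0 A6 BF B6
        System.out.println(hex(modifiedUtf8(ideograph))); // ED A1 9B ED BF B6
    }
}
```

A filename encoded the second way and then looked up using the first encoding is a different byte string, which would explain a lookup failure.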

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
Carvajal answered 9/10, 2009 at 20:31 Comment(3)
Not really a bug as it's part of the spec (even if it is often confusing). – Unfortunate
The result of the Python commands was the proper filename you listed first, so it must be Java not playing nice. – Houchens
Oh, that's unfortunate. Even if you detected the broken-CESU-8 situation, I can't think of any way to work around it and get a byte-oriented filename interface. :-( You might have to explicitly disallow the surrogates until such time as Sun fix it. How poor. – Carvajal

If your environment's default locale does not include those characters you cannot open the file.

See: File.exists() fails with unicode characters in name

Edit: All right, what you need to do is change the system locale, whatever OS you are using.

Edit:

See: How can I open files containing accents in Java?

See: JFileChooser on Mac cannot see files named by Chinese chars?

Fanfaronade answered 9/10, 2009 at 19:35 Comment(3)
Is it not possible to do this without changing the system locale? The program I am building will need to run on any locale, and I should be able to input these characters and deal with these files even in a US/English locale. – Houchens
Bad solution: the app runs on users' machines, not on my computer. They have different locales, and they do not have the administrator rights needed to change them. – Kindling
AFAIK there is no other solution. This limitation comes with Sun/Oracle Java. You can try JFileChooser if displaying a save dialog to your users is OK for you. – Fanfaronade

This turned out to be a problem with the Mac JVM (tested on 1.5 and 1.6). Filenames containing supplementary characters / surrogate pairs cannot be accessed with the Java File class. I ended up writing a JNI library with Carbon calls for the Mac version of the project (ick). I suspect the CESU-8 issue bobince mentioned, as the JNI call to get UTF-8 characters returned a CESU-8 string. Doesn't look like it's something you can really get around.

Houchens answered 25/11, 2009 at 21:5 Comment(0)

It's a bug in the old-school java.io.File API, maybe only on the Mac? Anyway, the newer java.nio API works much better. I had several files with Unicode characters in their names and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. I also replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.file.Files...

...and be sure to read and write the content of file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
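
As a rough sketch of what that conversion might look like (Java 7 or later), the class and method names below are made up for illustration. Whether this resolves the surrogate case on a particular JVM and filesystem is not verified here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class NioFileAccessSketch {

    // Lists entry names through java.nio rather than File.listFiles().
    static List<String> listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                names.add(entry.getFileName().toString());
            }
        }
        return names;
    }

    // Reads a file's bytes through java.nio rather than FileInputStream.
    static byte[] readBytes(Path dir, String fileName) throws IOException {
        return Files.readAllBytes(dir.resolve(fileName));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nio-sketch");
        // Filename from the question, built from code points to avoid
        // depending on the source file's encoding.
        String name = new StringBuilder()
                .appendCodePoint(0x8349)
                .appendCodePoint(0x26FF6)
                .appendCodePoint(0x9DD7)
                .appendCodePoint(0x5916)
                .append(".gif")
                .toString();
        try {
            Files.write(dir.resolve(name), "test".getBytes(StandardCharsets.UTF_8));
            System.out.println("listed: " + listNames(dir).contains(name));
            System.out.println("bytes read: " + readBytes(dir, name).length);
        } catch (InvalidPathException e) {
            // Thrown when the platform's filename encoding cannot represent the name.
            System.out.println("Name not representable here: " + e.getReason());
        }
    }
}
```

If the listing does not contain the name, or the read throws, the problem sits below the Java API and a native workaround may still be needed.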

Euroclydon answered 24/2, 2014 at 12:34 Comment(0)