How can I open files containing accents in Java?

(editing for clarification and adding some code)

We need to parse data sent by users all over the world. Our Linux systems have a default locale of en_US.UTF-8. However, we often receive files with diacritical marks in their names, such as "special_á_ã_è_characters.doc". The OS handles these files fine, and an strace shows the OS passing the correct file name to the Java program, but Java mangles the names and throws a "file not found" IOException when trying to open them.

This simple program can illustrate the issue:

import java.io.*;
import java.text.*;

public class load_i18n
{
  public static void main( String [] args ) {
    File actual = new File(".");
    for( File f : actual.listFiles()){
      System.out.println( f.getName() );
    }
  }
}

Running this program in a directory containing the file special_á_ã_è_characters.doc and the default US English locale gives:

special_�_�_�_characters.doc

Setting the language via export LANG=es_ES@UTF-8 prints out the filename correctly (but is an unacceptable solution since the entire system is now running in Spanish.) Explicitly setting the Locale inside the program like the following has no effect either. Below I've modified the program to a) attempt to open the file and b) print out the name in both ASCII and as a byte array when it fails to open the file:

import java.io.*;
import java.util.Locale;
import java.text.*;

public class load_i18n
{
  public static void main( String [] args ) {
    // Stream to read file
    FileInputStream fin;

    Locale locale = new Locale("es", "ES");
    Locale.setDefault(locale);
    File actual = new File(".");
    System.out.println(Locale.getDefault());
    for( File f : actual.listFiles()){
      try {
        fin = new FileInputStream (f.getName());
      }
      catch (IOException e){
        System.err.println ("Can't open the file " + f.getName() + ".  Printing as byte array.");
        byte[] textArray = f.getName().getBytes();
        for(byte b: textArray){
          System.err.print(b + " ");
        }
        System.err.println();
        System.exit(-1);
      }

      System.out.println( f.getName() );
    }
  }
}

This produces the output

es_ES
load_i18n.class
Can't open the file special_�_�_�_characters.doc.  Printing as byte array.
115 112 101 99 105 97 108 95 -17 -65 -67 95 -17 -65 -67 95 -17 -65 -67 95 99 104 97 114 97 99 116 101 114 115 46 100 111 99

This shows that the issue is NOT just an issue with console display as the same characters and their representations are output in byte or ASCII format. In fact, console display does work even when using LANG=en_US.UTF-8 for some utilities like bash's echo:

[mjuric@arrhchadm30 tmp]$ echo $LANG
en_US.UTF-8
[mjuric@arrhchadm30 tmp]$ echo *
load_i18n.class special_á_ã_è_characters.doc
[mjuric@arrhchadm30 tmp]$ ls
load_i18n.class  special_?_?_?_characters.doc
[mjuric@arrhchadm30 tmp]$

Is it possible to modify this code in such a way that when run under Linux with LANG=en_US.UTF-8, it reads the file name in such a way that it can be successfully opened?

Antiquate answered 18/6, 2010 at 18:58 Comment(9)
Your example does not show you trying to open those files, just print the name. Whether Java can open the file and whether your standard output console (which has nothing to do with Java) can render the characters correctly are two very different things. Show us the code that gave the IOException and give the IOException details and stacktrace.Heilungkiang
Check out the answers recommending the use of Java system properties (user.language, user.country, user.variant) here: https://mcmap.net/q/129572/-setting-java-locale-settingsOsburn
Sorry - I never get to the point of opening the file. A call to, say FileInputStream would fail because I can't pass it the correct name of the file. The file "special_�_�_�_characters.doc" doesn't exist. The file "special_á_ã_è_characters.doc" does, but my directory iteration never lists that.Antiquate
Thanks Lauri. I tried all of those tricks and none of them worked. I actually ran an strace (Linux) during one of the runs and the OS is passing the correct filename to Java, but when Java interprets what's passed from the getdents() system call, it gets mangled. Here's the relevant system call from strace: 21993 getdents64(3, {... {d_ino=119, d_off=1692303532, d_type=DT_REG, d_reclen=48, d_name="special_á_ã_è_characters.doc"} ... }, 4096) = 704 When Java reads that and I pass that to a function to open a file, it attempts to open "special_�_�_�_characters.doc" which doesn't exist.Antiquate
Mark J, Mark P's point is that you're not proving that you can't pass the correct name of the file to the open call; you're proving that you can't print it to the console. I am more or less willing to guarantee that 'f.getName()' returns the correct filename; the problem is with the println (and hence your console destination and encoding), not the listFiles().Shaughn
Thanks Cowan. I understand that and my point was that that's not the case. I've updated the code to show a test case for that.Antiquate
I tried your program on Windows, and it worked fine. contents in :C:\tmp\test åäöáãéè special_á_ã_è_characters.doc This may be platform specific. Relevant info may be OS JVM and what filesystem. (SMB share?)Sleepyhead
have you tried creating the stream with file object without going through getName()? I.e. fin = new FileInputStream (f);Swarthy
Ddimitrov - Excellent suggestion! Unfortunately, that didn't work either, which tells me again it's not a console issue. And Karlp - yes, it's very likely platform specific. This is Red Hat Linux 4.x with Sun JVM 1.5.0_16. Again, it works fine when I explicitly set the language environment, but when it's set to en_US.UTF-8 (which is standard), it fails.Antiquate

First, the character encoding used is not directly related to the locale. So changing the locale won't help much.

Second, the � is the Unicode replacement character U+FFFD. When its UTF-8 bytes are displayed as ISO-8859-1 instead of UTF-8, it shows up as ï¿½. Here's evidence:

System.out.println(new String("�".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½

So there are two problems:

  1. Your JVM is reading those special characters as � (U+FFFD).
  2. Your console is using ISO-8859-1 to display characters.

For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem is to be fixed in the console settings. If you're using for example Eclipse, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.


Update: As per your update:

byte[] textArray = f.getName().getBytes();

That should have been the following to exclude influence of platform default encoding:

byte[] textArray = f.getName().getBytes("UTF-8");

If that still displays the same, then the problem lies deeper. Which JVM exactly are you using? Run java -version. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ship with the GNU JVM or OpenJDK's JVM, and this argument may not work there.
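To check the byte dump from the question against this explanation, here is a minimal sketch (the class and method names are made up for illustration). It encodes U+FFFD as UTF-8 and then decodes those bytes as ISO-8859-1. The byte values match the -17 -65 -67 runs in the question's output, which suggests the name was already damaged inside the String before anything was printed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // Encode a string as UTF-8, independent of the platform default encoding.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes as ISO-8859-1, where every byte maps to exactly one character.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = utf8Bytes("\uFFFD");
        // Java bytes are signed, so 0xEF 0xBF 0xBD print as negative numbers.
        System.out.println(Arrays.toString(bytes)); // [-17, -65, -67]

        // Print the code points rather than the glyphs, so the result
        // does not depend on the console encoding.
        for (char c : asLatin1(bytes).toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+00EF U+00BF U+00BD
        }
        System.out.println();
    }
}
```

U+00EF, U+00BF and U+00BD are ï, ¿ and ½, which is the sequence a Latin-1 console shows for a single replacement character.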

Kilowatthour answered 18/6, 2010 at 19:10 Comment(5)
I tried that and it didn't work. java -Dfile.encoding=UTF-8 load_i18n es_ES special_�_�_�_characters.doc I'm probably wrong, but I'm not convinced there's a console issue yet. I redirect the output to a file so there's no console involved and I still get the same results. I do an "od -a" on the file and here's the relevant output: 0000200 e f i l e nl s p e c i a l _ o ? 0000220 = _ o ? = _ o ? = _ c h a r a c 0000240 t e r s . d o c nl r e a d _ i 1Antiquate
As to the first problem: that may be platform/JVM specific. Hard to tell from here on. As to the second problem: is the file written with an OutputStreamWriter using UTF-8 and viewed with a viewer supporting UTF-8?Kilowatthour
@Mark, not sure why you're passing the 'mangled' filename on the command line. The flow seems to be (1) Java gets correct filename from OS (2) Java writes filename to stdout, where it gets mangled (3) you take the mangled filename and pass it back in to a different tool (4) Java hands the mangled filename to the OS, which can't find the file. Fix (2), and the problem goes away; passing the MANGLED filename in (3) is just making things worse.Shaughn
Also - "I redirect the output to a file so there's no console involved and I still get the same results." -- do you mean redirect in code, using e.g. a Writer, or using your shell's command-line redirection? If the problem is Java's choice of encoding when writing to System.out, it's just those (incorrect) bytes which your shell will redirect into the file, making exactly the same problem.Shaughn
My file name is " 03. 滫¬«Ñ¡ (feat. Äô74).mp3 " and I get a file-not-found error from FileInputStream. Please help: I tried your suggestion but still get the same error.Basement

It is a bug in the JRE/JDK that has existed for years.

How to fix java when if refused to open a file with special character in filename?

File.exists() fails with unicode characters in name

I am now submitting a new bug report to them, because setting LC_ALL=en_us fixes some cases but still fails in others.

Recuperator answered 16/5, 2011 at 4:5 Comment(0)

It's a bug in the old-skool java File api, maybe just on a mac? Anyway, the new java.nio api works much better. I have several files containing unicode characters that failed to load using java.io... classes. After converting all my code to use java.nio.Path EVERYTHING started working. And I replaced apache FileUtils (which has the same problem) with java.nio.Files...

Bree answered 24/2, 2014 at 12:27 Comment(1)
This worked for me. The accepted answer did no good for my case.Hasdrubal

The Java system property file.encoding should match the console's character encoding. The property must be set when starting java on the command-line:

java -Dfile.encoding=UTF-8 …

Normally this happens automatically, because the console encoding is usually the platform default encoding, and Java will use the platform default encoding if you don't specify one explicitly.

Whyte answered 18/6, 2010 at 19:10 Comment(1)
file.encoding is for the file content not the file nameUpholsterer
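To see which of these settings a given JVM actually picked up, a small diagnostic along these lines may help. It is only a sketch: the class and method names are made up, and sun.jnu.encoding is an internal, undocumented property that Sun/Oracle and OpenJDK builds appear to use for file names, so treat it as an assumption and expect null on other JVMs.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingReport {
    // Collects the encoding-related settings reported by the running JVM.
    // Values may be null when a property or environment variable is not set.
    static Map<String, String> encodingSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("defaultCharset", Charset.defaultCharset().name());
        settings.put("file.encoding", System.getProperty("file.encoding"));
        // Assumption: internal property, not part of the documented API.
        settings.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        settings.put("LANG", System.getenv("LANG"));
        settings.put("LC_ALL", System.getenv("LC_ALL"));
        settings.put("LC_CTYPE", System.getenv("LC_CTYPE"));
        return settings;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : encodingSettings().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running it once normally and once with -Dfile.encoding=UTF-8 shows whether the flag changes anything other than file.encoding itself.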

I struggled with this issue all day! My previous (wrong) code was the same as yours:

for(File f : dir.listFiles()) {
 String filename = f.getName(); // The filename here is wrong !
 FileInputStream fis = new FileInputStream (filename);
}

and it did not work. (I'm using Oracle Java 1.7 on CentOS 6. LANG and LC_CTYPE are fr_FR.UTF-8 for all users except zimbra, which has LANG and LC_CTYPE=C. That is almost certainly the cause of this issue, but I can't change it without the risk that Zimbra stops working...)

So I decided to use the new classes of java.nio.file package (Files and Paths):

DirectoryStream<Path> paths = Files.newDirectoryStream(Paths.get(outputName));
for (Iterator<Path> iterator = paths.iterator(); iterator.hasNext();) {
  Path path = iterator.next();
  String filename = path.getFileName().toString(); // The filename here is correct
  ...
}

So if you are using Java 1.7, you should give the new classes in the java.nio.file package a try: it saved my day!

Hope it helps

Universally answered 19/11, 2013 at 17:24 Comment(0)

When using DirectoryStream, don't forget to close the stream (try-with-resources can help here).
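A minimal sketch of that pattern, assuming Java 7 or later (the class and method names are hypothetical). It lists a directory with Files.newDirectoryStream inside try-with-resources and opens each file through its Path rather than through a String name:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class NioListing {
    // Opens every regular file in dir and returns the file names, sorted.
    // Throws IOException if the directory or any file cannot be opened.
    static List<String> listReadableNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the directory handle even on failure.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    continue; // skip subdirectories and special files
                }
                // Open via the Path itself instead of a String file name.
                try (InputStream in = Files.newInputStream(entry)) {
                    names.add(entry.getFileName().toString());
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (String name : listReadableNames(dir)) {
            System.out.println(name);
        }
    }
}
```

Whether this avoids the mangling on a given system still depends on the JVM and the locale it was started with, so it is worth testing against a directory that contains one of the problem files.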

Demerit answered 3/12, 2013 at 15:16 Comment(0)
