Java Charset problem on linux

3

12

Problem: I have a string containing special characters which I convert to bytes and back again. The conversion works properly on Windows, but on Linux the special character is not converted properly. The default charset on Linux is UTF-8, as reported by Charset.defaultCharset().displayName().

However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.

How can I make it work with the UTF-8 default charset, without setting the -D option in a Unix environment?

Edit: I use jdk1.6.13.

Edit: this code snippet works with cs = "ISO-8859-1"; or cs = "UTF-8"; on Windows, but not on Linux:

        String x = "½";
        System.out.println(x);
        byte[] ba = x.getBytes(Charset.forName(cs));
        for (byte b : ba) {
            System.out.println(b);
        }
        String y = new String(ba, Charset.forName(cs));
        System.out.println(y);

~regards daed

Lynelllynelle answered 30/1, 2010 at 15:22 Comment(1)
Can you please post your code? - Malindamalinde
11

Your characters are probably being corrupted by the compilation process and you're ending up with junk data in your class file.
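
One way to check whether the literal survived compilation is to dump the UTF-16 values of the string at run time. The sketch below is only an illustration; the class name LiteralCheck and the describe helper are made up for this example. A healthy "½" is a single char, U+00BD. Two chars (U+00C2 U+00BD) would suggest a UTF-8 source file that was compiled as ISO-8859-1 or windows-1252, while U+FFFD would suggest the reverse mix-up.

```java
final class LiteralCheck {

    // Hypothetical helper: renders each UTF-16 char of s as U+XXXX.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as a Unicode escape, so this line does not depend on the source encoding.
        String expected = "\u00bd";
        System.out.println(describe(expected));

        // What the literal turns into when the UTF-8 bytes C2 BD are read as ISO-8859-1.
        String corrupted = "\u00c2\u00bd";
        System.out.println(describe(corrupted));
    }
}
```

Calling describe(x) on the string from the question, on both machines, should show quickly whether the class file itself already contains the wrong characters.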

if i run on linux with option -Dfile.encoding=ISO-8859-1 it works properly..

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

In short, don't use -Dfile.encoding=...

    String x = "½";

Since U+00bd (½) will be represented by different values in different encodings:

windows-1252     BD
UTF-8            C2 BD
ISO-8859-1       BD

...you need to tell your compiler which encoding your source file uses:

javac -encoding ISO-8859-1 Foo.java
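
If you would rather not depend on the -encoding flag at all, a Unicode escape keeps the source file pure ASCII. The following sketch prints the bytes that U+00BD produces in each of the three encodings from the table above. The class name HalfBytes and the hex helper are illustrative and not part of the question's code.

```java
import java.nio.charset.Charset;

final class HalfBytes {

    // Illustrative helper: encodes s with the named charset and returns the bytes as hex.
    static String hex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same character as the literal in the question, independent of the source encoding.
        String x = "\u00bd";
        for (String cs : new String[] {"windows-1252", "UTF-8", "ISO-8859-1"}) {
            System.out.println(cs + " -> " + hex(x, cs));
        }
    }
}
```

The escape is harder to read than the literal, so this is a trade-off; for files with many non-ASCII characters, fixing the compiler flag is probably the more comfortable route.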

Now we get to this one:

    System.out.println(x);

System.out is a PrintStream, so it encodes the characters using the system default encoding before emitting the bytes, roughly like this:

    System.out.write(x.getBytes(Charset.defaultCharset()));

That may or may not work as you expect on some platforms - the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
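
If you know which encoding the console expects, one option is to wrap System.out in a PrintStream that encodes with that charset explicitly instead of relying on the default. A minimal sketch, assuming a UTF-8 terminal; the class name ExplicitConsole and the render helper are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ExplicitConsole {

    // Hypothetical helper: returns the bytes a PrintStream emits for s in the given encoding.
    static byte[] render(String s, String encoding) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(s);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the terminal expects UTF-8. Change the name if yours differs.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("\u00bd");

        // The same character costs two bytes in UTF-8 and one in ISO-8859-1.
        System.out.println(render("\u00bd", "UTF-8").length);
        System.out.println(render("\u00bd", "ISO-8859-1").length);
    }
}
```

This only moves the problem if the assumption about the terminal is wrong, so it is worth checking the terminal's own settings first.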

Phalange answered 30/1, 2010 at 16:18 Comment(1)
Many thanks. I completely forgot about this aspect (javac -encoding ISO-8859-1). I will check this out and get back. - Lynelllynelle
3

Your problem is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but as far as I know this is only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames on the local disk file system. And this doesn't fit the problem description you literally gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/into a pathname/filename or so? Or were you maybe printing to the stdout? Did the stdout itself use the proper encoding?

That said, why would you want to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.

(Sorry, this didn't fit in a comment, but I will update this with an answer once you have given more info about the actual functional requirement.)

Update: as per the comments, you basically just need to configure the stdout/cmd so that it uses the proper encoding to display those characters. In Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used in the Windows cmd do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack one or the other in the registry to add proper fonts. No word about Linux as I don't use it extensively, but it looks like -Dfile.encoding is somehow the way to go. After all, I think it's better to replace cmd with a cross-platform UI tool to display the characters the way you want, for example Swing.

Mcdonough answered 30/1, 2010 at 15:52 Comment(5)
Posted a code snippet. I am pretty much confused about this file.encoding. - Lynelllynelle
Okay. How about the stdout? The thing where the System.out.printXX() goes to. Did it use the proper encoding? E.g. if in an IDE, this is configurable in its preferences, or if in a command console, this is configurable in its preferences. I don't use Linux extensively, but it looks like -Dfile.encoding somehow does influence the stdout encoding in Linux's JVM. - Mcdonough
I am running it from the cmd prompt and printing it there. Also, I am copying the same class file compiled on Windows onto Linux and running it. - Lynelllynelle
Well, then you basically just need to configure the cmd so that it uses the proper encoding to display those characters. Just to test, try to write those chars into a file (not as file name, but as file content) using OutputStreamWriter(file, encoding) and you should see that the characters are properly written (as long as your file viewer recognizes/uses the proper encoding to display them ;) ). - Mcdonough
Can you tell what config is needed on the cmd prompt? - Lynelllynelle
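
The file test suggested in the comments could look roughly like this. Note that OutputStreamWriter wraps an OutputStream rather than a file, so a FileOutputStream goes in between. The class name FileEncodingTest, the writeAndDump helper, and the temp-file prefix are placeholders for this sketch.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

final class FileEncodingTest {

    // Placeholder helper: writes text to a temp file in the given encoding,
    // then reads the raw bytes back and returns them as hex.
    static String writeAndDump(String text, String encoding) {
        try {
            File file = File.createTempFile("charset-test", ".txt");
            try {
                Writer writer = new OutputStreamWriter(new FileOutputStream(file), encoding);
                try {
                    writer.write(text);
                } finally {
                    writer.close();
                }
                StringBuilder hex = new StringBuilder();
                FileInputStream in = new FileInputStream(file);
                try {
                    int b;
                    while ((b = in.read()) != -1) {
                        if (hex.length() > 0) {
                            hex.append(' ');
                        }
                        hex.append(String.format("%02X", b));
                    }
                } finally {
                    in.close();
                }
                return hex.toString();
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String x = "\u00bd"; // escape avoids any dependence on the source encoding
        System.out.println("UTF-8      -> " + writeAndDump(x, "UTF-8"));
        System.out.println("ISO-8859-1 -> " + writeAndDump(x, "ISO-8859-1"));
    }
}
```

Looking at the bytes instead of the rendered character takes the console and the file viewer out of the picture, which is the point of the test.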
1

You should make the conversion explicitly:

    // Encode and decode with the same explicitly named charset.
    byte[] byteArray = "abcd".getBytes("ISO-8859-1");
    String decoded = new String(byteArray, "ISO-8859-1");

EDIT:

It seems that the problem is the encoding of your Java file. If it works on Windows, try compiling the source files on Linux with javac -encoding ISO-8859-1. This should solve your problem.
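
To see why naming the charset on both sides matters, here is a small sketch that round-trips U+00BD with matching charsets and then decodes ISO-8859-1 bytes as UTF-8, which is roughly what happens when the encoding and decoding sides disagree. The class name RoundTrip and the recode helper are illustrative only.

```java
import java.nio.charset.Charset;

final class RoundTrip {

    // Illustrative helper: encode with one charset, decode with another.
    static String recode(String s, String encodeWith, String decodeWith) {
        byte[] bytes = s.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        String x = "\u00bd";

        // Matching charsets give back the original string.
        System.out.println(x.equals(recode(x, "ISO-8859-1", "ISO-8859-1")));
        System.out.println(x.equals(recode(x, "UTF-8", "UTF-8")));

        // The lone byte BD is not valid UTF-8, so the decoder substitutes U+FFFD.
        String broken = recode(x, "ISO-8859-1", "UTF-8");
        System.out.println(Integer.toHexString(broken.charAt(0)));

        // UTF-8 bytes C2 BD read as ISO-8859-1 become two chars, U+00C2 and U+00BD.
        System.out.println(recode(x, "UTF-8", "ISO-8859-1").length());
    }
}
```

Charset.forName is used here because it is available on the JDK 1.6 mentioned in the question.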

Hild answered 30/1, 2010 at 15:26 Comment(3)
or new String(bytes, "iso-8859-1") in this case, of course. - Roselani
Thanks for responding. I tried using UTF-8 exactly as above, and on Windows I still get correct results. But I don't do that on Linux since it uses UTF-8 by default, yet it is unable to decode. It appears to me as if UTF-8 is different on Windows and Linux? - Lynelllynelle
UTF-8 is the same everywhere. Could you please check the .java file encoding? Sometimes there are subtle bugs when moving files from one platform to the other. - Uveitis
