getBytes() With UTF-8 Doesn't Work for Upper-Case German Umlauts

For development I'm using ResourceBundle to read a UTF-8 encoded properties file (I set that encoding in Eclipse's file properties for that file) directly from my resources directory in the IDE (native2ascii is used on the way to production), e.g.:

menu.file.open.label=&Öffnen...
label.btn.add.name=&Hinzufügen
label.btn.remove.name=&Löschen

Since that causes issues with the character encoding when using non-ASCII characters I thought I'd be happy with:

ResourceBundle resourceBundle = ResourceBundle.getBundle("messages", Locale.getDefault());
String value = resourceBundle.getString(key);
value = new String(value.getBytes(), "UTF-8");

Well, it does work nicely for lower-case German umlauts, but not for the upper-case ones; the ß doesn't work either. Here's the value read with getString(key) and the value after the conversion with new String(value.getBytes(), "UTF-8"):

&LÃ¶schen => &Löschen
&HinzufÃ¼gen => &Hinzufügen

&Ã?ber => &??ber
&SchlieÃ?en => &Schlie??en
&Ã?ffnen... => &??ffnen...

The last three should be:

&Ã?ber => &Über
&SchlieÃ?en => &Schließen
&Ã?ffnen... => &Öffnen...

I guess that I'm not too far away from the truth, but what am I missing here?

Google found something similar, but that remained unanswered.

EDIT: a little more code

Slumlord answered 3/9, 2012 at 19:52 Comment(0)
S
0

Today I was talking to one of my colleagues and he was pretty much on the same path as the other answers. So I tried to achieve what Jon Skeet had suggested, meaning creating the same file as in production. Since rebuilding the project after each change of a resource is out of the question, and I hadn't done any of what solved this before (and I guess this will be new to some), let me outline it (even if it may be just for personal reference ;) ). In short, this uses Eclipse's project builders.

  1. Create an Ant-style build.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project>
        <property name="dir.resources" value="src/main/resources" />
        <property name="dir.target" value="bin/main" />
    
        <target name="native-to-ascii">
            <delete dir="${dir.target}" includes="**/*.properties" />
            <native2ascii src="${dir.resources}" dest="${dir.target}" includes="**/*.properties" />
        </target>
    </project>
    

    Its intention is to delete the properties-files in the target directory and use native2ascii to recreate them. The delete is necessary as native2ascii won't overwrite existing files.

  2. In Eclipse go to the project properties and select "Builders", click "New...", pick "Ant Builder" (that's the slightly enhanced editor for run configurations)
  3. In "Main" let "Buildfile" point to the Ant-script, set "Base Directory" to ${project_loc}
  4. In "Refresh" tick "Refresh resources upon completion" and pick "The project containing the selected resource"
  5. In "Targets" click "Set Targets" next to "Auto Build" and pick native-to-ascii there (note that for some reason I had to do this again later)
  6. This might not be necessary for everybody, but in "JRE" pick a proper execution environment
  7. In "Build Options" untick "Allocate Console" (however, you may want to keep it ticked until you see that it's all working)
  8. "Apply", "OK"
  9. I was told that the newly created builder should be somewhere underneath the Java Builder (use Up/Down-button)
  10. In the "Java Build Path" select the source folder with the resources (src/main/resources for me) and add an exclusion for **/*.properties

That should have been it. If you edit a properties-file and save it, it should automatically be converted to ASCII in the output folder. You can try with entering ü, which should end up as \u00fc.
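
To make the expected result concrete, here is a rough Java sketch of the kind of conversion native2ascii performs: every character outside 7-bit ASCII is written as a backslash-u escape with four lower-case hex digits. `AsciiEscaper` and `escapeNonAscii` are made-up names for illustration, and this simplified version ignores details the real tool may handle (such as input encoding options).

```java
public class AsciiEscaper {
    // Replaces each UTF-16 code unit above 0x7F with a backslash-u escape
    // (four lower-case hex digits); ASCII characters pass through unchanged.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below is "&L" + o-umlaut + "schen", written with an
        // escape so the source file itself stays ASCII-only.
        System.out.println(escapeNonAscii("&L\u00f6schen"));
    }
}
```

A file that contains only such escapes reads the same whether it is decoded as ISO 8859-1 or as UTF-8, which is why the converted copy loads correctly.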

Note that if you have a lot of properties-files, this may take some time. Just don't save after every keypress. :)

Slumlord answered 4/9, 2012 at 17:13 Comment(0)
S
6

The problem is you're calling String.getBytes() without specifying an encoding - which will use the default platform encoding. You're then using the binary result of that operation as if it were in UTF-8.

If you use UTF-8 in both directions, it'll be fine:

// Should be a round-trip
value = new String(value.getBytes("UTF-8"), "UTF-8");

... but if you were trying to use this to read a UTF-8-encoded property file without telling the code which is performing the initial read, that won't work.

The code you've presented is basically always the wrong approach. Your "Since that causes issues with the character encoding" suggests that you'd already run across an earlier problem - so I'd go back to that, instead of trying to apply a broken fix. If you've already lost data when constructing the ResourceBundle, it's too late to go back later... you need to make sure the ResourceBundle itself is loaded correctly.
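
As a minimal sketch of how the data gets lost, assume the file's UTF-8 bytes were decoded as ISO 8859-1 when the bundle was loaded, and that the platform default encoding is windows-1252 (the question doesn't state it, so that is a guess). `MojibakeDemo` and `brokenFix` are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulates the chain from the question:
    // 1. the file holds UTF-8 bytes,
    // 2. the loader decodes them as ISO 8859-1,
    // 3. the attempted fix re-encodes with the platform default
    //    and decodes the result as UTF-8.
    static String brokenFix(String original, Charset platformDefault) {
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(fileBytes, StandardCharsets.ISO_8859_1);
        byte[] reencoded = misread.getBytes(platformDefault); // may substitute '?'
        return new String(reencoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset assumedDefault = Charset.forName("windows-1252"); // assumption
        String[] words = {
            "L\u00f6schen",    // lower-case o-umlaut
            "\u00dcber",       // upper-case U-umlaut
            "Schlie\u00dfen"   // sharp s
        };
        for (String word : words) {
            String result = brokenFix(word, assumedDefault);
            System.out.println(word + " survives: " + word.equals(result));
        }
    }
}
```

Under these assumptions the second UTF-8 byte of ö and ü happens to map to a character that windows-1252 can encode, so the round trip is accidentally lossless. The second byte of Ö, Ü and ß lands in the 0x80-0x9F range, which ISO 8859-1 decodes to control characters that windows-1252 cannot encode, so getBytes() substitutes '?' and the original byte is gone for good.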

Please tell us exactly what problems you had with the ResourceBundle, and we can see if we can fix the root cause.

EDIT: It's not clear how you're running native2ascii. The fix may be as simple as changing to use:

native2ascii -encoding UTF-8 input.properties output.properties
Scold answered 3/9, 2012 at 19:56 Comment(10)
@sjngm: Well it will be with the round trip code, yes. That's the point. Your attempted "fix" is trying to manufacture good data out of bad data, and that doesn't work. getBytes() itself works exactly as specified - but you're trying to do something you shouldn't. Stop doing that, and explain the problem you were trying to solve. - Scold
Well, the problem is that the lower-case German umlauts work nicely. Therefore, I thought I wasn't doing everything wrong. I don't know how I could tell ResourceBundle differently. The problem is that it doesn't work for upper-case umlauts and "ß". - Slumlord
@sjngm: You haven't told us anything about what does go wrong though, or how you're diagnosing that. For example, if you're writing it to the console, that could well be the problem, with the string having exactly the right characters in. Give us details about the original problem - what you're seeing, where you're seeing it, how you've diagnosed it, etc. - Scold
I lined it out in my question, the code block with "Löschen" in it is the output of my attempted conversion, read by ResourceBundle => new String(getBytes(), "UTF-8") - Slumlord
@sjngm: The output where though? How are you running native2ascii, bearing in mind that the input file is UTF-8 (by the sounds of it), not your native encoding. - Scold
I mentioned native2ascii since we use that when building the code for production. In my IDE for development I access the messages files directly without it. The output is the output I get in the Console window in Eclipse and it looks similarly garbled in the GUI of my application. I'm aware that running the application in a Windows console results in totally different output with all German umlauts appearing garbled. - Slumlord
I really wish there were a (good, tolerable, accepted) way to forbid any of this “default platform encoding” stuff, which is always a lose. Monkey-patching libraries to forbid any versions of these functions that let you sneak by without specifying an encoding seems pretty extreme, though. - Differentia
@sjngm: If you're accessing different files for development and production, that's a problem to start with. You've got to get one consistent file format, or you've got no chance. - Scold
@tchrist: Agreed. Likewise defaulting the time zone and implicit uses of the current system time, IMO :) - Scold
Funny you should mention the time zone one; that one bit some coworkers really bad just last week. - Differentia
C
3

Some notes:

  • If it is a String it is UTF-16 and if it isn't it is a corrupt string (and too late to fix.)
  • new String(value.getBytes(), "UTF-8"); - this code will (at best) do nothing on a system that uses UTF-8 as the default encoding; otherwise it will corrupt the string.
  • .properties files must be ISO 8859-1 (the Properties type supports other formats and encodings, but I don't know how you would tell ResourceBundle that.)
  • System.out can introduce its own transcoding bugs (the PrintStream encodes UTF-16 strings to the default encoding; the receiving device must decode the bytes using the same encoding.)
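
To rule out the console as the culprit, one option is to print the code points of the string instead of the characters themselves, since hex digits are plain ASCII and survive any console encoding. This is only a sketch; `CodePointDump` and `describe` are names invented for the example.

```java
public class CodePointDump {
    // Renders each code point of the string as U+XXXX, separated by spaces,
    // so the content can be inspected without relying on the console encoding.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // A correctly loaded upper-case O-umlaut is the single code point U+00D6;
        // a mis-decoded one shows up as two code points (U+00C3 followed by another).
        System.out.println(describe("\u00d6ffnen"));
    }
}
```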

I suspect you are trying to fix your problems in the wrong place.
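
One place where it could be fixed, as a hedged sketch: `PropertyResourceBundle` has had a constructor taking a `Reader` since Java 6, so the caller can decide the encoding instead of relying on the ISO 8859-1 default. `Utf8BundleLoader` and `load` are hypothetical names, and how the `InputStream` is obtained (class path, file system) is left open here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class Utf8BundleLoader {
    // Reads a properties stream as UTF-8 by wrapping it in a Reader,
    // so no second "repair" conversion is needed afterwards.
    static ResourceBundle load(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 encoded messages file; the value starts with
        // "&" followed by an upper-case O-umlaut.
        byte[] utf8File = "menu.file.open.label=&\u00d6ffnen...\n"
                .getBytes(StandardCharsets.UTF_8);
        ResourceBundle bundle = load(new ByteArrayInputStream(utf8File));
        System.out.println(bundle.getString("menu.file.open.label"));
    }
}
```

This bypasses ResourceBundle.getBundle's lookup by locale, so it is more of a diagnostic or development aid than a drop-in replacement.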

Campeche answered 3/9, 2012 at 20:26 Comment(0)
W
2

You are encoding the text with a different encoding to the one you are decoding with.

Try instead using the same character set for encoding and decoding.

value = new String(value.getBytes("UTF-8"), "UTF-8");

String s = "ßßßßß";
s += s.toUpperCase();
s = new String(s.getBytes("UTF-8"), "UTF-8");
System.out.println(s);

prints

ßßßßßSSSSSSSSSS
Will answered 3/9, 2012 at 19:57 Comment(3)
No, the result is the original value. - Slumlord
@Slumlord as it should be. If you encode a String as bytes and then decode it, you should get what you started with. What did you want to happen? - Will
It also works for me, when converting a Java-variable back and forth. It doesn't when I work with the ResourceBundle. - Slumlord