Why does my Unicode string get corrupted when passed from a Java applet to JavaScript?

I'm pretty new, so don't be too harsh :)

Question (tl;dr)

I'm having a problem passing a Unicode string from a javax.swing.JApplet embedded in a web page to the JavaScript part. I'm not sure whether this is a bug or a misunderstanding of the technologies involved.

Problem

I want to pass a Unicode string from a Java applet to JavaScript, but the string gets messed up. Strangely, the problem doesn't occur in Internet Explorer 10, but it does in Chrome (v26) and Firefox (v20). I haven't tested other browsers, though.

The returned string seems to be okay, except for the last Unicode character. The results in the JavaScript debugger and on the web page are:

  • abc → abc
  • 表示 → 表��
  • ま → ま
  • ウォッチリスト → ウォッチリス��
  • アップロード → アップロー��
  • ホ → ��
  • ホ → ホ (Not deterministic)
  • アップロードabc → アップロードabc

The string seems to get corrupted in its last bytes. If it ends with an ASCII character, the string is okay. Also, the problem doesn't occur with every combination, and apparently not every time (I'm not sure about this). I therefore suspect a bug, and I'm afraid I might be posting an invalid question.

Test Set Up

A minimal setup includes an applet that returns some Unicode (UTF-8) strings:

/* TestApplet.java */
import javax.swing.*;

public class TestApplet extends JApplet {

    private String[] testStrings = {
            "abc",      // OK (because ASCII only)
            "表示",     // Error on last character
            "表示",     // Error on last character
            "ホーム ",  // OK (because of *space* after ム)
            "アップロード", ... };

    public TestApplet() {...}     // Applet-specific stuff

    ...

    public int getLength() { return testStrings.length; }

    // public, so that the JavaScript on the page can call it
    public String getTestString(int i) {
        return testStrings[i];    // Accessor instead of built-in array functionality, because of IE.
    }
}

The corresponding web page with JavaScript could look like this:

/* test.html */
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
        <span id="output"></span>
        <applet id="applet" archive="test.jar" code="TestApplet"></applet>

        <script type="text/javascript" charset="utf-8">
            var applet = document.getElementById("applet");
            var node = document.getElementById("output");
            for (var i = 0; i < applet.getLength(); i++) {
                var text = applet.getTestString(i);
                var paragraphNode = document.createElement("p");
                paragraphNode.innerHTML = text;
                node.appendChild(paragraphNode);
            }
        </script>
    </body>
</html>

Environment

I'm working on Windows 7 32-Bit with the current Java Version 1.7.0_21 using the "Next Generation Java Plug-in 10.21.2 for Mozilla browsers". I had some problems with my operating system locale, but I tried several (English, Japanese, Chinese) regional settings.

In the case of a corrupt string, Chrome shows invalid characters (e.g. ��). Firefox, on the other hand, drops the string completely if it would end with ��.

Internet Explorer manages to display the strings correctly.

Solutions?

I can imagine several workarounds, including escaping/unescaping and adding a "final char" which is then removed via JavaScript. Actually, I'm planning to write against Android's WebKit, and I haven't tested it there.

Since I would like to continue testing in Chrome (because of its WebKit technology and for convenience), I hope there is a trivial solution to the problem that I might have overlooked.
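The two workarounds mentioned above could be sketched on the JavaScript side roughly as follows. This is only a sketch under assumptions: the sentinel character `|`, the `\uXXXX` escape format and the function names `stripSentinel` and `decodeUnicodeEscapes` are hypothetical, and the applet would have to append the sentinel or produce the escapes itself before returning the string.

```javascript
// Hypothetical ASCII sentinel that the applet is assumed to append
// to every string it returns.
var SENTINEL = "|";

// Removes the trailing sentinel. Returns null if the sentinel is missing,
// so the caller can tell that the string did not arrive intact.
function stripSentinel(text) {
    if (typeof text !== "string" || text.length === 0) {
        return null;
    }
    if (text.charAt(text.length - 1) !== SENTINEL) {
        return null;
    }
    return text.substring(0, text.length - 1);
}

// Alternative: the applet is assumed to return ASCII-only text in which
// every non-ASCII UTF-16 code unit is written as \uXXXX (four hex digits).
// Note that a literal backslash-u sequence in the payload is decoded too.
function decodeUnicodeEscapes(text) {
    return text.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}
```

In the test page, the loop body would then use something like `stripSentinel(applet.getTestString(i))` instead of the raw return value. Both variants rely on the observation that strings ending in an ASCII character arrive intact.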

Viceroy answered 3/5, 2013 at 13:22 Comment(7)
I'm interested in what the real problem is. One idea I found is: make sure javac and/or jar uses UTF-8 encoding. If you don't specify it, it uses the machine default (which could be a problem). - Brodeur
Thanks! I'll try this later on. I want to point out that the data flow from JavaScript to the applet (calling parameter) works as expected. Only the return value gets messed up. - Viceroy
Absolutely. You showed/explained that it all works fine, except for the string returned in special cases (when the last character in the returned string is a Unicode character). I think you explained the situation very well and laid out everything in a very organized way :) - Brodeur
Please can you show the code that actually writes the string to send it to the browser? - Ranchero
As it's possibly a duplicate of Java not defaulting to UTF8 encoding for strings #81823 - Ranchero
@Ranchero is right. Can we see the code that actually sends the string to the browser? By default, Java Strings are UTF-16 encoded, and the code that sends to the browser needs to specify the encoding explicitly. - Plaid
@Ranchero and VikrantY, I'm not sure I understand your request completely. I'm relying on the browser's built-in JavaScript-to-Java functionality. I guess this is something like the former LiveConnect (developer.mozilla.org/de/docs/JavaScript/Guide/…) for Chrome. The Java function, as shown in the example, is getTestString(int i). The JavaScript call is applet.getTestString(i). - Viceroy

If you are testing in Chrome/Firefox, please replace the first line with this and then test it:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

The doctype matters for how the browser identifies the page.

Transitional/loose is the type you can use with Unicode. Please test and reply.

Parthenia answered 8/5, 2013 at 9:10 Comment(2)
Thank you for your input! I have tried this, but still no luck. - Viceroy
Can you post the HTML of the page after it has been generated, or a link to the page (if it is live)? That will help further. - Parthenia

I suggest setting a breakpoint on

paragraphNode.innerHTML = text;

and inspecting text in the JavaScript console, e.g. with

console.log(escape(text));

or

console.log(encodeURIComponent(text));

or

for (var i = 0; i < text.length; i++) {
    console.log("i = " + i);
    console.log("text.charAt(i) = " + text.charAt(i)
        + ", text.charCodeAt(i) = " + text.charCodeAt(i));
}

See also

http://www.fileformat.info/info/unicode/char/30a6/index.htm

https://developer.mozilla.org/en-US/docs/DOM/window.escape (which is not part of any standard)

and

https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/encodeURIComponent

or similar resources.

Your source files may not be in the encoding you assume (UTF-8).

JavaScript assumes UTF-16 strings:

http://www.ecma-international.org/ecma-262/5.1/#sec-4.3.16

Java also assumes UTF-16:

http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
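To compare what arrives in JavaScript with what the applet was supposed to send, the UTF-16 code units can be dumped as hex. A minimal sketch, with `toCodeUnits` as a hypothetical helper name:

```javascript
// Returns the UTF-16 code units of a string as four-digit hex strings.
function toCodeUnits(text) {
    var units = [];
    for (var i = 0; i < text.length; i++) {
        var hex = text.charCodeAt(i).toString(16).toUpperCase();
        while (hex.length < 4) {
            hex = "0" + hex;   // left-pad to four digits
        }
        units.push(hex);
    }
    return units;
}

console.log(toCodeUnits("abc").join(" "));   // prints "0061 0062 0063"
```

If the last entries differ from the code units of the original Java string, the string was changed somewhere between the applet and the script.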

The Linux or Cygwin file command can show you the encoding of your files.

See

http://linux.die.net/man/1/file (haven't found a kernel.org man reference)

Jemmie answered 8/5, 2013 at 10:12 Comment(1)
Thank you very much for your elaborate answer! With the encodeURI function I was able to output the final "corrupt" bytes in Chrome: they all seem to end with %EF%BF%BD%EF%BF%BD%00. I'm not sure whether that is the real pattern, because Firefox doesn't show a corrupted string at all (it returns a string with a length of 0 in this case). Actually, I was able to solve the problem for my OS (see my embarrassing answer), but it still affects other locales ... Maybe the question remains valid with modifications. - Viceroy
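%EF%BF%BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character, so the tail reported above decodes to two replacement characters followed by a NUL. A small check like the following could flag such strings in Chrome before they reach the page. It is only a sketch: `firstCorruptIndex` is a hypothetical helper name, and the Firefox case (an empty string) would need a separate length check.

```javascript
// Returns the index of the first U+FFFD (replacement character) or
// U+0000 (NUL) code unit in the string, or -1 if there is none.
function firstCorruptIndex(text) {
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code === 0xFFFD || code === 0x0000) {
            return i;
        }
    }
    return -1;
}

// The tail observed in Chrome, decoded back into a JavaScript string.
var corruptTail = decodeURIComponent("%EF%BF%BD%EF%BF%BD%00");

console.log(firstCorruptIndex("abc"));                 // prints -1
console.log(firstCorruptIndex("abc" + corruptTail));   // prints 3
```

A legitimate string that really contains U+FFFD would be flagged as well, so this is a diagnostic aid rather than a fix.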

You need to make sure to add the following Java Argument to your applet/embed tag:

-Dfile.encoding=utf-8

i.e. java_arguments="-Dfile.encoding=utf-8"

Otherwise it is going to expect and treat the applet as ASCII text.

Gorge answered 31/5, 2013 at 10:1 Comment(0)

Okay, I'm a little bit embarrassed, because I thought I had tested this thoroughly: I was actually using a non-Latin locale (e.g. Chinese (PRC) or Japanese (Japan)) in the Windows system locale settings. When I changed back to English (USA) or German (Germany), everything worked as expected.

I'm still wondering why it would affect Chrome and Firefox in such a strange way, because Java and modern browsers should be Unicode-based, so I won't accept this as an answer! The problem reoccurs when I switch back to Japanese, and I'm going to test it on different systems.

I want to thank all the posters for the enlightening input, and I will keep putting effort into solving this question.

Viceroy answered 8/5, 2013 at 20:13 Comment(1)
Can't you solve escaping ascii chars? - Isma
