Read and write a file which has characters in UTF-8 (different language)
I have a file which has characters like: " Joh 1:1 ஆதியிலே வார்த்தை இருந்தது, அந்த வார்த்தை தேவனிடத்திலிருந்தது, அந்த வார்த்தை தேவனாயிருந்தது. "

www.unicode.org/charts/PDF/U0B80.pdf

When I use the following code:

bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, "UTF8"));

The output is boxes and other weird characters like this:

"�P�^����O֛���;�<�aYՠ؛"

Can anyone help?

This is the complete code:

File f = new File("E:\\bible.docx");
Reader decoded = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8);
BufferedWriter bufferedWriter = new BufferedWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
char[] buffer = new char[1024];
int n;
StringBuilder build = new StringBuilder();
while (true) {
    n = decoded.read(buffer);
    if (n < 0) { break; }
    build.append(buffer, 0, n);
    bufferedWriter.write(buffer, 0, n); // write only the n chars just read, not the whole buffer
}
bufferedWriter.flush();


The StringBuilder value shows the UTF-8 characters, but when they are displayed in the window they show as boxes.

Found the answer to the problem! The encoding is correct (i.e. UTF-8): Java reads the file as UTF-8 and the String characters are correct. The problem is that there is no font to display them in NetBeans' output panel. After changing the font for the output panel (NetBeans → Tools → Options → Miscellaneous → Output tab) I got the expected result. The same applies when the text is displayed in a JTextArea (the font needs to be changed). But we can't change the font of the Windows cmd prompt.
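Independent of the IDE font, it also helps to make sure standard output itself encodes UTF-8 rather than the platform default. A minimal sketch (the Tamil sample text is taken from the question; the helper name is illustrative):

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wrap an OutputStream in a PrintStream that always encodes UTF-8,
    // regardless of the platform default charset.
    static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        out.println("ஆதியிலே வார்த்தை இருந்தது"); // Tamil sample from the question
    }
}
```

Whether the glyphs then render correctly still depends on the console's own encoding and font, as noted above.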

Tsana answered 1/8, 2013 at 3:56 Comment(6)
How do you read the file? do you have the code you use for reading?Preussen
You're providing the charset name as a string literal. The name, according to the documentation, is "UTF-8".Theone
Verify in a debugger that the strings contain the Unicode characters you expect. Then verify that the output device you use supports UTF-8.Aday
Show the code where you read the data.Town
To read a docx file, you need a docx reader. You cannot read it as if it were plain text. The problem is not the language, it is the file format.Predestinate
Found the Answer to the problem; The Encoding is Correct (i.e UTF-8)Tsana
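As one comment notes, a .docx file is not plain text: it is a ZIP archive whose main text lives in word/document.xml, so decoding the raw bytes as UTF-8 yields garbage. A sketch that illustrates the point using only the standard library (a real extractor such as Apache POI's XWPFWordExtractor would be the proper tool; readAllBytes assumes Java 9+):

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxIsZip {
    // A .docx file is a ZIP archive; the document text is stored as XML
    // inside the entry "word/document.xml".
    static String readDocumentXml(File docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("not a .docx file: " + docx);
            }
            return new String(zip.getInputStream(entry).readAllBytes(),
                              StandardCharsets.UTF_8);
        }
    }
}
```

The returned string is still XML markup, not clean prose, which is why a dedicated docx library is preferable for real use.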
Because your output is encoded in UTF-8, but still contains the replacement character (U+FFFD, �), I believe the problem occurs when you read the data.

Make sure that you know which encoding your input stream uses, and set the encoding for the InputStreamReader accordingly. If the text is Tamil, I would guess it's probably in UTF-8. I don't know if Java supports TACE-16. It would look something like this…

StringBuilder text = new StringBuilder();
try (InputStream encoded = ...) {
  Reader decoded = new InputStreamReader(encoded, StandardCharsets.UTF_8);
  char[] buffer = new char[1024];
  while (true) {
    int n = decoded.read(buffer);
    if (n < 0)
      break;
    text.append(buffer, 0, n);
  }
}
String verse = text.toString();
Town answered 1/8, 2013 at 4:9 Comment(8)
@Theone If you mean UTF8 instead of UTF-8, no. UTF8 is an alias for the UTF-8 encoding. If the encoding isn't found, most APIs will throw an UnsupportedEncodingExceptionTown
Got it. Thanks. I have no business answering Java questions anyway.Theone
File f=new File("E:\\bible.docx"); Reader decoded=new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8); bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, StandardCharsets.UTF_8)); char[] buffer = new char[1024]; int n; StringBuilder build=new StringBuilder(); while(true){ n=decoded.read(buffer); if(n<0){break;} build.append(buffer,0,n); bufferedWriter.write(buffer); }Tsana
@Tsana The easiest way to see if the input decoding is correct is to look at the decoded characters in memory with a debugger. If you aren't familiar with your debugger, you could print the numeric value of some of the characters. They should be in the range 0x0B80-0x0BFFTown
Also, are you sure the input is UTF-8 encoded? That was a guess on my part. I'm not familiar with the encodings used for Tamil. Is the document actually Microsoft Word's XML format? If so, what encoding is specified in the XML?Town
char array has the UTF chararcters in it,Tsana
I can copy exactly from the input file to an output file, but I couldn't display the characters on the system stream (System.out), either in NetBeans or in the Command Prompt. I don't know why.Tsana
If that's the case, then it was probably just your console settings.Town
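The debugging tip from the comments — checking that the decoded characters fall in the Tamil block U+0B80–U+0BFF — can be sketched as follows (the helper name is illustrative):

```java
public class TamilRangeCheck {
    // Returns true if every non-ASCII char lies in the Tamil block
    // U+0B80–U+0BFF; correctly decoded Tamil text should pass, while
    // mis-decoded text (e.g. full of U+FFFD replacement chars) will not.
    static boolean looksLikeTamil(String s) {
        for (char c : s.toCharArray()) {
            if (c > 0x7F && (c < 0x0B80 || c > 0x0BFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTamil("ஆதியிலே வார்த்தை")); // expected: true
        System.out.println(looksLikeTamil("\uFFFD"));            // expected: false
    }
}
```

If this check fails on the in-memory string, the problem is in decoding the input; if it passes, the problem is in the console's output encoding or font.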
System.out is too close to the operating system to be versatile enough. In your case, the NetBeans console is probably using the operating system's encoding and an IDE-picked font.

Write to a file first. If you make it HTML, you can even double-click it, and specify the right encoding inside the document. Mind using "UTF-8" then, as "UTF8" is Java-specific ("UTF-8" can be used in Java too). Maybe with Desktop.getDesktop().open(new File("... .html"));.

A small JFrame with a JTextPane would do too.
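A sketch of the HTML approach, assuming a writable temp directory (the class and method names are illustrative):

```java
import java.awt.Desktop;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class HtmlPreview {
    // Write the text into a small HTML page that declares its own encoding,
    // so any browser decodes it correctly regardless of OS defaults.
    static File writeHtml(String text) throws IOException {
        File html = File.createTempFile("verse", ".html");
        try (Writer w = new OutputStreamWriter(new FileOutputStream(html),
                                               StandardCharsets.UTF_8)) {
            w.write("<meta charset=\"UTF-8\"><p>" + text + "</p>");
        }
        return html;
    }

    public static void main(String[] args) throws IOException {
        File page = writeHtml("ஆதியிலே வார்த்தை இருந்தது");
        if (Desktop.isDesktopSupported()) {
            Desktop.getDesktop().open(page); // opens in the default browser
        }
    }
}
```

The meta charset declaration is what lets the browser pick the right decoding without any user intervention.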

Overwinter answered 1/8, 2013 at 13:3 Comment(0)
It turns out that my file was encoded in UTF-16, so I just used UTF-16 instead of UTF-8. By doing that I was able to print Tamil text in the Eclipse console.

Sharynshashlik answered 24/11, 2015 at 15:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.