Read and write a file which has characters in UTF-8 (different language)
I have a file which has characters like: " Joh 1:1 ஆதியிலே வார்த்தை இருந்தது, அந்த வார்த்தை தேவனிடத்திலிருந்தது, அந்த வார்த்தை தேவனாயிருந்தது. "

www.unicode.org/charts/PDF/U0B80.pdf

When I use the following code:

bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, "UTF8"));

The output is boxes and other weird characters like this:

"�P�^����O֛���;�<�aYՠ؛"

Can anyone help?

This is the complete code:

File f = new File("E:\\bible.docx");
Reader decoded = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8);
BufferedWriter bufferedWriter = new BufferedWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
char[] buffer = new char[1024];
int n;
StringBuilder build = new StringBuilder();
while (true) {
    n = decoded.read(buffer);
    if (n < 0) { break; }
    build.append(buffer, 0, n);
    bufferedWriter.write(buffer, 0, n); // write only the n chars just read, not the whole buffer
}
bufferedWriter.flush();


The StringBuilder value shows the UTF-8 characters, but when they are displayed in the window they show as boxes.

Found the answer to the problem! The encoding is correct (i.e. UTF-8): Java reads the file as UTF-8 and the String characters are correct. The problem is that there is no font to display them in NetBeans' output panel. After changing the font for the output panel (NetBeans → Tools → Options → Miscellaneous → Output tab) I got the expected result. The same applies when the text is displayed in a JTextArea (the font needs to be changed). But we can't change the font of the Windows cmd prompt.
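Independent of the IDE font, it also helps to make sure standard output itself encodes UTF-8 rather than the platform default. A minimal sketch (the Tamil sample text is taken from the question; the helper name is illustrative):

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wrap an OutputStream in a PrintStream that always encodes UTF-8,
    // regardless of the platform default charset.
    static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        out.println("ஆதியிலே வார்த்தை இருந்தது"); // Tamil sample from the question
    }
}
```

Whether the glyphs then render correctly still depends on the console's own encoding and font, as noted above.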

Tsana answered 1/8, 2013 at 3:56 Comment(6)
How do you read the file? do you have the code you use for reading?Preussen
You're providing the charset name as a string literal. The name, according to the documentation, is "UTF-8".Theone
Verify in a debugger that the strings contain the Unicode characters you expect. Then verify that the output device you use supports UTF-8.Aday
Show the code where you read the data.Town
To read a docx file, you need a docx reader. You cannot read it as if it were plain text. The problem is not the language, it is the file format.Predestinate
Found the Answer to the problem; The Encoding is Correct (i.e UTF-8)Tsana
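As one comment notes, a .docx file is not plain text: it is a ZIP archive whose main text lives in word/document.xml, so decoding the raw bytes as UTF-8 yields garbage. A sketch that illustrates the point using only the standard library (a real extractor such as Apache POI's XWPFWordExtractor would be the proper tool; readAllBytes assumes Java 9+):

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxIsZip {
    // A .docx file is a ZIP archive; the document text is stored as XML
    // inside the entry "word/document.xml".
    static String readDocumentXml(File docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("not a .docx file: " + docx);
            }
            return new String(zip.getInputStream(entry).readAllBytes(),
                              StandardCharsets.UTF_8);
        }
    }
}
```

The returned string is still XML markup, not clean prose, which is why a dedicated docx library is preferable for real use.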
Because your output is encoded in UTF-8, but still contains the replacement character (U+FFFD, �), I believe the problem occurs when you read the data.

Make sure that you know which encoding your input stream uses, and set the encoding for the InputStreamReader accordingly. If the text is Tamil, I would guess it's probably in UTF-8. I don't know if Java supports TACE-16. It would look something like this…

StringBuilder text = new StringBuilder();
try (InputStream encoded = ...) {
  Reader decoded = new InputStreamReader(encoded, StandardCharsets.UTF_8);
  char[] buffer = new char[1024];
  while (true) {
    int n = decoded.read(buffer);
    if (n < 0)
      break;
    text.append(buffer, 0, n);
  }
}
String verse = text.toString();
Town answered 1/8, 2013 at 4:9 Comment(8)
@Theone If you mean UTF8 instead of UTF-8, no. UTF8 is an alias for the UTF-8 encoding. If the encoding isn't found, most APIs will throw an UnsupportedEncodingExceptionTown
Got it. Thanks. I have no business answering Java questions anyway.Theone
File f=new File("E:\\bible.docx"); Reader decoded=new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8); bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, StandardCharsets.UTF_8)); char[] buffer = new char[1024]; int n; StringBuilder build=new StringBuilder(); while(true){ n=decoded.read(buffer); if(n<0){break;} build.append(buffer,0,n); bufferedWriter.write(buffer); }Tsana
@Tsana The easiest way to see if the input decoding is correct is to look at the decoded characters in memory with a debugger. If you aren't familiar with your debugger, you could print the numeric value of some of the characters. They should be in the range 0x0B80-0x0BFFTown
Also, are you sure the input is UTF-8 encoded? That was a guess on my part. I'm not familiar with the encodings used for Tamil. Is the document actually Microsoft Word's XML format? If so, what encoding is specified in the XML?Town
char array has the UTF chararcters in it,Tsana
I can copy exactly from the input file to an output file, but I couldn't display the characters on the system stream (System.out), either in NetBeans or in the Command Prompt. I don't know why.Tsana
If that's the case, then it was probably just your console settings.Town
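The debugging tip from the comments — checking that the decoded characters fall in the Tamil block U+0B80–U+0BFF — can be sketched as follows (the helper name is illustrative):

```java
public class TamilRangeCheck {
    // Returns true if every non-ASCII char lies in the Tamil block
    // U+0B80–U+0BFF; correctly decoded Tamil text should pass, while
    // mis-decoded text (e.g. full of U+FFFD replacement chars) will not.
    static boolean looksLikeTamil(String s) {
        for (char c : s.toCharArray()) {
            if (c > 0x7F && (c < 0x0B80 || c > 0x0BFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTamil("ஆதியிலே வார்த்தை")); // expected: true
        System.out.println(looksLikeTamil("\uFFFD"));            // expected: false
    }
}
```

If this check fails on the in-memory string, the problem is in decoding the input; if it passes, the problem is in the console's output encoding or font.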
System.out is too close to the operating system to be versatile enough. In your case, the NetBeans console is probably using the operating system's encoding and an IDE-picked font.

Write to a file first. If you make it HTML, you can even double-click it, and specify the right encoding inside the document. Mind using "UTF-8" then, as "UTF8" is Java-specific ("UTF-8" can be used in Java too). Maybe with Desktop.getDesktop().open(new File("... .html"));.

A small JFrame with a JTextPane would do too.
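A sketch of the HTML approach, assuming a writable temp directory (the class and method names are illustrative):

```java
import java.awt.Desktop;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class HtmlPreview {
    // Write the text into a small HTML page that declares its own encoding,
    // so any browser decodes it correctly regardless of OS defaults.
    static File writeHtml(String text) throws IOException {
        File html = File.createTempFile("verse", ".html");
        try (Writer w = new OutputStreamWriter(new FileOutputStream(html),
                                               StandardCharsets.UTF_8)) {
            w.write("<meta charset=\"UTF-8\"><p>" + text + "</p>");
        }
        return html;
    }

    public static void main(String[] args) throws IOException {
        File page = writeHtml("ஆதியிலே வார்த்தை இருந்தது");
        if (Desktop.isDesktopSupported()) {
            Desktop.getDesktop().open(page); // opens in the default browser
        }
    }
}
```

The meta charset declaration is what lets the browser pick the right decoding without any user intervention.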

Overwinter answered 1/8, 2013 at 13:3 Comment(0)
It turns out that my file was encoded in UTF-16, so I just used UTF-16 instead of UTF-8. By doing that I was able to print Tamil text in the Eclipse console.

Sharynshashlik answered 24/11, 2015 at 15:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.