Compiling (javac) a UTF-8 encoded Java source file with a BOM

3

25

Hello and thank you for reading my post.

My problem is the following: I want to compile, with "javac", a Java source file that is UTF-8 encoded with a BOM (the OS is WinXP).

Below is what I do:

1) Create a file with "Notepad" and choose the UTF-8 encoding

dos> notepad Test.java
"File -> Save as..."
File name   : Test.java
Save as type: All Files
Encoding    : UTF-8
Save

2) Create a Java class in that file and save the file as in 1)

public class Test
{
    public static void main(String [] args)
    {
        System.out.println("This is a test.");
    }
}

3) Inspect the hexadecimal dump of the file (first line)

dos> xxd Test.java | head -1
0000000: efbb bf70 7562 6c69 6320 636c 6173 7320  ...public class

Note: ef bb bf is the UTF-8 encoded BOM (the UTF-16 encoded BOM being FE FF in big-endian and FF FE in little-endian).

4) Try to compile this code with "javac"

dos> javac -encoding utf8 Test.java
Test.java:1: illegal character: \65279
?public class Test
^
1 error

Note: 65279 is the decimal value of U+FEFF, the code point of the BOM.
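
The two notes above can be checked with a few lines of Java. This is only a small sketch: the class name BomBytes and the helper hex are made up for illustration, and only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class BomBytes {
    // U+FEFF is the code point used as the byte order mark.
    static final char BOM = '\uFEFF';

    // Hypothetical helper: formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = String.valueOf(BOM);
        System.out.println((int) BOM);                                    // 65279
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // ef bb bf
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // fe ff
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // ff fe
    }
}
```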

My question is the following: how can I make this compile while:

  • keeping it UTF-8 encoded
  • and keeping the BOM?

Thank you for helping and best regards.

Léa

Crony answered 21/3, 2012 at 19:17 Comment(4)
That’s right: you have to remove the BOM. It has no business in UTF-8, so of course it is an error. This is a long-standing Microsoft bug. Never ever put a BOM in UTF-8!!!!! - Convoy
Hello. Thank you for your answer. I used "Notepad++" to encode the file as "UTF8 without BOM". Compiling the code with "javac" now works. - Killdeer
@Convoy The Unicode Standard (page 30) allows for a BOM in UTF-8 so you have every right to put it there if you so wish. Why you'd want to is another story, but javac should handle it. - Chesty
Possible duplicate of How to compile a java source file which is encoded as "UTF-8"? - Fluoroscopy
37

Trim the BOM and then use javac -encoding utf8 x.java
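
If you would rather not depend on an editor, the trimming step could be done with a small program like the one below. It is only a sketch: StripBom, hasUtf8Bom and stripUtf8Bom are names I made up, and the program overwrites the file in place, so keep a backup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {
    // True if the data starts with the UTF-8 encoded BOM (EF BB BF).
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns a copy without the leading BOM, or the same array if there is none.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StripBom <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        byte[] original = Files.readAllBytes(file);
        byte[] stripped = stripUtf8Bom(original);
        if (stripped != original) {
            Files.write(file, stripped); // overwrites the file in place
        }
    }
}
```

Usage would be something like "java StripBom Test.java" followed by "javac -encoding utf8 Test.java".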

Spermatozoon answered 3/2, 2013 at 13:3 Comment(4)
This solved my javac compile problem, but the Windows 10 console still shows unknown characters like "???????????". - Enamour
As far as I understand, chcp 65001 should help you with the console. - Spermatozoon
I tried this too, but the issue is not resolved: the plain question marks "?????" turned into boxed question marks. The Windows console still does not recognize the text. Here it displays correctly: लोकसभा के चुनावी रण में सत्तारूढ़ भाजपा की ओर से सिर्फ नरेन्द्र मोदी ही दिखाई दे रहे हैं। - Enamour
This is what I haven't been able to solve for at least 3 months. Thanks, Stack Overflow! - Burk
20

This isn't a problem with your text editor, it's a problem with javac! The Unicode spec says a BOM is optional in UTF-8; it doesn't say it's forbidden! If a BOM can be there, then javac HAS to handle it, but it doesn't. Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-coded file from a Unicode-coded file.

The proposed solution of removing the BOM is only a workaround and not the proper solution.

This bug report indicates that this "problem" will never be fixed: https://web.archive.org/web/20160506002035/http://bugs.java.com/view_bug.do?bug_id=4508058

Since this thread is in the top 2 Google results for the "javac BOM" search, I'm leaving this here for future readers.

Pharisee answered 20/1, 2015 at 10:45 Comment(1)
Actually, the bug you reference has to do with the UTF-8 decoder; it has nothing to do with whether the compiler can be altered to detect and discard any BOM on a Java source file, which it can, and should. - Tbar
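
To illustrate the "detect and discard" idea on the reading side: Java's UTF-8 decoder hands a leading BOM through as the character U+FEFF, so a reader can simply drop it. The following is a rough sketch, not what javac does internally; BomSkippingReader and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    // Wraps a UTF-8 stream in a Reader that drops one leading U+FEFF, if present.
    static Reader open(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    // Reads everything that is left in the reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Convenience wrapper: decodes UTF-8 bytes and discards a leading BOM.
    static String decodeSkippingBom(byte[] utf8) {
        try (Reader reader = open(new ByteArrayInputStream(utf8))) {
            return readAll(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = "\uFEFFpublic class Test {}".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeSkippingBom(source)); // public class Test {}
    }
}
```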
-1

https://mcmap.net/q/526462/-compiling-javac-a-utf8-encoded-java-source-code-with-a-bom

Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-coded file from a Unicode-coded file.

Actually

  • The BOM is not about distinguishing ANSI from Unicode. Do not use a feature for a purpose it was not designed for.

  • UTF-8 was intentionally designed to be backward-compatible with ANSI, so a lot of code written to process formatted text (XML, JSON, etc.) that relies only on bytes 0..127 should work correctly with UTF-8 encoded text without any modification.

Sacellum answered 9/7, 2019 at 22:16 Comment(2)
Note: it is byte-level compatibility only; character-level calculations become wrong when UTF-8 is used in place of ANSI. - Posen
UTF-8 is backward-compatible only with ASCII (the 7-bit range, 0x00 - 0x7F), not with ANSI (an ASCII extension that also defines characters in the 8-bit range, 0x80 - 0xFF, which is not compatible with UTF-8). Yes, a BOM in a UTF-8 file serves to distinguish it from an ANSI (or OEM, ...) file. - Melody
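
The two comments above can be seen in a short Java sketch. It assumes ISO-8859-1 as a stand-in for an "ANSI" code page (Windows-1252 would behave the same way for this example), and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiVsAnsi {
    // True if the text has identical bytes in ISO-8859-1 and in UTF-8.
    static boolean sameBytes(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.ISO_8859_1),
                text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes the text as ISO-8859-1 and then (wrongly) decodes it as UTF-8.
    static String misreadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "Test";
        String accented = "L\u00e9a"; // "Léa", written with an escape so the source stays ASCII

        System.out.println(sameBytes(ascii));    // true: 7-bit text is byte-compatible
        System.out.println(sameBytes(accented)); // false: 0xE9 versus 0xC3 0xA9

        // Byte counts differ, so length calculations based on bytes go wrong.
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length); // 3
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);      // 4

        // The lone 0xE9 byte is not valid UTF-8 and decodes to U+FFFD.
        System.out.println(misreadAsUtf8(accented).indexOf('\uFFFD') >= 0); // true
    }
}
```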
