Why can't I use \u000D and \u000A as CR and LF in Java?


Asked 5/10, 2010 at 17:36 Answered 2/7, 2021 at 2:16

Why can't I use \u000D and \u000A as CR and LF in Java? It's giving an error when I compile the code:

String x = "\u000A hello";//Error - Illegal escape character in string literal.

Ivanna answered 5/10, 2010 at 17:36 Comment(0)

121

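Unicode escapes are translated before the compiler tokenizes the source, so each one is replaced by the character it denotes before string literals and comments are even recognized. Therefore, if you put \u000A in a String literal like this: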

String someString = "foo\u000Abar";

It will be compiled exactly as if you wrote:

String someString = "foo
bar";

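A string literal cannot contain a raw line break, hence the error. Stick to \r (carriage return; 0x0D) and \n (line feed; 0x0A) inside string and character literals.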
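As a minimal sketch of spellings that do compile (the class and field names here are made up for illustration), the escape sequences and the numeric code points yield the same two characters:

```java
public class LineBreaks {
    // Escape sequences are resolved when the string literal itself is parsed,
    // so they are safe inside the quotes.
    static final String WITH_ESCAPES = "foo\r\nbar";

    // The same characters built from their numeric code points.
    static final String WITH_CASTS = "foo" + (char) 0x0D + (char) 0x0A + "bar";

    public static void main(String[] args) {
        System.out.println(WITH_ESCAPES.equals(WITH_CASTS)); // true
        System.out.println((int) WITH_ESCAPES.charAt(3));    // 13 (carriage return)
        System.out.println((int) WITH_ESCAPES.charAt(4));    // 10 (line feed)
    }
}
```

If the goal is the platform's own line ending rather than a specific character, System.lineSeparator() is another option.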

Bonus: You can always have fun with this, especially given the limitations on most syntax highlighters. Next time you've got a sec, try running this code:

public class FalseIsTrue {
    public static void main(String[] args) {
        if ( false == true ) { //these characters are magic: \u000a\u007d\u007b
            System.out.println("false is true!");
        }
    }
}
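To see why that prints, here is a rough sketch of what the compiler effectively sees once the three escapes in the comment have been translated (the class name and the run() helper are invented for this illustration). The line feed ends the comment, the closing brace ends the if body, and the opening brace starts an ordinary block that always executes:

```java
public class FalseIsTrueExpanded {
    static String run() {
        StringBuilder out = new StringBuilder();
        if (false == true) { // the comment stops here, at the line break from U+000A
        }                    // U+007D closed the if body, which is therefore empty
        {                    // U+007B opened a plain block that always runs
            out.append("false is true!");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // false is true!
    }
}
```

The comments above spell the code points as U+000A and so on deliberately: writing the backslash-u form inside a comment would trigger the same translation again.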

Polyclitus answered 5/10, 2010 at 17:40 Comment(8)

Great, and unexpected to me. But why, then, does \u0008 (backspace) not delete the preceding code? – Kutchins 22/11, 2011 at 12:56

@stacktracer: :-). If that's a serious question, I guess the answer would be that nothing forces the preprocessor to interpret that character as a command to erase a previous character. Eclipse's console ignores it too. Just like putting a \u000d doesn't actually result in the computer's carriage returning! – Polyclitus 22/11, 2011 at 21:41

This is something that C# solved much more elegantly, allowing those escapes only for strings, chars and identifiers. That way you can still have non-ASCII identifiers if you cannot type them but you won't get a chance at mangling source code that way. And the preprocessing has to be done in the compiler anyway. The Java folks apparently didn't learn from trigraphs in C, which can have similar code-breaking properties. – Oestriol 22/10, 2012 at 5:27

@Joey: That does seem to be an improvement. However, as a result of their choices identifiers are more confusing. For example, // is a valid identifier iff you use unicode escapes. It's much easier to reason about how the preprocessor works in Java, but it's also much easier to shoot yourself in the foot. – Polyclitus 22/10, 2012 at 14:43

Mark, no, identifiers are still restricted to the same set of characters which are essentially all Unicode letters (not much different from Java, in fact). The grammar defines letter-character as either “A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl” or “A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl”. You can also use identifiers that collide with keywords by either using a Unicode escape or prefixing the identifier with @ (although that's more useful to code generators, usually). – Oestriol 22/10, 2012 at 15:44

@Joey: What am I experiencing here then? ideone.com/2tiB7B Is this because ideone uses Mono? I can declare a field "//" and then access it using Reflection using "/". – Polyclitus 22/10, 2012 at 16:29

Sounds like a bug in Mono where it doesn't adhere to the specification. Microsoft's own C# compiler (4.0.30319.17929) rejects that code with blah.cs(5,19): error CS1056: Unexpected character '\u002F' (a few times) and blah.cs(5,19): error CS1519: Invalid token '\u002F' in class, struct, or interface member declaration. Although I would guess that there isn't much C# code out there that tries to (ab)use this. – Oestriol 22/10, 2012 at 16:57

This seems like an obfuscated security breach waiting to happen... Seriously, how many Java programmers would know about this? And who on earth decided that it was a good idea to unescape unicode characters before parsing comments and string delimiters?? – Avra 5/2, 2019 at 9:11

Because they fall within the range of Unicode control characters, which is U+0000–U+001F and U+007F.

Unicode control characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation.

They can be written with backslash escape sequences such as \r and \n, as described in the answer above by @Mark.

From RFC 4627 (which describes JSON strings, not Java ones):

2.5. Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped.
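In Java itself the restriction appears to be narrower than "all control characters": a string literal may not contain a raw carriage return or line feed (nor an unescaped quote or backslash), but other control characters survive Unicode-escape translation. A small sketch, with a hypothetical class name, that should compile and run:

```java
public class ControlCharEscapes {
    // U+0009 (tab) written as a Unicode escape: it becomes a raw tab before
    // the literal is parsed, and a string literal may contain a raw tab.
    static final String TAB = "a\u0009b";

    // U+0008 (backspace) is accepted the same way.
    static final String BACKSPACE = "a\u0008b";

    public static void main(String[] args) {
        System.out.println(TAB.equals("a\tb"));       // true
        System.out.println(BACKSPACE.equals("a\bb")); // true
    }
}
```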

Uncovenanted answered 20/4, 2016 at 11:47 Comment(0)

It's better not to use them. Try the following code and you'll see:

public class Main {
    public static void main(String[] args) {
        String a = "Hello";
        // \u000d a="world";
        System.out.println(a); // prints "world"
        // \u000a a="hello world!";
        System.out.println(a); // prints "hello world!"
    }
}

Apostolic answered 2/7, 2021 at 2:16 Comment(0)
