What are "connecting characters" in Java identifiers?
A

7

211

I am reading for SCJP and I have a question regarding this line:

Identifiers must start with a letter, a currency character ($), or a connecting character such as the underscore ( _ ). Identifiers cannot start with a number!

It states that a valid identifier name can start with a connecting character such as underscore. I thought underscores were the only valid option? What other connecting characters are there?

Ascetic answered 2/8, 2012 at 8:54 Comment(4)
Regarding "a currency character": UK visitors to this question may be surprised and interested to know that, consistent with being able to start with "a" currency character, Java identifiers can legally begin with the pound symbol (£).Jaenicke
Note that since Java 8, _ is a "deprecated" identifier. Specifically, the compiler emits the following warning: (use of '_' as an identifier might not be supported in releases after Java SE 8).Skepticism
@Skepticism Yup. Brian Goetz says they are "reclaiming" _ for use in future language features. Identifiers that start with an underscore are still okay, but a single underscore is an error if used as a lambda parameter name, and a warning everywhere else.Hastings
For the bytecode, any sequence that does not contain . ; [ / < > : is allowed: #26791704 docs.oracle.com/javase/specs/jvms/se7/html/… Everything else is a Java-only restriction.Griner
M
270

Here is a list of connecting characters. These are characters used to connect words.

http://www.fileformat.info/info/unicode/category/Pc/list.htm

U+005F _ LOW LINE
U+203F ‿ UNDERTIE
U+2040 ⁀ CHARACTER TIE
U+2054 ⁔ INVERTED UNDERTIE
U+FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D ﹍ DASHED LOW LINE
U+FE4E ﹎ CENTRELINE LOW LINE
U+FE4F ﹏ WAVY LOW LINE
U+FF3F ＿ FULLWIDTH LOW LINE

This compiles on Java 7.

int _, ‿, ⁀, ⁔, ︳, ︴, ﹍, ﹎, ﹏, ＿;

An example. In this case tp is the name of a column and the value for a given row.

Column<Double> ︴tp︴ = table.getColumn("tp", double.class);

double tp = row.getDouble(︴tp︴);

The following

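for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    // Non-alphabetic characters that may still start an identifier
    if (Character.isJavaIdentifierStart(i) && !Character.isAlphabetic(i))
        // toChars handles supplementary code points, which a (char) cast would truncate
        System.out.print(String.valueOf(Character.toChars(i)) + " ");
}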

prints

$ _ ¢ £ ¤ ¥ ؋ ৲ ৳ ৻ ૱ ௹ ฿ ៛ ‿ ⁀ ⁔ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ꠸ ﷼ ︳ ︴ ﹍ ﹎ ﹏ ﹩ ＄ ＿ ￠ ￡ ￥ ￦
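
The output above mixes currency symbols and connectors. As a rough sketch (the class and method names are made up for illustration, and the exact characters depend on the Unicode version your JDK ships), the same scan can be split by Unicode general category so the connecting characters stand apart:

```java
import java.util.ArrayList;
import java.util.List;

public class NonLetterIdentifierStarts {

    /** Collects non-alphabetic identifier-start code points of one general category. */
    static List<Integer> startsOfType(int generalCategory) {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isJavaIdentifierStart(cp)
                    && !Character.isAlphabetic(cp)
                    && Character.getType(cp) == generalCategory) {
                result.add(cp);
            }
        }
        return result;
    }

    /** Prints a label followed by the characters, separated by spaces. */
    static void print(String label, List<Integer> codePoints) {
        StringBuilder sb = new StringBuilder(label).append(':');
        for (int cp : codePoints) {
            sb.append(' ').appendCodePoint(cp);
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        print("currency symbols (Sc)", startsOfType(Character.CURRENCY_SYMBOL));
        print("connector punctuation (Pc)", startsOfType(Character.CONNECTOR_PUNCTUATION));
    }
}
```

Requires Java 7 or later because of Character.isAlphabetic.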

Malamud answered 2/8, 2012 at 8:59 Comment(6)
I am looking forward to the day when I inherit some code that uses these identifiers!Indubitable
BTW You can use any of the currency symbols as well. int ৲, ¤, ₪₪₪₪; :DMalamud
@GrahamBorland How about if( ⁀ ‿ ⁀ == ⁀ ⁔ ⁀) or if ($ == $) or if (¢ + ¢== ₡) or if (B + ︳!= ฿)Malamud
@FredOverflow It is the Drachma currency sign. No country uses it, but if the worst happens in Europe it may come back. en.wikipedia.org/wiki/Greek_drachmaMalamud
Scalaz uses stuff like KleisliArrow[M[]: Monad]: Arrow[({type λ[α, β]=Kleisli[M, α, β]})#λ] = new Arrow[({type λ[α, β]=Kleisli[M, α, β]})#λ] and ☆(f() η) all the time.Gladysglagolitic
Try checking isJavaIdentifierPart instead of isJavaIdentifierStart. It's much more fun!Deflocculate
D
25

Iterate through the whole 65k chars and ask Character.isJavaIdentifierStart(c). The answer is: "undertie" decimal 8255
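
Note that a 65k scan covers only the Basic Multilingual Plane. A minimal sketch, assuming a hypothetical helper named countStarts, that counts identifier-start code points for both the BMP and the full Unicode range (the totals vary with the JDK's Unicode version, so none are quoted here):

```java
public class IdentifierStartCount {

    /** Counts code points in [from, to] that may start a Java identifier. */
    static int countStarts(int from, int to) {
        int count = 0;
        for (int cp = from; cp <= to; cp++) {
            if (Character.isJavaIdentifierStart(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Basic Multilingual Plane only (what a char loop can reach)
        System.out.println("BMP:  " + countStarts(0, 0xFFFF));
        // Every code point, including supplementary planes
        System.out.println("All:  " + countStarts(Character.MIN_CODE_POINT,
                                                  Character.MAX_CODE_POINT));
    }
}
```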

Drugget answered 2/8, 2012 at 8:57 Comment(2)
I couldn't resist (in Scala): (1 to 65535).map(_.toChar).filter(Character.isJavaIdentifierStart).size - yields 48529 characters...Sublime
Total count = 90648, but I'm going to Character.MAX_CODE_POINT, which is probably more than 2<<16.Everlasting
E
7

The definitive specification of a legal Java identifier can be found in the Java Language Specification.
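
To make that concrete, here is a hedged sketch of a lexical check built on Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. The method name looksLikeIdentifier is invented for this example, and it deliberately does not reject keywords or literals such as class or true, which the language also disallows as identifiers:

```java
public class IdentifierCheck {

    /**
     * Returns true if s is lexically shaped like a Java identifier.
     * Keywords and literals such as "class" or "true" are NOT rejected here.
     */
    static boolean looksLikeIdentifier(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so supplementary characters are handled correctly
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"_value", "$cost", "\u203Ftie", "2fast", "a-b"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + looksLikeIdentifier(sample));
        }
    }
}
```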

Exalted answered 2/8, 2012 at 8:59 Comment(4)
I'm not sure that actually fully answers the (implied) question of which characters may start a Java identifier. Following links we end up at Character.isJavaIdentifierStart() which states A character may start a Java identifier if and only if one of the following conditions is true: ... ch is a currency symbol (such as "$"); ch is a connecting punctuation character (such as "_").Alegar
It seems that the specification leaves the final list of acceptable characters up to the implementation, so it could potentially be different for everybody.Exalted
@GregHewgill That'd be foolish, considering how tightly specified everything else is. I think that these are actual Unicode character classes, which are defined (where else?) in the Unicode standard. isJavaIdentifierStart() mentions getType(), and currency symbol and connector punctuation are both also types that can be returned by that function, so the lists might be given there. "General category" is in fact a specific term in the Unicode standard. So the valid values would be L [all], Nl, Sc, Pc.Trull
@GregHewgill is correct. The specification is short and clear, and it's defined by Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The End. The key thing to remember is that Unicode is evolving; don't fall into the trap of thinking of character sets as finished (Latin is a terrible example; ignore it). Characters are created all the time. Ask your Japanese friends. Expect legal java identifiers to change over time - and that's intentional. The point is to let people write code in human languages. That leads to a hard requirement for allowing change.Gladysglagolitic
M
6

Here is a list of connector characters in Unicode. You will not find most of them on your keyboard.

U+005F LOW LINE _
U+203F UNDERTIE ‿
U+2040 CHARACTER TIE ⁀
U+2054 INVERTED UNDERTIE ⁔
U+FE33 PRESENTATION FORM FOR VERTICAL LOW LINE ︳
U+FE34 PRESENTATION FORM FOR VERTICAL WAVY LOW LINE ︴
U+FE4D DASHED LOW LINE ﹍
U+FE4E CENTRELINE LOW LINE ﹎
U+FE4F WAVY LOW LINE ﹏
U+FF3F FULLWIDTH LOW LINE ＿

Mitzvah answered 2/8, 2012 at 8:59 Comment(1)
I don't know what keyboard layout you're using, but I can certainly type _ (U+005F) easily enough :)Just
C
4

A connecting character is used to connect two characters.

In Java, a connecting character is the one for which Character.getType(int codePoint)/Character.getType(char ch) returns a value equal to Character.CONNECTOR_PUNCTUATION.

Note that in Java, the character information is based on the Unicode standard, which identifies connecting characters by assigning them the general category Pc, an alias for Connector_Punctuation.

The following code snippet,

for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    if (Character.getType(i) == Character.CONNECTOR_PUNCTUATION
            && Character.isJavaIdentifierStart(i)) {
        System.out.println("character: " + String.valueOf(Character.toChars(i))
                + ", codepoint: " + i + ", hexcode: " + Integer.toHexString(i));
    }
}

prints the connecting characters that can be used to start an identifier on jdk1.6.0_45:

character: _, codepoint: 95, hexcode: 5f
character: ‿, codepoint: 8255, hexcode: 203f
character: ⁀, codepoint: 8256, hexcode: 2040
character: ⁔, codepoint: 8276, hexcode: 2054
character: ・, codepoint: 12539, hexcode: 30fb
character: ︳, codepoint: 65075, hexcode: fe33
character: ︴, codepoint: 65076, hexcode: fe34
character: ﹍, codepoint: 65101, hexcode: fe4d
character: ﹎, codepoint: 65102, hexcode: fe4e
character: ﹏, codepoint: 65103, hexcode: fe4f
character: ＿, codepoint: 65343, hexcode: ff3f
character: ･, codepoint: 65381, hexcode: ff65

The following compiles on jdk1.6.0_45,

int _, ‿, ⁀, ⁔, ・, ︳, ︴, ﹍, ﹎, ﹏, ＿, ･ = 0;

Apparently, the above declaration fails to compile on jdk1.7.0_80 & jdk1.8.0_51 for the following two connecting characters (backward compatibility...oops!!!),

character: ・, codepoint: 12539, hexcode: 30fb
character: ･, codepoint: 65381, hexcode: ff65

Anyway, details aside, the exam focuses only on the Basic Latin character set.

Also, for legal identifiers in Java, the spec is provided here. Use the Character class APIs to get more details.

Crowd answered 18/8, 2015 at 7:10 Comment(0)
A
2

One of the most, well, fun characters allowed in Java identifiers (though not at the start) is the Unicode character named "Zero Width Non Joiner" (&zwnj;, U+200C, https://en.wikipedia.org/wiki/Zero-width_non-joiner).

I had this once in a piece of XML inside an attribute value holding a reference to another piece of that XML. Since the ZWNJ is "zero width" it cannot be seen (except when walking along with the cursor, it is displayed right on the character before). It also couldn't be seen in the logfile and/or console output. But it was there all the time: copy & paste into search fields got it and thus did not find the referred position. Typing the (visible part of the) string into the search field however found the referred position. Took me a while to figure this out.

Typing a Zero-Width-Non-Joiner is actually quite easy (too easy) when using the European keyboard layout, at least in its German variant, e.g. "Europatastatur 2.02" - it is reachable with AltGr + ".", two keys which unfortunately are located directly next to each other on most keyboards and can easily be hit together accidentally.

Back to Java: I thought well, you could write some code like this:

void foo() {
    int i = 1;
    int i‌ = 2;
}

with the second i followed by a zero-width non-joiner (I can't reproduce that in the above code snippet in Stack Overflow's editor), but that didn't work. IntelliJ (16.3.3) did not complain, but javac (Java 8) complained about an already defined identifier. It seems javac allows the ZWNJ character as part of an identifier, but when I used reflection to see what it does, the ZWNJ character was stripped from the identifier, which does not happen to characters like ‿.
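
A possible explanation, offered as an assumption rather than something the answer confirms: U+200C has the Unicode general category Cf (format), and Character.isIdentifierIgnorable reports such characters as ignorable inside identifiers. The sketch below only queries the Character API; stripIgnorable is a hypothetical helper that illustrates the idea and is not how javac is implemented.

```java
public class ZwnjCheck {
    static final int ZWNJ = 0x200C; // ZERO WIDTH NON-JOINER

    /** Removes code points that Character.isIdentifierIgnorable reports as ignorable. */
    static String stripIgnorable(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isIdentifierIgnorable(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("start:     " + Character.isJavaIdentifierStart(ZWNJ));
        System.out.println("part:      " + Character.isJavaIdentifierPart(ZWNJ));
        System.out.println("ignorable: " + Character.isIdentifierIgnorable(ZWNJ));
        // "i" followed by ZWNJ collapses to plain "i" once ignorable characters are dropped
        System.out.println("same name: " + stripIgnorable("i\u200C").equals("i"));
    }
}
```

The undertie (‿) is connector punctuation rather than a format character, so the same helper leaves it in place.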

Adherence answered 9/2, 2017 at 8:37 Comment(0)
D
0

The list of characters you can use inside your identifiers (rather than just at the start) is much more fun:

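for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++)
    if (Character.isJavaIdentifierPart(i) && !Character.isAlphabetic(i))
        // toChars handles supplementary code points, which a (char) cast would truncate
        System.out.print(String.valueOf(Character.toChars(i)) + " ");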

The list is:

I wanted to post the output, but it's forbidden by the SO spam filter. That's how fun it is!

It includes most of the control characters! I mean bells and stuff! You can make your source code ring the fn bell! Or use characters which will only be displayed sometimes, like the soft hyphen.
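
For anyone who wants to check specific characters without printing the whole list, here is a small sketch (the class and the isIgnorablePart helper are invented for illustration). It tests the bell, DEL, and the soft hyphen, all of which the Character API treats as "ignorable" identifier parts:

```java
public class IgnorableParts {

    /** True if the code point is accepted inside an identifier and is flagged as ignorable. */
    static boolean isIgnorablePart(int codePoint) {
        return Character.isJavaIdentifierPart(codePoint)
                && Character.isIdentifierIgnorable(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x07, 0x7F, 0xAD}; // BEL, DEL, SOFT HYPHEN
        for (int cp : samples) {
            System.out.printf("U+%04X part=%b ignorable=%b%n",
                    cp,
                    Character.isJavaIdentifierPart(cp),
                    Character.isIdentifierIgnorable(cp));
        }
    }
}
```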

Deflocculate answered 2/6, 2016 at 19:45 Comment(1)
It includes \u007f, the DEL character. :-(Babs
