Java Unicode: where to find example N-byte Unicode characters

P

4

6

I'm looking for sample 1-byte, 2-byte, 3-byte, 4-byte, 5-byte, and 6-byte unicode characters. Any links to some sort of reference of all the different unicode characters out there and how big they are (byte-wise) would be greatly appreciated. I'm hoping this reference also has code points like \uXXXXX.

Polemoniaceous answered 19/5, 2011 at 18:23 Comment(0)

A

3

Check this out: http://en.wikipedia.org/wiki/List_of_Unicode_characters.
Also this: http://www.unicode.org/charts/.

Autophyte answered 19/5, 2011 at 18:30 Comment(5)

These don't tell me how many bytes those code points represent. Where do I find this? – Polemoniaceous 19/5, 2011 at 18:39

@Mohamed: look at the UTF-8, Design section article on Wikipedia. It will give you a correspondence between the Unicode codepoint value and its length in UTF-8 representation. That's the only encoding that has more than four chars. – Christianna 19/5, 2011 at 18:43

so in other words, there are no 5+ byte utf-8 characters? According to the wikipedia article, they stopped at 4. – Polemoniaceous 19/5, 2011 at 19:43

@Mohamed, yes, that's correct. UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long. – Autophyte 19/5, 2011 at 19:56

@Mohamed: perl -CS -e 'print chr(0x101)' | wc -c prints 2, perl -CS -e 'print chr(0x1020)' | wc -c prints 3, perl -CS -e 'print chr(0x1F608)' | wc -c prints 4. Those answers are in bytes for UTF-8 encoding those respective code points. The highest legal Unicode code point is 0x10FFFF, but Perl's extended UTF-8 can encode larger code points than that. For example, on a 64-bit machine: perl -CS -e 'print chr(0xFFFF_FFFF_FFFF_FFFF)' | wc -c reports 13 bytes. – Eden 20/5, 2011 at 13:26

K

8

There is no such thing as "1-byte, 2-byte, 3-byte, 4-byte, 5-byte, and 6-byte unicode characters".

You are probably talking about UTF-8 representations of Unicode characters. Similarly, strings in Java are internally represented in UTF-16, so Java's char type represents a 16-bit code unit of UTF-16. Each Unicode character is represented by either one or two of these code units, and each code unit can be written as \uxxxx in string literals (note that there are only 4 hex digits in these sequences, since code units are 16 bits long).
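A minimal Java sketch of the difference between code units and code points. The class and method names are made up for illustration; only String.length, String.codePointCount and Character.toChars come from the JDK.

```java
class CodeUnitsDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D161 lies outside the BMP, so Java stores it as a surrogate pair.
        String note = new String(Character.toChars(0x1D161));
        System.out.println(utf16Units(note)); // 2
        System.out.println(codePoints(note)); // 1
        // The same string written with two 4-digit escapes.
        System.out.println(note.equals("\uD834\uDD61")); // true
    }
}
```

So a single character outside the BMP needs two escapes in a Java string literal, one per code unit.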

So, if you need a reference of Unicode characters with their UTF-8 and UTF-16 representations, you can take a look at the table at fileformat.info.

Kiddush answered 19/5, 2011 at 18:43 Comment(2)

Thanks this is a great start!! – Polemoniaceous 19/5, 2011 at 18:54

@Mohamed Nuur You could also look at these two sites:unicode character table which has a neat lookup feature using descriptive names or, if you are interested in the basic ASCII set, try lookup tables – Forced 3/8, 2016 at 6:28

L

8

As axtavt points out, the concept of n-byte Unicode characters is meaningless. Assuming you mean UTF-8, a very simple table like the one below might help you with testing. Note that all example characters display in my browser (Chrome on Ubuntu), but your mileage may vary in terms of displaying, copying/pasting, etc.

UTF-8 bytes  Start    End       Example Character
1            U+0000   U+007F    ! EXCLAMATION MARK (U+0021)
2            U+0080   U+07FF    ¶ PILCROW SIGN (U+00B6)
3            U+0800   U+FFFF    ‱ PER TEN THOUSAND SIGN (U+2031)
4            U+10000  U+10FFFF  𝅘𝅥𝅯 MUSICAL SYMBOL SIXTEENTH NOTE (U+1D161)

In theory the original UTF-8 design allowed 5- or 6-byte sequences, but Unicode code points are capped at U+10FFFF, so more than 4 bytes are never required.
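If you want to check byte lengths yourself in Java, one option is to encode a code point and count the bytes. This is only a sketch: the class and method names are invented for the example, and it assumes the input is a valid, non-surrogate code point (Character.toChars throws for values above U+10FFFF, and a lone surrogate would be replaced with ? by the encoder).

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Encodes a single code point as UTF-8 and returns the number of bytes.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The four example characters from the table.
        int[] samples = {0x0021, 0x00B6, 0x2031, 0x1D161};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Bytes(cp));
        }
    }
}
```

For the four samples this should report 1, 2, 3 and 4 bytes respectively.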

Note that there's an important caveat here: Java's char is not a Unicode character; it's a 16-bit code unit of UTF-16, and it is not uncommon to see data streams which treat a non-BMP character (like U+1D161 above) as 2 characters, and UTF-8 it accordingly. For example:

Character: U+1D161
UTF-8 encoding: 0xF0 0x9D 0x85 0xA1
UTF-16 encoding: 0xD834 0xDD61
UTF-16 code units individually encoded as UTF-8: 0xED 0xA0 0xB4 0xED 0xB5 0xA1

Note that this has the effect of apparently showing a 6-byte UTF-8 character, but this is in fact not permitted by UTF-8. UTF-8 must be the encoding of the original code points, not the encoding of the UTF-16 code units that represent those points. This doesn't mean you don't see it in the wild though...
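Both byte sequences can be reproduced from Java itself. The sketch below compares String.getBytes with StandardCharsets.UTF_8 against DataOutputStream.writeUTF, which is documented to write "modified UTF-8" and encodes each surrogate separately. The class and helper names are made up for illustration, and the 2-byte length prefix that writeUTF emits is skipped when printing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SurrogateEncodingDemo {
    // Formats bytes[from..] as space-separated uppercase hex pairs.
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8: one sequence per code point.
    static String standardUtf8(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8), 0);
    }

    // Modified UTF-8 as written by writeUTF: one sequence per UTF-16 code unit.
    static String modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(buf.toByteArray(), 2); // skip the 2-byte length prefix
    }

    public static void main(String[] args) {
        String note = new String(Character.toChars(0x1D161));
        System.out.println(standardUtf8(note)); // F0 9D 85 A1
        System.out.println(modifiedUtf8(note)); // ED A0 B4 ED B5 A1
    }
}
```

Data written with writeUTF is meant to be read back with DataInputStream.readUTF, not by a general UTF-8 decoder.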

Lowis answered 20/5, 2011 at 0:57 Comment(4)

It is incorrect, broken, and stupid to take one code point that occupies two UTF-16 chunks and make two UTF-8 chunks out of it. You need to decode it back to a single code point and generate a single code point in return. OTHERWISE YOU GET THE WRONG ANSWER – Eden 20/5, 2011 at 2:5

The thing is, that is not UTF-8 when they do that. It's CESU-8, which is a blunder so common that The Unicode Standard was forced to mention it. It is not a UTF, so should never be used for external data exchange. In particular, It is not intended nor recommended as an encoding used for open information exchange. It is a mistake, one of those dumb things that Windows and/or Java people who aren't paying attention tend to screw up. – Eden 20/5, 2011 at 12:35

No, it's not valid UTF-8. Which I said in my answer. You seem to be very violently agreeing with me. Didn't know about the CESU-8 TR though, good piece of info, thanks. – Lowis 21/5, 2011 at 4:13

Of course, it doesn't help that Java has a "writeUTF" method that writes that rubbish. :) – Knox 12/10, 2016 at 4:51

C

1

For those who are just after actual samples, here are 4 of them.

a (1 byte, 0x61)
µ (2 bytes, 0xb5)
→ (3 bytes, 0x2192)
🐱 (4 bytes, 0x1f431)

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ee0883bad3b1204f45889ea450a53cf4

I am not entirely sure why 0xb5 is 2 bytes and 0x2192 is three. Perhaps someone can explain.
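A possible explanation, sketched in Java: the UTF-8 length depends only on how large the code point is. A 1-byte sequence carries 7 payload bits, 2 bytes carry 11, 3 bytes carry 16 and 4 bytes carry 21. 0xb5 is larger than 0x7F, so it no longer fits in one byte, and 0x2192 is larger than 0x7FF, so it needs three. The class and method names below are invented for the example.

```java
class Utf8Thresholds {
    // Returns how many bytes UTF-8 needs for a code point, by range alone.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        if (codePoint <= 0x7F) return 1;   // fits in 7 bits
        if (codePoint <= 0x7FF) return 2;  // fits in 11 bits
        if (codePoint <= 0xFFFF) return 3; // fits in 16 bits
        return 4;                          // up to 21 bits
    }

    public static void main(String[] args) {
        // The four samples listed above.
        int[] samples = {0x61, 0xB5, 0x2192, 0x1F431};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> " + utf8Length(cp));
        }
    }
}
```

This only counts bytes from the ranges; it does not check whether the value is a surrogate or an unassigned code point.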

Carthy answered 3/4, 2021 at 13:48 Comment(0)
