Get unicode value of a character
F

6

80

Is there a way in Java to get the Unicode escape for any character? For example:

Suppose a method getUnicode(char c). A call getUnicode('÷') should return \u00f7.

Fortunate answered 8/2, 2010 at 8:42 Comment(1)
Characters are already unicode in Java.Viperine
R
75

You can do it for any Java char using the one liner here:

System.out.println( "\\u" + Integer.toHexString('÷' | 0x10000).substring(1) );

But it's only going to work for the Unicode characters up to Unicode 3.0 (the Basic Multilingual Plane), which is why I specified that you can do it for any Java char.

Java was designed well before Unicode 3.1 arrived, so Java's char primitive is inadequate to represent Unicode 3.1 and up: there's no longer a "one Unicode character to one Java char" mapping (instead a monstrous hack is used).

So you really have to check your requirements here: do you need to support Java char or any possible Unicode character?
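If the answer is "any possible Unicode character", here is a minimal sketch, assuming the desired output is one escape per UTF-16 code unit (so a character outside the BMP becomes two escapes). The class name UnicodeEscape and the helper escape are made up for illustration; Character.toChars is the standard-library call that splits a code point into one or two chars.

```java
public class UnicodeEscape {
    // Hypothetical helper: formats one code point as one or two u-escapes,
    // one per UTF-16 code unit.
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x00F7));   // DIVISION SIGN, inside the BMP
        System.out.println(escape(0x1D11E));  // MUSICAL SYMBOL G CLEF, outside the BMP
    }
}
```

The first call prints \u00f7 and the second prints \ud834\udd1e, the surrogate pair for U+1D11E.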

Roarke answered 8/2, 2010 at 9:7 Comment(5)
Thanks. I have checked all characters with this way and it looks fine for now.Fortunate
The "monstrous hack" is UTF-16, which is widely used. It may not be ideal, but it's well-understood and much better than only supporting UCS-2.Middlebrow
@Joachim: However, having String.charAt now return "half a character" and String.length return something that can be different from the number of characters is ugly, no? (character here meaning Unicode code point, not Java Character) The String class was supposed to be (and was before Unicode 3.1) independent of encoding issues.Cautery
@Joachim: I was referring exactly to what Thilo described. To me the real issue is that to keep backward compatibility we have a method, charAt(...), that does NOT return a character. And that is bad. The method name stayed the same, but its Javadoc got rewritten. And now we have codePointAt(...), which hardly anyone knows about, and the issue is very confusing anyway. It's not the Java designers' fault per se because, as I wrote in my answer, Java was designed way before Unicode 3.1 came out. It's just kinda sad that char is 16 bit instead of 32.Roarke
Yes, I'm aware of that, and it is a problem. I don't deny that. But at least we've got a well-understood "fix" (or rather workaround) instead of falling into the same encoding-hell that the whole "Oh noes! Many people can't write their language using ASCII" problem produced. UTF-16 is not ideal, but it is standardized and well-understood.Middlebrow
H
41

If you have Java 5, use char c = ...; String s = String.format ("\\u%04x", (int)c);

If your source isn't a Unicode character (char) but a String, you must use charAt(index) to get the Unicode character at position index.

Don't use codePointAt(index) because that will return full Unicode code points (up to 21 bits, maximum U+10FFFF), which can't be represented with just 4 hex digits (they need up to 6). See the docs for an explanation.

[EDIT] To make it clear: This answer doesn't use Unicode code points but the units Java uses to represent Unicode characters (i.e. 16-bit chars, with surrogate pairs for anything larger), since char is 16 bits and Unicode needs 21. The question should be: "How can I convert char to a 4-digit hex number", since it's not (really) about Unicode.
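To illustrate the difference between char positions and code points, here is a small sketch. The class name CodePointWalk and the helper codePoints are hypothetical. It assumes you want to visit each code point once, which means advancing the index by Character.charCount(cp) rather than by 1, because codePointAt takes a char index, not a code point index.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Hypothetical helper: lists the code points of s in U+XXXX notation.
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // i is a char index
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);       // 1 for BMP, 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" followed by MUSICAL SYMBOL G CLEF (outside the BMP)
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(s.length());                       // number of chars
        System.out.println(s.codePointCount(0, s.length()));  // number of code points
        System.out.println(codePoints(s));
    }
}
```

For this string, length() is 3 while codePointCount is 2, and the helper returns [U+0061, U+1D11E].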

Hayrick answered 8/2, 2010 at 9:13 Comment(7)
@Aaron Digulla: it's a common mistake to think that charAt(...) returns a Unicode character. It doesn't. charAt(...) only returns a Unicode character if your String is made of Unicode 3.0/BMP characters. I disagree that he shouldn't use codePointAt. He should use codePointAt and a method that is capable of encoding characters outside the BMP.Roarke
codePointAt would be better, but assuming you really need it, it gets tricky to figure out the correct value for index.Cautery
From the question (4-digit hex), it's clear that Saurabh isn't really interested in real Unicode characters (because they don't fit into 4 hex digits), so using codePointAt() would be wrong.Hayrick
@WizardOfOdds: Do you have a working example how to get the indexes you need to call codePointAt?Hayrick
@Aaron Digulla: the thing is, there's no index magic when calling codePointAt(...). codePointAt(...) always returns a Unicode character, even if it's outside the BMP. It's when calling charAt(...) that you can get into trouble, because if you're calling charAt(...) after a Unicode character outside the BMP, there's no guarantee you'll be reading a character. There are examples around with Strings containing musical notes (which are characters outside the BMP), if I recall correctly. But maybe I misunderstood your question?Roarke
@WizardOfOdds: My guess is that the guy asking will always convert a whole string, so charAt() is safe. But I get your point: You can loop over 0 to s.codePointCount(0,s.length()) and then call s.codePointAt() for each value of the iterator.Hayrick
hi i m saurabh(ranu) please provide me details of hibernate full text search and configuration of it......Amnion
R
14
private static String toUnicode(char ch) {
    return String.format("\\u%04x", (int) ch);
}
Revealment answered 7/8, 2013 at 8:20 Comment(3)
Copies an existing answer from 3 years previous.Tanguy
Yet it gives a much clearer answer than the best answer. I mean, what the heck is this: ( "\\u" + Integer.toHexString('÷' | 0x10000).substring(1) )Adumbrate
"\\u" + String.format("%04x", (int) c).toUpperCase()Iota
G
10
char c = 'a';
String a = Integer.toHexString(c); // gives you---> a = "61"
Geffner answered 11/6, 2014 at 14:29 Comment(0)
V
1

I found this nice code on the web.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class Unicode {

    public static void main(String[] args) {
        System.out.println("Use CTRL+C to quit the program.");

        // Create the reader for reading in the text typed in the console.
        InputStreamReader inputStreamReader = new InputStreamReader(System.in);
        BufferedReader bufferedReader = new BufferedReader(inputStreamReader);

        try {
            String line;
            // Stop at end of input (readLine() returns null) or on an empty line.
            while ((line = bufferedReader.readLine()) != null && line.length() > 0) {
                for (int index = 0; index < line.length(); index++) {

                    // Convert the code point to a hexadecimal code.
                    String hexCode = Integer.toHexString(line.codePointAt(index)).toUpperCase();

                    // Pad with leading zeros, then keep only the last four hex digits
                    // (this truncates code points above 0xFFFF).
                    String hexCodeWithAllLeadingZeros = "0000" + hexCode;
                    String hexCodeWithLeadingZeros = hexCodeWithAllLeadingZeros.substring(hexCodeWithAllLeadingZeros.length() - 4);

                    System.out.println("\\u" + hexCodeWithLeadingZeros);
                }
            }
        } catch (IOException ioException) {
            ioException.printStackTrace();
        }
    }
}

Original Article
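If every non-ASCII character comes out as \u003F (which is '?'), one plausible cause is that the console bytes are being decoded with a charset that doesn't match the input. That is an assumption about the setup, not something this thread confirms. Here is a rough sketch of reading with an explicit charset. The class name ExplicitCharset and the helper escapeFirstLine are made up for illustration, and UTF-8 is only an example choice; your console may use something else.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: decodes the first line of the stream as UTF-8
    // and escapes every char of it with four upper-case hex digits.
    static String escapeFirstLine(InputStream in) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line = reader.readLine();
            StringBuilder sb = new StringBuilder();
            if (line != null) {
                for (int i = 0; i < line.length(); i++) {
                    sb.append(String.format("\\u%04X", (int) line.charAt(i)));
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated console input: two Cyrillic letters encoded as UTF-8 bytes.
        byte[] utf8 = "\u043B\u0438".getBytes(StandardCharsets.UTF_8);
        System.out.println(escapeFirstLine(new ByteArrayInputStream(utf8)));
    }
}
```

With the simulated input above, the output is \u043B\u0438 rather than two copies of \u003F. For real console input you would pass System.in instead of the byte array.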

Vue answered 8/2, 2010 at 8:45 Comment(5)
Thanks. You gave me what I asked for. However, when I try some Russian characters, it returns the same Unicode value for all of them. I think the Unicode value should be different for different characters. I have tried the following characters - л, и, ц, т, я - and it returns \u003F.Fortunate
I'm pretty sure that piece of code isn't correct for codepoints above 0xFFFF.Roarke
Russian characters should be on the Basic Multilingual Plane, though (below 0xFFFF).Cautery
@Thilo: oh I know, I wasn't commenting on Saurabh's russian example. I tried his characters with my method before posting the comment and they work fine. I was just stating that I'm pretty sure the method there ain't working with chars outside the BMP.Roarke
It's amazing how much code someone must write to solve a simple problem. Aaron's solution was 40 characters long. Here we have 1124.Bibliotheca
J
1

Do you specifically need the Unicode escape? In Java it is simpler to work with the decimal value (the number used in HTML character codes), because then you can simply cast between char and int:

char a = 98;
char b = 'b';
char c = (char) (b+0002);

System.out.println(a);
System.out.println((int)b);
System.out.println((int)c);
System.out.println(c);

Gives this output

b
98
100
d
Jairia answered 26/2, 2015 at 3:33 Comment(0)
