How do I match "i" with Turkish i in java?

Asked 9/6, 2015 at 6:45 Answered 9/6, 2015 at 7:32

Solved java unicode normalization unicode-normalization

I want to match the lowercase form of the English "I" (i) with the lowercase form of the Turkish "İ". They render as the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase()); an i followed by a dot is printed (this site does not display it properly).

Is there a way to match them (preferably without hard-coding it)? I want the program to match identical glyphs regardless of the language and the Unicode code point. Is this possible?

I've tested normalization with no success.

import java.text.Normalizer;

public static void main(String... a) {
    String iTurkish = "\u0130"; // "İ"
    String iEnglish = "I";
    prin(iTurkish);
    prin(iEnglish);
}

// Prints s, its NFD form, its lower case, and the NFD form of its lower case.
private static void prin(String s) {
    System.out.print(s);
    System.out.print(" -  Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
    System.out.print(" - lower case: " + s.toLowerCase());
    System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
    System.out.println();
}

The result is not shown properly on this site, but the first line (iTurkish) still has the ̇ (a combining dot) next to the lowercase i.

Purpose and Problem

This will be a multilingual dictionary. I want the program to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both texts to lower case. İFEL becomes i(dot)fel, and "if" is not recognized as a prefix of it.
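The failing prefix check described above can be reproduced with a minimal sketch. The class and method names are hypothetical, and Locale.ROOT is an assumption made here so that the result does not depend on the machine's default locale:

```java
import java.util.Locale;

public class PrefixProblem {

    // Naive case-insensitive prefix test: lowercases both sides and compares.
    // Locale.ROOT is an assumption; the code in the question uses the default locale.
    static boolean naiveStartsWith(String entry, String typed) {
        return entry.toLowerCase(Locale.ROOT)
                    .startsWith(typed.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String entry = "\u0130FEL"; // "İFEL"
        // Lowercasing yields i + U+0307 + "fel", so "i" is a prefix but "if" is not.
        System.out.println(naiveStartsWith(entry, "i"));  // prints true
        System.out.println(naiveStartsWith(entry, "if")); // prints false
    }
}
```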

Lysander answered 9/6, 2015 at 6:45 Comment(19)

The two letters do not have the same Unicode code point, so they don't match. – Provide 9/6, 2015 at 6:49

You can strip diacritic from string with commons-lang: org.apache.commons.lang3.StringUtils.stripAccents(String) – Edrisedrock 9/6, 2015 at 6:50

@Edrisedrock Wouldn't it prevent differentiation of i from ı ? I would consider it if there is no way to do this. – Lysander 9/6, 2015 at 6:52

@Provide true but they are the same glyph. Isn't the point of normalization matching them? – Lysander 9/6, 2015 at 6:52

I'm not sure what you want to achieve. – Edrisedrock 9/6, 2015 at 6:58

@Edrisedrock I want to write "if" to a JTextArea and the program to select the "İFEL" from a JList. I did the algorithm. It first converts it to lower case to prevent case sensitivity. İFEL becomes i(dot)fel. So the program does not see that "İFEL" starts with "if". – Lysander 9/6, 2015 at 7:2

@Provide I would like to consider it as a last resort. I want the program to be multi lingual. Coding each letter by hand does not seem plausible. – Lysander 9/6, 2015 at 7:5

Is "İFEL" enum value? If yes, you can create the toString(String) and fromString(String) methods, that would match ASCII representation with proper value. – Edrisedrock 9/6, 2015 at 7:8

@Edrisedrock It is a string read from a txt file. – Lysander 9/6, 2015 at 7:10

With the int values of the chars, the if works; see: ideone.com/ZlUB2r – Provide 9/6, 2015 at 7:11

What is your code now? What about changing it to if (StringUtils.stripAccents(value).startsWith(jTextValue)) ....? – Edrisedrock 9/6, 2015 at 7:14

@Provide charAt(0) matches because iTurkish is 2 chars. An i and a dot. In the link the dot is invisible but when I copied it to netbeans, dot is shown. if (turkNorm.equals(engNorm)) returns false. – Lysander 9/6, 2015 at 7:20

Of course it has two bytes 'cause of the unicode. But if you only want to match the i you can simply check the first byte. Or not?! – Provide 9/6, 2015 at 7:24

@Provide "İFEL".toLowerCase() starts with "i", and that works, but it does not start with "if", and that is the problem. – Lysander 9/6, 2015 at 7:27

@Lysander If you strip the diacritics after normalizing, as suggested, the result should start with "if". – Periscope 9/6, 2015 at 7:28

@Edrisedrock StringUtils doesn't seem to have such a method. – Lysander 9/6, 2015 at 7:35

commons.apache.org/proper/commons-lang/javadocs/api-3.4/org/… – Edrisedrock 9/6, 2015 at 7:39

@Edrisedrock I don't have that library. Where am I supposed to get it? – Lysander 9/6, 2015 at 7:46

commons.apache.org/proper/commons-lang/download_lang.cgi – Edrisedrock 9/6, 2015 at 8:13

If you print out the hex values of the characters you're seeing, the difference is clear:

İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69

Normalizing the Turkish İ doesn't give you an English I, instead it gives you an English I followed by a diacritic, 0x307. This is correct, and to be expected by the normalization process. Normalization is not a "Convert to ASCII" operation. As the documentation for Normalizer mentions, the process it's following is a very rigorously defined standard, the Unicode Standard Annex #15 — Unicode Normalization Forms.
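One way to produce a hex dump like the one above is sketched here. The class and method names are hypothetical, and Locale.ROOT is an assumption chosen so that the lowercasing does not depend on the default locale:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.stream.Collectors;

public class CodePointDump {

    // Formats every code point of s as hex, separated by spaces.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> "0x" + Integer.toHexString(cp))
                .collect(Collectors.joining(" "));
    }

    // Describes s, its NFD form, its lower case, and the NFD form of its lower case.
    static String describe(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        String lower = s.toLowerCase(Locale.ROOT); // assumption: locale-independent lowercasing
        String lowerNfd = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return hex(s) + " - Normalized: " + hex(nfd)
                + " - Lower case: " + hex(lower)
                + " - Lower case normalized: " + hex(lowerNfd);
    }

    public static void main(String[] args) {
        System.out.println(describe("\u0130")); // Turkish İ
        System.out.println(describe("I"));      // English I
    }
}
```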

There are numerous ways to strip diacritics, either before or after normalizing. What you need will depend on the specifics, but for your use case I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:

String asciiString = CharMatcher.ascii().retainFrom(normalizedString);

This answer goes into more depth about what \p{InCombiningDiacriticalMarks} does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". This is both closer to "correct" and faster than the Pattern based approach.
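For readers without Guava, the same normalize-then-retain-ASCII idea can be sketched with the JDK alone. This is a hedged illustration, not the code from either answer: the class name, the fold and startsWithFolded helpers, and the use of Locale.ROOT are all assumptions introduced here.

```java
import java.text.Normalizer;
import java.util.Locale;

public class FoldedPrefixMatch {

    // Decomposes s (NFD), lowercases it, and keeps only ASCII characters.
    // Combining marks such as U+0307 are non-ASCII, so they are dropped.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String lower = decomposed.toLowerCase(Locale.ROOT); // assumption: locale-independent
        StringBuilder ascii = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (c < 0x80) {
                ascii.append(c);
            }
        }
        return ascii.toString();
    }

    // Hypothetical helper: prefix test on the folded forms of both strings.
    static boolean startsWithFolded(String entry, String typed) {
        return fold(entry).startsWith(fold(typed));
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0130FEL"));                   // prints ifel
        System.out.println(startsWithFolded("\u0130FEL", "if")); // prints true
    }
}
```

Note that characters with no decomposition, such as the dotless ı (U+0131), are removed entirely by this approach rather than mapped to i, which is one reason retaining only ASCII is a quick fix and not a general solution.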

Periscope answered 9/6, 2015 at 7:14 Comment(4)

+1. Interesting side effect: "İ".toLowerCase() seems to decide it needs to decompose the character. At least here ... – Coma 9/6, 2015 at 7:16

Everybody seems to suggest stripping diacritics. I will probably do it this way. I guess matching "ıf" with "İF" is better than not matching "if" with "İF". Though I'm not sure if this would be the case. – Lysander 9/6, 2015 at 7:30

@Lysander - as you've presented it, the best solution to your problem is to strip the diacritics. It's possible you have additional requirements you haven't told us about which might merit a different solution. But broadly speaking, if you want someone to be able to type English characters and map them to Turkish ones, you're going to have to strip some information, and you'll be hard pressed to avoid both false positives and false negatives. Your solution should try to minimize whichever is worse for your use case. – Periscope 9/6, 2015 at 7:42

Even though this is the answer that guided me in the right direction, I prefer the code in the Rafiq's link – Lysander 9/6, 2015 at 18:26

-1

You can use the code below:

import java.text.Normalizer;
import java.util.regex.Pattern;

public static void main(String... a) {
    String iTurkish = "\u0130"; // "İ"
    String iEnglish = "I";
    prin(iTurkish);
    prin(iEnglish);
}

private static void prin(String s) {
    System.out.print(s);
    // Decompose first, then remove the combining marks left behind by NFD.
    String nfdNormalizedString = Normalizer.normalize(s, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    System.out.print(" -  Normalized : " + pattern.matcher(nfdNormalizedString).replaceAll(""));
    System.out.print(" - lower case: " + s.toLowerCase());
    System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(pattern.matcher(nfdNormalizedString).replaceAll("").toLowerCase(), Normalizer.Form.NFD));
    System.out.println();
}

Or see Converting Symbols, Accent Letters to English Alphabet

Riegel answered 9/6, 2015 at 7:32 Comment(3)

Not really nice to copy code from Utils class and present here as own. – Edrisedrock 9/6, 2015 at 7:33

Why no vote? I provided the link "https://mcmap.net/q/100874/-converting-symbols-accent-letters-to-english-alphabet". Did you not see it, "agad"? – Riegel 9/6, 2015 at 7:41

+1 for providing a link to the answer and adapting it to the given code, even though it would have been better if you had provided the link first and then made clear that you were using someone else's code. – Lysander 9/6, 2015 at 7:59
