How do I match "i" with the Turkish i in Java?

I want to match the lowercase form of the English "I" (i) with the lowercase form of the Turkish "İ" (i). They are the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase()); the character i followed by a dot is printed (this site does not display it properly).

Is there a way to match them (preferably without hard-coding it)? I want the program to match identical glyphs regardless of language and Unicode code point. Is this possible?

I've tested normalization with no success.

import java.text.Normalizer;

public static void main(String... a) {
    String iTurkish = "\u0130"; // "İ", capital I with dot above
    String iEnglish = "I";
    prin(iTurkish);
    prin(iEnglish);
}

// Prints s, its NFD form, its lower-case form, and the NFD form of the lower-case form.
private static void prin(String s) {
    System.out.print(s);
    System.out.print(" -  Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
    // toLowerCase() without arguments uses the JVM's default locale.
    System.out.print(" - lower case: " + s.toLowerCase());
    System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
    System.out.println();
}

The result is not shown properly on this site, but in the first line (iTurkish) the combining dot (̇) still appears next to the lowercase i.

Purpose and Problem

This will be a multilingual dictionary. I want the program to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both strings to lower case. "İFEL" becomes "i(dot)fel", so "if" is not recognized as its prefix.
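The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the asker's actual code: the class name PrefixMismatchDemo and the helper lower are made up for illustration, and Locale.ROOT is assumed so that the result does not depend on the JVM's default locale.

```java
import java.util.Locale;

class PrefixMismatchDemo {
    // Hypothetical helper: lower-cases with a fixed, non-Turkish locale.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String entry = "\u0130FEL";          // "İFEL"
        String lowered = lower(entry);       // i, U+0307 (combining dot above), f, e, l

        System.out.println(lowered.length());          // 5, one char more than the input
        System.out.println(lowered.startsWith("i"));   // true
        System.out.println(lowered.startsWith("if"));  // false: the combining dot sits between i and f
    }
}
```

The extra combining character between "i" and "f" is what makes the prefix test fail.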

Lysander answered 9/6, 2015 at 6:45 Comment(19)
The two letters are not the same Unicode code point, so they don't match. - Provide
You can strip diacritics from a string with commons-lang: org.apache.commons.lang3.StringUtils.stripAccents(String) - Edrisedrock
@Edrisedrock Wouldn't that prevent differentiating i from ı? I would consider it if there is no other way to do this. - Lysander
@Provide True, but they are the same glyph. Isn't the point of normalization to match them? - Lysander
I'm not sure what you want to achieve. - Edrisedrock
@Edrisedrock I want to type "if" into a JTextArea and have the program select "İFEL" from a JList. I have written the algorithm. It first converts the text to lower case to avoid case sensitivity. İFEL becomes i(dot)fel, so the program does not see that "İFEL" starts with "if". - Lysander
@Provide I would like to consider it as a last resort. I want the program to be multilingual. Coding each letter by hand does not seem feasible. - Lysander
Is "İFEL" an enum value? If so, you can create toString(String) and fromString(String) methods that match the ASCII representation with the proper value. - Edrisedrock
@Edrisedrock It is a string read from a txt file. - Lysander
With the int values of the chars, the if works; see: ideone.com/ZlUB2r - Provide
What is your code now? What about changing it to if (StringUtils.stripAccents(value).startsWith(jTextValue)) ....? - Edrisedrock
@Provide charAt(0) matches because the lowercased iTurkish is 2 chars: an i and a dot. In the link the dot is invisible, but when I copied it to NetBeans, the dot was shown. if (turkNorm.equals(engNorm)) returns false. - Lysander
Of course it has two bytes because of Unicode. But if you only want to match the i, you can simply check the first byte. Or not?! - Provide
@Provide "İFEL".toLowerCase() starts with "i", and that works, but it does not start with "if", and that is the problem. - Lysander
@Lysander If you strip the diacritics after normalizing, as suggested, the result should start with "if". - Periscope
@Edrisedrock StringUtils doesn't seem to have such a method. - Lysander
commons.apache.org/proper/commons-lang/javadocs/api-3.4/org/… - Edrisedrock
@Edrisedrock I don't have that library. Where am I supposed to get it? - Lysander
commons.apache.org/proper/commons-lang/download_lang.cgi - Edrisedrock
P
11

If you print out the hex values of the characters you're seeing, the difference is clear:

İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69

Normalizing the Turkish İ doesn't give you an English I; it gives you an English I followed by a combining diacritic, 0x307 (COMBINING DOT ABOVE). This is correct and is exactly what the normalization process is expected to do. Normalization is not a "convert to ASCII" operation. As the documentation for Normalizer mentions, the process follows a rigorously defined standard: Unicode Standard Annex #15, Unicode Normalization Forms.
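The hex values quoted above can be reproduced with a small helper. This is only a sketch: the class name CodePointDump and the method hex are illustrative names, and Locale.ROOT is assumed for lower-casing so the output does not depend on the default locale.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.stream.Collectors;

class CodePointDump {
    // Formats every code point of s as hex, separated by spaces.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> "0x" + Integer.toHexString(cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String turkish = "\u0130"; // "İ"
        String nfd = Normalizer.normalize(turkish, Normalizer.Form.NFD);

        System.out.println(hex(turkish));                           // 0x130
        System.out.println(hex(nfd));                               // 0x49 0x307
        System.out.println(hex(turkish.toLowerCase(Locale.ROOT)));  // 0x69 0x307
    }
}
```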

There are numerous ways to strip diacritics, either before or after normalizing. The right choice depends on the specifics of your use case, but for what you describe I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:

String asciiString = CharMatcher.ascii().retainFrom(normalizedString);

This answer goes into more depth about what \p{InCombiningDiacriticalMarks} does and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". It is both closer to "correct" and faster than the Pattern-based approach.
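For readers without Guava on the classpath, the same retain-only-ASCII idea can be sketched with the JDK alone. This is an illustrative stand-in for the CharMatcher call, not the code from the answer: the names AsciiFoldingMatch, fold and startsWithFolded are invented here, and lower-casing with Locale.ROOT after folding is an assumption about how the prefix comparison would be done.

```java
import java.text.Normalizer;
import java.util.Locale;

class AsciiFoldingMatch {
    // Decomposes s (NFD), keeps only ASCII code points, then lower-cases the rest.
    static String fold(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder ascii = new StringBuilder(nfd.length());
        nfd.codePoints()
           .filter(cp -> cp < 128)            // drops U+0307 and every other non-ASCII code point
           .forEach(ascii::appendCodePoint);
        return ascii.toString().toLowerCase(Locale.ROOT);
    }

    // Case- and diacritic-insensitive prefix test on the folded forms.
    static boolean startsWithFolded(String entry, String typed) {
        return fold(entry).startsWith(fold(typed));
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0130FEL"));                    // ifel
        System.out.println(startsWithFolded("\u0130FEL", "if"));  // true
        System.out.println(fold("\u0131f"));                      // f
    }
}
```

The last line shows the cost of this quick fix: the dotless ı (U+0131) has no decomposition, so it is dropped entirely rather than mapped to "i". If the typed text folds to an empty string, startsWith returns true for every entry, so a caller would probably want to guard against that case.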

Periscope answered 9/6, 2015 at 7:14 Comment(4)
+1. An interesting side effect: "İ".toLowerCase() seems to decide it needs to decompose the character. At least here ... - Coma
Everybody seems to suggest stripping diacritics. I will probably do it this way. I guess matching "ıf" with "İF" is better than not matching "if" with "İF". Though I'm not sure if this would be the case. - Lysander
@Lysander - as you've presented it, the best solution to your problem is to strip the diacritics. It's possible you have additional requirements you haven't told us about which might merit a different solution. But broadly speaking, if you want someone to be able to type English characters and map them to Turkish ones, you're going to have to strip some information, and you'll be hard pressed to avoid both false positives and false negatives. Your solution should try to minimize whichever is worse for your use case. - Periscope
Even though this is the answer that guided me in the right direction, I prefer the code in Rafiq's link. - Lysander
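The trade-off between false positives and false negatives mentioned in the comments can be made concrete with locale-aware lower-casing, which is one alternative to stripping diacritics. This is a hedged sketch rather than anything proposed in the thread: the class TurkishLocaleLowercase and the helper lowerTr are hypothetical, and it assumes the dictionary entry is known to be Turkish.

```java
import java.util.Locale;

class TurkishLocaleLowercase {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Lower-cases using Turkish casing rules: İ -> i and I -> ı.
    static String lowerTr(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // Turkish entry: the dotted capital maps to a plain ASCII i, so the prefix matches.
        System.out.println(lowerTr("\u0130FEL").startsWith("if"));  // true

        // English entry: the plain capital I maps to dotless ı, so the prefix no longer matches.
        System.out.println(lowerTr("IF").startsWith("if"));         // false
    }
}
```

Applying Turkish rules fixes the Turkish entry without discarding information, but it produces a false negative for English text, which is why a multilingual dictionary would need to know the language of each entry to use this approach.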
R
-1

You can use the code below:

import java.text.Normalizer;
import java.util.regex.Pattern;

// Matches runs of characters in the Combining Diacritical Marks block (U+0300 to U+036F).
private static final Pattern DIACRITICS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

public static void main(String... a) {
    String iTurkish = "\u0130"; // "İ"
    String iEnglish = "I";
    prin(iTurkish);
    prin(iEnglish);
}

private static void prin(String s) {
    System.out.print(s);
    // Decompose first so the dot above becomes a separate combining mark, then remove it.
    String nfdNormalizedString = Normalizer.normalize(s, Normalizer.Form.NFD);
    String stripped = DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
    System.out.print(" -  Normalized : " + stripped);
    System.out.print(" - lower case: " + s.toLowerCase());
    System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(stripped.toLowerCase(), Normalizer.Form.NFD));
    System.out.println();
}

Or see Converting Symbols, Accent Letters to English Alphabet

Riegel answered 9/6, 2015 at 7:32 Comment(3)
It's not really nice to copy code from a Utils class and present it here as your own. - Edrisedrock
Why no vote? I provided the link "https://mcmap.net/q/100874/-converting-symbols-accent-letters-to-english-alphabet". Did you not see it, "agad"? - Riegel
+1 for providing a link to the answer and adapting it to the given code, though it would be better if you had first provided the link and then clarified that you are using someone else's code. - Lysander

© 2022 - 2024 — McMap. All rights reserved.