Converting Symbols, Accent Letters to English Alphabet

Asked 17/6, 2009 at 18:31 Answered 26/6, 2017 at 10:50

Solved java unicode special-characters diacritics

145

As you know, there are thousands of characters in the Unicode charts, and I want to convert all the look-alike characters to the corresponding letters of the English alphabet.

For instance here are a few conversions:

ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...

I saw that there are more than 20 versions of the letter A/a, and I don't know how to classify them. They look like needles in a haystack.

The complete list of unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and see the variations of letters.

How can I convert all these with Java? Please help me :(

Kistna answered 17/6, 2009 at 18:31 Comment(9)

See this question: #249587 - there should also be some other questions about this topic, but I can't find them at the moment. – Surmullet 17/6, 2009 at 18:36

Should your third example be Ȳ → Y? – Decay 17/6, 2009 at 19:42

Why do you want to do this? If we knew what your overall goal was, we might be able to be more helpful. – Counterplot 17/6, 2009 at 20:1

David, you know some EMOs use different chars in sentences. Here's an example: ฬ.¢. tђє ฬยη∂єг¢คקђ Ŧค๓เℓy <-- Solve this :) @schnaader, I think that is what I'm looking for, but it is not in Java. – Kistna 17/6, 2009 at 20:4

This conversation has been done before - see @Surmullet above. – Psycho 17/6, 2009 at 20:7

I said that I'm looking for something in Java. – Kistna 17/6, 2009 at 20:14

Related (not necessarily duplicate) Java question: 'Method to substitute foreign for English characters in Java?', #1017455 – Ultrasonics 27/6, 2009 at 12:36

look for Unihandecode – Threewheeler 4/6, 2015 at 16:59

renenyffenegger.ch/development/Unicode/… – Hoelscher 27/7, 2017 at 13:6

211

Reposting my post from How do I remove diacritics (accents) from a string in .NET?

This method works fine in Java (purely for the purpose of removing diacritical marks, a.k.a. accents).

It first normalizes the string to NFD, which converts each accented character into its unaccented counterpart followed by its combining diacritics. A regex then strips off the diacritics.

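import java.text.Normalizer;
import java.util.regex.Pattern;

public final class Accents {  // class name is arbitrary
    // Compiled once; matches runs of combining marks in U+0300..U+036F.
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String deAccent(String str) {
        // NFD splits each accented letter into base letter + combining mark(s).
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
    }
}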
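
A quick way to see what this does and where it stops, as a minimal sketch (the class name `DeAccentDemo` is made up for illustration): characters that have a canonical decomposition lose their accents, while letters such as Ł, đ or ı, which Unicode does not decompose, pass through unchanged.

```java
import java.text.Normalizer;

final class DeAccentDemo {
    // Same technique as above: decompose to NFD, then drop the combining marks.
    static String deAccent(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "A" followed by n-with-tilde (U+00F1)
        System.out.println(deAccent("A\u00F1"));              // prints "An"
        // L-with-stroke (U+0141) has no decomposition, so only the two
        // acute accents on the o and the z are removed here.
        System.out.println(deAccent("\u0141\u00F3d\u017A"));
    }
}
```

So for Polish, Vietnamese or Turkish input you would still need explicit replacements for the letters that survive.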

Octoroon answered 31/7, 2009 at 22:6 Comment(9)

InCombiningDiacriticalMarks doesn't convert all cyrillics. For example Општина Богомила is untouched. It would be nice if one could convert it to Opstina Bogomila or something – Tracheitis 14/5, 2010 at 15:47

It doesn't transliterate at all. It merely removes decomposed diacritical marks ("accents"). The previous step (Form.NFD) breaks down á into a + ´, i.e. it decomposes the accented character into an unaccented character plus a diacritical mark. This would convert Cyrillic Ѽ into Ѡ but not further. – Fatten 28/7, 2010 at 10:44

George posted that it could be better use the \\p{IsM} instead of \\p{InCombiningDiacriticalMarks} at glaforge.appspot.com/article/… Note that I have not tested it. – Flush 26/3, 2012 at 9:42

\\p{IsM} does not seem to work for spanish accents like á ó ú ñ é í . On the contrary, "\\p{InCombiningDiacriticalMarks}+ is working good for this – Virg 5/3, 2013 at 9:23

It doesn't work for all special characters - I submitted a wrong issue for Android for that to learn that -> code.google.com/p/android/issues/detail?id=189515 Anybody know correct way to do this? – Jacquline 11/1, 2016 at 17:50

@Tajchert Your issue was invalid, since Ł cannot be decomposed. What's wrong is not the normalizer, but using it to strip accents. – Luci 3/2, 2016 at 20:5

@KarolS that is why I had wrote " I submitted a wrong issue" as I know this was not a bug itself but using not correct class to this function. Which is done in this answer. – Jacquline 12/2, 2016 at 18:12

this can't convert ı to i in Turkish – Riendeau 26/6, 2016 at 1:23

It doesn't work for all characters bro. For example, my input is "Xin chào, chúng ta sẽ đi tới Việt Nam" then output is "Xin chao, chung ta se đi toi Viet Nam", pls pay attention to letter "đ" – Wyoming 26/10, 2017 at 7:36

It's a part of Apache Commons Lang as of ver. 3.0.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/

Opportunist answered 3/11, 2012 at 13:28 Comment(5)

This solution is amazing. It works with Greek too! Thank you. – Waldron 25/9, 2014 at 18:49

It's not perfect for Polish characters translation from ł and Ł is missing: input: ŚŻÓŁĄĆĘŹąółęąćńŃ output: SZOŁACEZaołeacnN – Vancevancleave 21/8, 2016 at 11:3

Nice utility but since its code is exactly the same as the one showed in the accepted answer, and you don't want to add a dependency on Commons Lang, you can just use the aforementioned snippet. – Chancellor 24/1, 2017 at 15:51

with apache common in my case: Đ not convert to D – Riddell 20/9, 2017 at 7:7

@Hoang, Robert maybe a chance to send a pull request :) – Rahm 20/9, 2017 at 16:24

Attempting to "convert them all" is the wrong approach to the problem.

Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language, with their own meaning, sound, etc. Removing those marks is just the same as replacing random letters in an English word. And this is before you even go on to consider Cyrillic-script languages and other scripts such as Arabic, which simply cannot be "converted" to English.

If you must, for whatever reason, convert characters, then the only sensible way to approach this is to first reduce the scope of the task at hand. Consider the source of the input - if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.

By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary based lookup / replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find and replaces tend to be blindingly quick.

This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as it was in our case) took perhaps 1 man day to produce, to cover all diacritic marks for all Western European languages.
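
To illustrate the dictionary idea, here is a minimal sketch rather than the lookup arrays we actually used. The class name `WesternFolding` and the handful of entries are hypothetical; a real table would cover whatever input range you settled on, and choices such as ß to "ss" depend on your audience.

```java
import java.util.HashMap;
import java.util.Map;

final class WesternFolding {
    // Deliberately tiny, illustrative table. Values are strings so that one
    // character can expand to several (ligatures, sharp s).
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00E9', "e");   // e with acute
        TABLE.put('\u00F1', "n");   // n with tilde
        TABLE.put('\u00E7', "c");   // c with cedilla
        TABLE.put('\u00C6', "AE");  // capital AE
        TABLE.put('\u00DF', "ss");  // sharp s (a German-audience assumption)
        TABLE.put('\u0142', "l");   // l with stroke
    }

    // Replaces every character found in the table; everything else is kept.
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = TABLE.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

A hash lookup per character keeps the replacement pass linear in the input length, which is why this stays fast even with a table of a few hundred entries.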

Dominy answered 17/6, 2009 at 20:18 Comment(1)

iAn, thanks for answering. Actually I'm not working with Arabic languages or anything like that. You know some people use the diacritics as funny characters, and I have to remove those as much as I can. For instance, I gave the "tђє Ŧค๓เℓy --> the Family" conversion in the example, but it seems difficult to convert it completely. However, we can make the conversion "òéışöç->oeisoc" in a simple way. But what is the exact way to do this? Creating arrays and replacing manually? Or does this language have native functions for this? – Kistna 17/6, 2009 at 20:28

Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and not following any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.

You will need to build the mapping of Unicode characters into latin characters which they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode codepoints. But I think the effort for this would be greater than manually building that mapping. Especially if you have a good amount of examples from which you can build your mapping.

To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the latin characters which they resemble.

Examples:

"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any latin character at all and in your example is used to represent "a"

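If you do have example pairs, a first draft of the mapping can be derived from them mechanically. A minimal sketch, assuming each obfuscated example aligns character by character with its plain-text version (the class `LookalikeTable` and its methods are hypothetical names, and the code points in `main` are my reading of the example in the question):

```java
import java.util.HashMap;
import java.util.Map;

final class LookalikeTable {
    private final Map<Character, Character> table = new HashMap<>();

    // Records one mapping for each position where the two strings differ.
    // The first mapping seen for a character wins.
    void learn(String obfuscated, String plain) {
        if (obfuscated.length() != plain.length()) {
            throw new IllegalArgumentException("example pair must align character by character");
        }
        for (int i = 0; i < obfuscated.length(); i++) {
            char from = obfuscated.charAt(i);
            char to = plain.charAt(i);
            if (from != to) {
                table.putIfAbsent(from, to);
            }
        }
    }

    // Applies the learned mappings; unknown characters are left untouched.
    String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char mapped = table.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LookalikeTable t = new LookalikeTable();
        // The pair from the question, written with escapes to avoid source-encoding issues.
        t.learn("t\u0452\u0454 \u0166\u0E04\u0E53\u0E40\u2113y", "the Family");
        // U+0E04 was learned as 'a' and U+2113 as 'l'.
        System.out.println(t.decode("\u0E04\u2113\u2113")); // prints "all"
    }
}
```

The table only gets as good as the examples you feed it, and it inherits their quirks: here Ŧ is learned as an upper-case F because that is how it appears in the one example.
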
Hersch answered 9/9, 2009 at 8:50 Comment(0)

String tested : ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß

Tested :

Output from Apache Commons Lang3 : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý and another issue)
Output from Unidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUYss

The last choice is the best.

Unwished answered 12/4, 2017 at 13:23 Comment(2)

@mehmet Just follow the readme at github.com/xuender/unidecode. It should be something like Unidecode.decode("ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß") after importing the dependency. – Unwished 17/8, 2018 at 8:22

This is an interesting test. But it would be even better if you wrote out which methods from the different libraries you are using! – Chenoweth 15/11, 2021 at 20:2

The original request has been answered already.

However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.

Naive meaning of transliteration: the transliterated string, in its target charset, sounds like the string in its original form. If we want to transliterate any charset to Latin (the English alphabet), then ICU4J (the Java version of the ICU library) will do the job.

Here is the code snippet in java:

    import com.ibm.icu.text.Transliterator; //ICU4J library import

    public static String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
    public static String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

    /**
    * Returns the transliterated string to convert any charset to latin.
    */
    public static String transliterate(String input) {
        Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
        String result = transliterator.transliterate(input);
        return result;
    }

Abm answered 10/11, 2014 at 6:13 Comment(0)

If the need is to convert "òéışöç->oeisoc", you can use this as a starting point:

public class AsciiUtils {
    private static final String PLAIN_ASCII =
      "AaEeIiOoUu"    // grave
    + "AaEeIiOoUuYy"  // acute
    + "AaEeIiOoUuYy"  // circumflex
    + "AaOoNn"        // tilde
    + "AaEeIiOoUuYy"  // umlaut
    + "Aa"            // ring
    + "Cc"            // cedilla
    + "OoUu"          // double acute
    ;

    private static final String UNICODE =
     "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"             
    + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD" 
    + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177" 
    + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
    + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF" 
    + "\u00C5\u00E5"                                                             
    + "\u00C7\u00E7" 
    + "\u0150\u0151\u0170\u0171" 
    ;

    // private constructor: this utility class can't be instantiated
    private AsciiUtils() { }

    // replace accented characters in a string with their ASCII equivalents
    // (characters not listed in UNICODE are copied through unchanged)
    public static String convertNonAscii(String s) {
       if (s == null) return null;
       StringBuilder sb = new StringBuilder();
       int n = s.length();
       for (int i = 0; i < n; i++) {
          char c = s.charAt(i);
          int pos = UNICODE.indexOf(c);
          if (pos > -1){
              sb.append(PLAIN_ASCII.charAt(pos));
          }
          else {
              sb.append(c);
          }
       }
       return sb.toString();
    }

    public static void main(String[] args) {
       String s = 
         "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
       System.out.println(AsciiUtils.convertNonAscii(s));
       // output : 
       // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}

The JDK 1.6 provides the java.text.Normalizer class that can be used for this task.

Kiwi answered 17/6, 2009 at 22:33 Comment(2)

Unfortunately that will not handle ligatures like Æ. – Decay 17/6, 2009 at 23:7

This method is particularly useful if you need to detect and handle classes of diacritics differently (i.e., escaping special characters in LaTeX). – Striped 1/6, 2018 at 13:4

The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, “ß” to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to “B”.

Add to that the fact that Unicode has multiple code points for the same glyphs.

The upshot is that the only way to do this is to create a massive table with each Unicode character and the ASCII character you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".

Here is a tiny excerpt from an app that does this:

switch (c)
{
    case 'A':
    case '\u00C0':  //  À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1':  //  Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2':  //  Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    // and so on for about 20 lines...
        return "A";

    case '\u00C6':  //  Æ LATIN CAPITAL LETTER AE
        return "AE";

    // And so on for pages...
}
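
The normalization shortcut can be sketched as follows, assuming Java's `java.text.Normalizer`; the class name `CompatFolding` is made up. NFKD also unfolds compatibility characters such as the ﬁ ligature (U+FB01) or the script ℓ (U+2113), but letters without any decomposition, like Æ and ß, come out unchanged and still need table entries.

```java
import java.text.Normalizer;

final class CompatFolding {
    // Decomposes to NFKD (canonical plus compatibility decomposition),
    // then removes every character in the Unicode "Mark" category.
    static String foldNfkd(String s) {
        String nfkd = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return nfkd.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(foldNfkd("\uFB01"));  // fi ligature, prints "fi"
        System.out.println(foldNfkd("\u00C5"));  // A with ring, prints "A"
        // Capital AE (U+00C6) has no decomposition, so it is printed as-is.
        System.out.println(foldNfkd("\u00C6"));
    }
}
```

In practice you would run this first and send only the characters that are still non-ASCII afterwards through the hand-built table.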

Decay answered 17/6, 2009 at 18:49 Comment(4)

I agree. You should create a dictionary of conversions specifically for your application and expected audience. For example, for a Spanish-speaking audience I would only translate ÁÉÍÓÚÜÑáéíóúü¿¡ – Jewel 17/6, 2009 at 19:23

Roberto there are thousands of characters and I can't do this manual. – Kistna 17/6, 2009 at 19:58

What human language are you using that has "thousands" of characters? Japanese? What would you expect どうしようとしていますか to be converted to? – Decay 17/6, 2009 at 20:14

The example you've given is not ideal: U+00DF LATIN SMALL LETTER SHARP S "ß" is not the same Unicode letter as U+03B2 GREEK SMALL LETTER BETA "β". – Hersch 9/9, 2009 at 8:45

You could try using Unidecode, which is available as a Ruby gem and as a Perl module on CPAN. Essentially, it works as a huge lookup table, where each Unicode code point maps to an ASCII character or string.

Aristotelian answered 17/6, 2009 at 19:14 Comment(4)

You might be able to get a lookup table from one of these. – Seljuk 17/6, 2009 at 21:28

This is an amazing package, but it transliterates the sound of the character, for example it converts "北" to "Bei" because that is what the character sounds like in Mandarin. I think the questioner wants to convert glyphs to what they visually resemble in English. – Decay 17/6, 2009 at 23:57

It does do that for latin characters, though. â becomes a, et al. @ahmetalpbalkan I agree with Kathy, you could use it as a resource to build your own lookup table, the logic should be pretty simple. Unfortunately there doesn't seem to be a Java version. – Aristotelian 18/6, 2009 at 0:15

@ahmetalpbalkan Here is unidecode for Java. – Sanmicheli 9/7, 2015 at 23:8

There is no easy or general way to do what you want, because it is just your subjective opinion that these letters look like the latin letters you want to convert them to. They are actually separate letters with their own distinct names and sounds, which just happen to superficially look like a latin letter.

If you want that conversion, you have to create your own translation table based on what latin letters you think the non-latin letters should be converted to.

(If you only want to remove diacritical marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However, you describe a more general problem.)

Earthworm answered 27/6, 2009 at 12:4 Comment(1)

+1. Here's a Java version of the 'remove diacritics' question: #1017455; see Michael Borgwardt's and devio's answers – Ultrasonics 29/6, 2009 at 7:45

I'm late to the party, but after facing this issue today, I found this answer to be very good:

String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
    .replaceAll("[^\\p{ASCII}]", "");

Reference: https://stackoverflow.com/a/16283863

Devotion answered 14/8, 2016 at 22:11 Comment(2)

Small warning - it removes U+00DF LATIN SMALL LETTER SHARP S "ß" – Kat 18/1, 2017 at 10:53

And also Æ... Too bad. – Unwished 12/4, 2017 at 12:42

The following class does the trick:

org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter

Hasp answered 26/6, 2017 at 10:50 Comment(0)
