Is there a way to get rid of accents and convert a whole string to regular letters?

Asked 23/7, 2010 at 20:33 Answered 23/8, 2022 at 21:47

324

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one? Example:

Input: orčpžsíáýd

Output: orcpzsiayd

It doesn't need to include all letters with accents like the Russian alphabet or the Chinese one.

Fritillary answered 23/7, 2010 at 20:33 Comment(0)

460

Use java.text.Normalizer to handle this for you.

string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction

This separates all of the accent marks from the base characters. Then you just need to go through the characters and throw out the ones that aren't plain ASCII.

string = string.replaceAll("[^\\p{ASCII}]", "");

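That drops every non-ASCII character, though. If your text contains other Unicode characters that you want to keep, remove only the combining marks instead: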

string = string.replaceAll("\\p{M}", "");

For unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
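A minimal sketch of the two steps combined, with the regex compiled once so it can be reused across many strings. The class name `AccentStripper` and the method name `strip` are only illustrative, they are not part of any library:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class AccentStripper {
    // Compiled once; String.replaceAll would recompile the regex on every call.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String strip(String input) {
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the marks and keep the base letters.
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // The question's sample input, written with escapes so the
        // source file encoding does not matter.
        System.out.println(strip("or\u010Dp\u017Es\u00ED\u00E1\u00FDd")); // orcpzsiayd
    }
}
```

Letters that have no decomposition (such as the Polish L with stroke) pass through unchanged with this pattern.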

Thanks to GarretWilson for the pointer and regular-expressions.info for the great unicode guide.

Melodic answered 23/7, 2010 at 20:38 Comment(8)

Works well but slow for some needs, see my comment lower for faster solution with some limitations which may not be an issue. For a few normalizations I'd definitely use this answer of course, because it's much cleaner, more general and without the need of your own code. – Capias 31/5, 2012 at 10:23

This compiles the regular expression each time, which is fine if you only need it once, but if you need to do this with a lot of text, pre-compiling the regex is a win. – Unitary 3/3, 2013 at 21:49

Note that not all Latin-based letters decompose to ASCII+accents. This will kill eg. "Latin {capital,small} letter l with stroke" used in Polish. – Candler 18/6, 2013 at 9:5

This is a good approach, but removing all non-ASCII characters is overkill and will probably remove things you don't want, as others have indicated. It would be better to remove all Unicode "marks"; including non-spacing marks, spacing/combining marks, and enclosing marks. You can do this with string.replaceAll("\\p{M}", ""). See regular-expressions.info/unicode.html for more information. – Koosis 9/1, 2014 at 0:48

You probably want to use Normalizer.Form.NFKD rather than NFD - NFKD will convert things like ligatures into ascii characters (eg ﬁ to fi), NFD will not do this. – Bosomed 15/2, 2017 at 0:57

@chesterm8, interestingly NFKD is converting "ﬁ" to "fi", but it's not converting "Æ" to "AE". I guess I'll have to bring up the Unicode data to find out why, but it wasn't what I expected. – Koosis 12/10, 2018 at 18:59

@GarretWilson replaceAll("\\p{M}") can only be used with JAVA 8 and on? I was reading on the link regular-expressions.info/unicode.html – Toomey 28/2, 2020 at 2:17

Anyone that wants a Kotlin extension solution: fun String.normalize(): String = Normalizer.normalize(this, Normalizer.Form.NFD).replace(Regex("\\p{M}"), "") – Guth 24/1, 2023 at 19:36

206

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):

    String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
    System.out.println(input);
    // Prints "This is a funky String"

Note:

The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article for Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
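To see why Ø and Ł survive the Normalizer approach, here is a small check (a sketch; `DecompositionCheck` and `decomposes` are made-up names) that asks whether NFD changes a character at all. Characters with no canonical decomposition come back unchanged, so there is no separate mark for a regex to remove:

```java
import java.text.Normalizer;

final class DecompositionCheck {
    // True if NFD rewrites the character (for example into base letter + marks).
    static boolean decomposes(char c) {
        String original = String.valueOf(c);
        String nfd = Normalizer.normalize(original, Normalizer.Form.NFD);
        return !nfd.equals(original);
    }

    public static void main(String[] args) {
        System.out.println(decomposes('\u00E9')); // e with acute: true
        System.out.println(decomposes('\u00D8')); // O with stroke: false
        System.out.println(decomposes('\u0141')); // L with stroke: false
    }
}
```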

Inaudible answered 5/1, 2015 at 23:53 Comment(2)

If you don't want to include the library you can take the two methods involved in that feature easily from source at commons.apache.org/proper/commons-lang/apidocs/src-html/org/… – Monitory 3/5, 2017 at 12:0

As a Dane, the Danish/Norwegian ø just as the French œ and the German/Swedish/Hungarian/Estonian etc. ö originates as a short way to write oe. So depending on your purpose this may be the substitution you want. – Tyne 1/5, 2019 at 9:13

The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:

import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}

Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:

public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    return new String(out, 0, j); // only the first j chars were written
}

This variation combines the correctness of the Normalizer approach with some of the speed of the table approach. On my machine, it is about 4x faster than the accepted answer and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).
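If you would rather mark letters that cannot be reduced to ASCII than drop them silently, a variant along these lines might do. This is only a sketch: the `placeholder` parameter and the class name `AsciiFlattener` are assumptions for illustration, and `Character.isLetter` is a rough test for "a letter that was left over after decomposition":

```java
import java.text.Normalizer;

final class AsciiFlattener {
    static String flattenToAscii(String input, char placeholder) {
        String norm = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(norm.length());
        for (int i = 0, n = norm.length(); i < n; i++) {
            char c = norm.charAt(i);
            if (c <= '\u007F') {
                sb.append(c);            // plain ASCII is kept as-is
            } else if (Character.isLetter(c)) {
                sb.append(placeholder);  // non-ASCII letter: keep its position
            }
            // anything else (combining marks and so on) is dropped
        }
        return sb.toString();
    }
}
```

Using a StringBuilder sidesteps the array-length question entirely, at a small cost in speed. Characters outside the BMP are stored as surrogate pairs, which `Character.isLetter(char)` does not treat as letters, so they are dropped here.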

Unitary answered 3/3, 2013 at 22:9 Comment(6)

out must be resized to match the number of valid characters j before it is used to construct the string object. – Kalina 17/5, 2015 at 16:35

I have an objection to this solution. Imagine input "æøåá". Current flattenToAscii creates result "aa.." where dots represent \u0000. That is not good. First question is - how to represent "unnormalizable" characters? Let's say it will be ?, or we can leave NULL char there, but in any case we have to preserve the correct position of these (just like regex solution does). For this the if in the loop must be something like: if (c <= '\u007F') out[j++] = c; else if (Character.isLetter(c)) out[j++] = '?'; It will slow it down a bit, but it must be correct in the first place. ;-) – Capias 17/8, 2015 at 9:28

Ad my last comment (too bad they can't be longer) - maybe positive take (isLetter) is not the right one, but I didn't find better. I'm not Unicode expert, so I don't know how to better identify the class of the single character that replaces original character. Letters work OK for most applications/usages. – Capias 17/8, 2015 at 9:44

Finally, this solution (with fix) does not produce the same output as regex version. That's because regex version leaves this kind of characters (like ø) there as-is. In this sense this answer at least does not leave any non-ascii characters there (which is expected result) even in corner cases like these. So in the end this seems to be the most correct solution. Of course, with my suggested fix applied, so the positions of the letters are correct, whatever the replacement character (?) is going to be. – Capias 17/8, 2015 at 9:56

You probably want to use Normalizer.Form.NFKD rather than NFD - NFKD will convert things like ligatures into ascii characters (eg ﬁ to fi), NFD will not do this. – Bosomed 15/2, 2017 at 0:57

For us we wanted to remove the character altogether. To ensure there wasn't trailing null characters I removed them with an alternative String constructor: return new String(out, 0, j); – Bottle 24/8, 2018 at 15:14

EDIT: If you're not stuck with Java <6, speed is not critical, and/or a translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.

While this is not a "perfect" solution, it works well when you know the range (in our case Latin1,2), it worked before Java 6 (not a real issue though), and it is much faster than the most suggested version (may or may not be an issue):

/**
 * Mirror of the unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}

Tests on my hardware with a 32-bit JDK show that this converts àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms, while the Normalizer way takes 3.7s (37x slower). If performance matters and you know the input range, this may be for you.
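If typing the table by hand is the obstacle (for a wider range, say), one possible middle ground is to generate the table once at class load with Normalizer and then use the same fast lookup. This is a sketch, not the code measured above: the range U+00C0 to U+1EFF and the names `GeneratedDiacriticTable`, `FIRST` and `LAST` are my own assumptions. Note that it only maps characters whose decomposition starts with an ASCII character, so unlike the hand-written table it leaves characters such as Ø untouched:

```java
import java.text.Normalizer;

final class GeneratedDiacriticTable {
    private static final int FIRST = 0x00C0; // start of the covered range (assumed)
    private static final int LAST = 0x1EFF;  // end of Latin Extended Additional
    private static final char[] TABLE = buildTable();

    // Runs once: for every char in the range, store the first char of its
    // NFD form if that is ASCII, otherwise store the char itself.
    private static char[] buildTable() {
        char[] table = new char[LAST - FIRST + 1];
        for (int cp = FIRST; cp <= LAST; cp++) {
            char original = (char) cp;
            String nfd = Normalizer.normalize(String.valueOf(original), Normalizer.Form.NFD);
            char base = nfd.charAt(0);
            table[cp - FIRST] = (base <= 0x7F) ? base : original;
        }
        return table;
    }

    static String removeDiacritics(String source) {
        char[] result = new char[source.length()];
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            result[i] = (c >= FIRST && c <= LAST) ? TABLE[c - FIRST] : c;
        }
        return new String(result);
    }
}
```

The per-string work stays a plain array lookup, so the cost of Normalizer is paid only while building the table.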

Enjoy :-)

Capias answered 31/5, 2012 at 10:20 Comment(11)

A lot of the slowness of the suggested version is due to the regular expression, not the Normalizer. Using Normalizer but removing the non-ASCII characters 'by hand' is faster, although still not as fast as your version. But it works for all of Unicode instead of just latin1 and latin2. – Unitary 3/3, 2013 at 21:51

I expanded this to work with more characters, pastebin.com/FAAm6a2j, Note it won't work correctly with multichar characters such as Ǆ (DZ). It will only produce 1 character from it. Also my function uses char instead of strings, which is quicker IF you're handling char anyways, so you dont have to convert. – Arnaud 5/3, 2013 at 14:47

Hey I don't understand what are those letters on tab00c0 field stand for? for example "AAAAAAACEEEEIIII" or "lLlNnNnNnnNnOoOo" etc. Never seen them before. Where did you find them? Also Why don't you just use the coresponding codes? – Hollyanne 8/12, 2014 at 4:40

@ThanosF just try to go through the code (with debugger if needed). What this does is for every character in a string: "Is this character between \u00c0 and \u017f? If so, replace it with 7bit ASCII character from the table." Table just covers two encoding pages (Latin 1 and 2) with their 7bit equivalents. So if it's character with code \u00e0 (à) it will take its 7bit approximation from 32nd position of the table (e0-c0=32) - that is "a". Some characters are not letters, those are left there with their code. – Capias 8/12, 2014 at 10:15

Thanks for your explanation. Where can I find those encoding pages so that I can extend this Variable to my language? (Greek) Accepted answer already does the job replacing greek accented letters but I wanted to try your method too and run some benchmarks :) – Hollyanne 8/12, 2014 at 18:34

@ThanosF If you google Unicode table you'll find the table - better yet though go to en.wikipedia.org/wiki/Greek_alphabet#Greek_in_Unicode and read that. It seems that you have two tables to cover that are rather far away from each other. You may create one big table - it would ~8k of characters. That's not the problem, but then you'd potentially replace chars you don't want to define, unless you'd agree on some "don't replace char" (space would be convenient). Only if it's not a space you'd assign it to "one" variable. Or add more ifs with different tables and offsets. :-) – Capias 8/12, 2014 at 20:47

thanks! The second table is ancient Greek that it's not used in every day life so I won't need this – Hollyanne 8/12, 2014 at 20:52

This solution does not convert char "Ệệ" into "Ee". – Ascospore 21/2, 2017 at 5:4

@Ascospore Are these in Latin 1 or 2? It seems these are U+1ec6/7, which is way beyond the range of my solution. So yes, it does not support these and I clearly state so in the answer. – Capias 22/2, 2017 at 6:14

@Capias thanks for answer. It was 0x1EFF, in the range 0x1E00 - 0x1EFF : Latin Extended Additional. So, can we make any modification to make this work with Latin Extended Addtional? – Ascospore 23/2, 2017 at 4:3

@Ascospore You need to define another table and add another if block (3 lines), that should do it. There are probably more sophisticated ways how to design it, but this should cover another continuous section of characters. I leave that exercise to those who need it, of course. :-) – Capias 24/2, 2017 at 23:11

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));

worked for me. The output of the snippet above gives "aee" which is what I wanted, but

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

didn't do any substitution.
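For comparison, here is a small sketch of how the block-based pattern differs from the category-based `\\p{M}` mentioned in other answers. The class and method names are illustrative. The block covers only U+0300 to U+036F, while `\\p{M}` matches every character in a Unicode mark category, including marks from other blocks:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class MarkPatterns {
    // Only the "Combining Diacritical Marks" block, U+0300..U+036F.
    private static final Pattern BLOCK =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    // Every character whose general category is a mark (Mn, Mc, Me).
    private static final Pattern ALL_MARKS = Pattern.compile("\\p{M}+");

    static String stripBlock(String s) {
        return BLOCK.matcher(Normalizer.normalize(s, Normalizer.Form.NFD)).replaceAll("");
    }

    static String stripAllMarks(String s) {
        return ALL_MARKS.matcher(Normalizer.normalize(s, Normalizer.Form.NFD)).replaceAll("");
    }
}
```

For Latin text with ordinary accents the two behave the same. They differ for marks outside that block, for example U+20D7 (combining right arrow above), which only the `\\p{M}` version removes.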

Liquidity answered 19/11, 2010 at 14:2 Comment(4)

Confirming this... normally ASCII works just fine, but I encountered this problem on Linux (64b) with JRockit (1.6.0_29 64b). Can't confirm it with any other setup, can't confirm that corellation, but I can confirm that the other suggested solution worked and for that I vote this one up. :-) (BTW: It did some replacement, but not enough, it changed Ú to U for instance, but not á to a.) – Capias 7/6, 2012 at 13:26

You probably want to use Normalizer.Form.NFKD rather than NFD - NFKD will convert things like ligatures into ascii characters (eg ﬁ to fi), NFD will not do this. – Bosomed 15/2, 2017 at 0:57

@KarolS I don't see either of them containing any accents – Colvert 31/5, 2018 at 5:55

@Colvert A slash across a letter counts as a diacritic: en.wikipedia.org/wiki/Diacritic And if you go by a stricter definition of an "accent" as on that Wikipedia page, then diaeresis is not an accent, so Nico's answer is still wrong. – Peterec 31/5, 2018 at 22:23

Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks

https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."

Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.

Visceral answered 23/7, 2010 at 20:41 Comment(5)

Agreed. For example in swedish: "höra" (hear) -> "hora" (whore) – Moslemism 5/10, 2010 at 7:8

It doesn't matter what they mean. The question is how to remove them. – Melodic 21/10, 2010 at 14:41

Erick: It matters what they're called. If the question asks how to remove accents, and if those aren't accents, then the answer may not be just how to remove all of those things that look like accents. Though this should probably be a comment and not an answer. – Literati 24/10, 2013 at 16:55

I think the normal use case for this is search, particularly search of mixed languages, often with an English keyboard as input, in which case it's better to get false positives than false negatives. – Shelly 19/9, 2014 at 14:28

@Literati Whether it matters or not what they're called, Erick is right, it's not relevant as an answer whatsoever since it doesn't even attempt to address the question asked. Should be a comment. – Upcast 12/3, 2022 at 10:15

I faced the same issue in a string equality check: one of the compared strings contained characters outside the ASCII range (codes 128 and above).

For example, a non-breaking space is hex A0, while a regular space is hex 20. To show spaces in HTML I had used the following spacing entities. Their UTF-8 bytes are: &emsp; is a very wide space {-30, -128, -125}, &ensp; is a somewhat wide space {-30, -128, -126}, &thinsp; is a narrow space {-30, -128, -119}, and a regular (non-HTML) space is {32}.
String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data"; // s2 uses em spaces (U+2003)
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));
Output in Bytes:

S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97] S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]

Use below code for Different Spaces and their Byte-Codes: wiki for List_of_Unicode_characters

String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :"+ spacing_entities);
byte[] byteArray = 
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( s2 ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

➩ ASCII transliterations of Unicode string for Java. unidecode
```
String initials = Unidecode.decode( s2 );
```

➩ using Guava: Google Core Libraries for Java.

String replaceFrom = CharMatcher.whitespace().replaceFrom( s2, " " );

To URL-encode the space, use the Guava library.

String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);

➩ To overcome this problem, use String.replaceAll() with a regular expression.

// \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
s2 = s2.replaceAll("\\p{Zs}", " ");


s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
s2 = s2.replaceAll(" ", " ");

➩ Using java.text.Normalizer.Form. This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 — Unicode Normalization Forms and two methods to access them.
```
s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
```
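The two standard-library approaches above can be compared side by side without any third-party dependency. This is a sketch with made-up names (`SpaceNormalizer`, `viaRegex`, `viaNfkc`); the sample strings use an em space (U+2003) and a non-breaking space (U+00A0):

```java
import java.text.Normalizer;

final class SpaceNormalizer {
    // Replace every Unicode space separator with a plain ASCII space.
    static String viaRegex(String s) {
        return s.replaceAll("\\p{Zs}", " ");
    }

    // NFKC applies compatibility mappings, which turn these spaces into U+0020.
    static String viaNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String plain = "My Sample Space";
        String fancy = "My\u2003Sample\u00A0Space";
        System.out.println(plain.equals(fancy));           // false
        System.out.println(plain.equals(viaRegex(fancy))); // true
        System.out.println(plain.equals(viaNfkc(fancy)));  // true
    }
}
```

NFKC also rewrites other compatibility characters (ligatures, for example), so the regex is the narrower tool if only spaces should change.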

Testing String and outputs on different approaches like ➩ Unidecode, Normalizer, StringUtils.

String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";

// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );

// The following produces: This is a funky String Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");

String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );

Using Unidecode is the best choice. My final code is shown below.

public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My\u2003Sample\u2003Space\u2003Data"; // s2 uses em spaces (U+2003)
    String initials = Unidecode.decode( s2 );
    if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if( s1.equals( initials ) ) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }

}

Absenteeism answered 8/9, 2017 at 13:54 Comment(0)

I suggest Junidecode. It handles not only 'Ł' and 'Ø', but also works well for transcribing from other alphabets, such as Chinese, into the Latin alphabet.

Trudey answered 13/11, 2017 at 15:11 Comment(5)

Looks promising, but I wish this was a more active/maintained project and available on Maven. – Dismiss 7/12, 2018 at 6:29

Thanks @Trudey for sharing this great library – Cervine 1/1, 2022 at 18:36

@Dismiss it's available in maven as well search.maven.org/artifact/net.gcardone.junidecode/junidecode/… – Cervine 1/1, 2022 at 18:37

@Trudey How do I include or import this library into Talend/Java? – Seibel 4/1, 2023 at 15:0

@Trudey The way you import this library into a Talend job is via the tLibraryLoad component. After you connect it to a tJava component, adding the following line in the advanced settings import static net.gcardone.junidecode.Junidecode.*;. From there, you can invoke the method to transliterate your string. – Seibel 5/1, 2023 at 14:42

This solution is already available as StringUtils.stripAccents() in Apache Commons Lang (on the Maven repository), and it works for Ł, as mentioned by @DavidS. But I needed it to work for both Ø and Ł, so I modified it as below. It may be helpful for others too.

Update

This is a modified version of StringUtils.stripAccents(String), which keeps the old functionality and also handles both Ø and Ł.

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    for (int i = 0; i < decomposed.length(); i++) {
        if (decomposed.charAt(i) == '\u0141') {
            decomposed.setCharAt(i, 'L');
        } else if (decomposed.charAt(i) == '\u0142') {
            decomposed.setCharAt(i, 'l');
        }else if (decomposed.charAt(i) == '\u00D8') {
            decomposed.setCharAt(i, 'O');
        }else if (decomposed.charAt(i) == '\u00F8') {
            decomposed.setCharAt(i, 'o');
        }
    }
    // Note that this doesn't correctly remove ligatures...
    return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}

Input string Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø
output string L This is a funky String O o
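If the list of special cases grows, a lookup map may be easier to maintain than a chain of else-if branches, and it also allows multi-character replacements. The following is a sketch of that idea, not part of Commons Lang; the class name `AccentFolder` and the particular mappings (AE, ae, ss) are my own assumptions about what you might want:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class AccentFolder {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Letters that NFD does not decompose, with hand-picked replacements.
    private static final Map<Character, String> EXTRA = new HashMap<>();
    static {
        EXTRA.put('\u0141', "L");  // L with stroke
        EXTRA.put('\u0142', "l");
        EXTRA.put('\u00D8', "O");  // O with stroke
        EXTRA.put('\u00F8', "o");
        EXTRA.put('\u00C6', "AE"); // AE ligature (assumed mapping)
        EXTRA.put('\u00E6', "ae");
        EXTRA.put('\u00DF', "ss"); // sharp s (assumed mapping)
    }

    static String fold(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            String replacement = EXTRA.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        // Finally drop the combining marks produced by NFD.
        return MARKS.matcher(sb).replaceAll("");
    }
}
```

Whether "O" or "OE" is the right replacement for Ø depends on the language and the purpose, so treat the map as configuration rather than a fixed rule.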

Correy answered 21/2, 2022 at 11:13 Comment(3)

not sure what input data you applied, may be Normalizer.Form.NFC, NFKC, NFKD you can also try: like docs.oracle.com/javase/7/docs/api/java/text/… – Correy 19/9, 2022 at 9:14

I applied same input Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø – Westcott 19/9, 2022 at 9:19

One of the best ways, using regex and Normalizer, if you have no library is:

public String flattenToAscii(String s) {
    if (s == null || s.trim().length() == 0)
        return "";
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}

This is more efficient than replaceAll("[^\\p{ASCII}]", "") if you only need to remove diacritics (as in your example).

Otherwise, you have to use the \\p{ASCII} pattern.

Regards.

Regards.

Beberg answered 13/12, 2018 at 8:28 Comment(0)

@David Conrad's solution is the fastest one using the Normalizer that I tried, but it does have a bug. It strips characters which are not accents: for example, Chinese characters and other letters like æ are all stripped. The characters that we want to strip are non-spacing marks, characters which don't take up extra width in the final string. These zero-width characters basically end up combined with some other character. If you can see one isolated as a character, for example like this `, my guess is that it's combined with the space character.

public static String flattenToAscii(String string) {
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Size the buffer from the normalized string, which can be longer than the input.
    char[] out = new char[norm.length()];

    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);

        //Log.d(TAG,""+c);
        //by Ricardo, modified the character check for accents, ref: https://mcmap.net/q/100875/-regex-what-is-incombiningdiacriticalmarks
        if (type != Character.NON_SPACING_MARK){
            out[j] = c;
            j++;
        }
    }
    //Log.d(TAG,"normalized string:"+norm+"/"+new String(out, 0, j));
    // Only the first j chars were written; without the bounds the result would end in NUL padding.
    return new String(out, 0, j);
}

Hypertension answered 9/7, 2015 at 4:31 Comment(0)

I think the best solution is converting each char to HEX and replacing it with another HEX. That's because there are two ways of typing Unicode:

Composite Unicode
Precomposed Unicode

For example "Ồ" written by Composite Unicode is different from "Ồ" written by Precomposed Unicode. You can copy my sample chars and convert them to see the difference.

In Composite Unicode, "Ồ" is combined from 2 char: Ô (U+00d4) and ̀ (U+0300)
In Precomposed Unicode, "Ồ" is single char (U+1ED2)

I developed this feature for some banks to convert the info before sending it to the core banking system (which usually doesn't support Unicode), and faced this issue when end users used multiple Unicode typing methods to input the data. So I think converting to HEX and replacing it is the most reliable way.
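The difference between the two forms can be checked with java.text.Normalizer, which may also be an alternative to a hand-made HEX table. This is a sketch with illustrative names (`CompositionDemo`, `sameAfterNfc`, `stripMarks`), using the code points listed above:

```java
import java.text.Normalizer;

final class CompositionDemo {
    // Compare two strings after bringing both to the composed form (NFC).
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Decompose fully (NFD), then drop all combining marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String composite = "\u00D4\u0300"; // O-circumflex followed by combining grave
        String precomposed = "\u1ED2";     // single code point for the same letter
        System.out.println(composite.equals(precomposed));        // false
        System.out.println(sameAfterNfc(composite, precomposed)); // true
        System.out.println(stripMarks(composite));                // O
        System.out.println(stripMarks(precomposed));              // O
    }
}
```

Normalizing first means both typing methods end up as the same sequence before any replacement is applied.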

Arsenal answered 11/5, 2020 at 10:44 Comment(0)

A fast and safer way

public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";
    
    int len = str.length();
    StringBuilder sb
        = new StringBuilder(len);
    
    //iterate string codepoints
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount
            = Character.charCount(codePoint);
        
        if (charCount > 1) {
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        }
        else if (codePoint <= 127) {
            sb.append((char)codePoint);
            i++;
            continue;
        }
        
        sb.append(
            java.text.Normalizer
                .normalize(
                    Character.toString((char)codePoint),
                    java.text.Normalizer.Form.NFD)
                        .charAt(0));
        i++;
    }
    
    return sb.toString();
}

Sap answered 19/12, 2021 at 19:18 Comment(0)

Faced the same issue, here's solution using Kotlin extension

   val String.stripAccents: String
    get() = Regex("\\p{InCombiningDiacriticalMarks}+")
        .replace(
            Normalizer.normalize(this, Normalizer.Form.NFD),
            ""
        )

usage

val textWithoutAccents = "some accented string".stripAccents

Plastometer answered 23/8, 2022 at 21:47 Comment(0)

-2

In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase and trim(). Then I call this function:

fun stripAccents(s: String): String {
    val chars: CharArray = s.toCharArray()
    val sb = StringBuilder(s)
    var cont = 0

    while (chars.size > cont) {
        var c2: String = chars[cont].toString()
        // These are my needs; to convert other accents, just add new entries here.
        c2 = c2.replace("Ã", "A")
        c2 = c2.replace("Õ", "O")
        c2 = c2.replace("Ç", "C")
        c2 = c2.replace("Á", "A")
        c2 = c2.replace("Ó", "O")
        c2 = c2.replace("Ê", "E")
        c2 = c2.replace("É", "E")
        c2 = c2.replace("Ú", "U")

        // Write the (possibly replaced) character back at the same position.
        sb.setCharAt(cont, c2.single())
        cont++
    }

    return sb.toString()
}

To use this function, call it like this:

     var str: String
     str = editText.text.toString() //get the text from EditText
     str = str.toUpperCase().trim()

     str = stripAccents(str) //call the function

Maidy answered 4/8, 2019 at 13:38 Comment(0)
