Java change áéőűú to aeouu [duplicate]

Possible Duplicates:
Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars
Is there a way to get rid of accents and convert a whole string to regular letters?

How can I do this? Thanks for the help.

Corfam answered 8/11, 2010 at 8:10 Comment(3)
duplicate: #1453671Anastomosis
See #1453671Sanies
the question is closed, ask a new questionCacology
C
152

I think your question is the same as existing ones, and hence the answer is also the same:

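// Requires: import java.text.Normalizer;
// NFD splits each accented letter into a base letter plus combining marks;
// the regex then drops every character that is not ASCII.
String convertedString =
       Normalizer
           .normalize(input, Normalizer.Form.NFD)
           .replaceAll("[^\\p{ASCII}]", "");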

Example Code:

final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
    Normalizer
        .normalize(input, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "")
);

Output:

This is a funky String

Cacology answered 8/11, 2010 at 8:17 Comment(9)
fortunately I just had to copy and paste it from a previous question (including the first paragraph) :-)Cacology
Sorry, but it's on Android, not plain Java; the Normalizer class is not available on Android systems.Corfam
OK, then maybe you should ask it again in a separate question, this time mentioning that you are talking about android.Cacology
I think it's worth pointing out that the Normalizer class is part of the Android SDK since API 9.Unsustainable
Funnily Google seems to prefer the duplicates in search results, always.Hadley
That's a great answer, but it needs Java 6 :(Columbine
@Columbine true, but Java 6 has been around for 10 years, and even Java 7 was deprecated years ago. Time to upgrade? :-)Cacology
For the solution proposed, the output for "øóöë" is "ooe", and should be "oooe"Doggett
@Doggett "ooe" is the expected output for this approach. It works by decomposing a letter with a diacritic into two code points, e.g. ä -> a + ¨, and then removing all non-ASCII characters. But some letters, such as ø, have no such decomposition in Unicode: ø is its own letter and its own code point rather than o plus a combining mark, so the entire letter is non-ASCII and gets removed.Hornet
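For letters that NFD leaves intact, one option is to map them explicitly before normalizing. This is only a minimal sketch, assuming a short hand-picked replacement list is enough for your data; the class name AsciiFolder, the method fold, and the chosen letters (ø, đ, ł, ß) are illustrative and not part of the answer above. The source is kept ASCII-only via Unicode escapes so it compiles regardless of file encoding.

```java
import java.text.Normalizer;

// Hypothetical helper: pre-maps letters that have no canonical decomposition,
// then applies the NFD + ASCII filter from the answer above.
final class AsciiFolder {

    static String fold(String input) {
        String mapped = input
            .replace("\u00F8", "o").replace("\u00D8", "O")   // o with stroke
            .replace("\u0111", "d").replace("\u0110", "D")   // d with stroke
            .replace("\u0142", "l").replace("\u0141", "L")   // l with stroke
            .replace("\u00DF", "ss");                        // sharp s
        // Decompose the remaining accented letters and drop the combining marks
        // (along with any other non-ASCII character).
        return Normalizer.normalize(mapped, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Input is o-stroke, o-acute, o-diaeresis, e-diaeresis.
        System.out.println(fold("\u00F8\u00F3\u00F6\u00EB")); // prints "oooe"
    }
}
```

Anything not covered by the list and not decomposable is still dropped silently, so the list would need extending for other alphabets.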
M
12

You can use java.text.Normalizer to separate base letters and diacritics, then remove the latter via a regexp:

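import java.text.Normalizer;
import java.text.Normalizer.Form;

// Decompose to NFD, then delete the combining marks (U+0300 to U+036F).
public static String stripDiacriticas(String s) {
    return Normalizer.normalize(s, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}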
Menstrual answered 8/11, 2010 at 8:15 Comment(1)
I used something similar that did the job: Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(nfdNormalizedString).replaceAll("");Enriqueenriqueta
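The precompiled-pattern variant from the comment can be sketched as a small utility class. The class name DiacriticStripper and the method strip are made up for illustration. Compiling the pattern once avoids recompiling the regex on every call, which String.replaceAll would otherwise do.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical utility: same technique as the answer, with the regex compiled once.
final class DiacriticStripper {

    // Matches runs of characters in the Combining Diacritical Marks block.
    private static final Pattern MARKS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String strip(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(nfd).replaceAll("");
    }

    public static void main(String[] args) {
        // Input is the string from the question title, written with escapes.
        System.out.println(strip("\u00E1\u00E9\u0151\u0171\u00FA")); // prints "aeouu"
    }
}
```

Unlike the "[^\\p{ASCII}]" filter, this keeps non-ASCII characters that have no decomposition (ø, for example, passes through unchanged) instead of deleting them.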
T
10

First - you shouldn't. These symbols carry special phonetic properties which should not be ignored.

The way to convert them is to create a Map that holds each pair:

Map<Character, Character> map = new HashMap<Character, Character>();
map.put('á', 'a');
map.put('é', 'e');
//etc..

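and then loop over the chars in the string, building a new string from map.get(currentChar) and keeping the original character whenever the map has no entry for it (map.get returns null in that case).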
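A minimal sketch of that loop, with two assumptions that go beyond the snippet above: the map values are Strings rather than Characters, so that one letter can expand to two (such as the German "ae" mentioned in the comments), and unmapped characters are copied through unchanged. The class name MapReplacer and the handful of entries are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example of the map-based approach; the table is deliberately tiny.
final class MapReplacer {

    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00E1', "a");  // a with acute
        REPLACEMENTS.put('\u00E9', "e");  // e with acute
        REPLACEMENTS.put('\u0151', "o");  // o with double acute
        REPLACEMENTS.put('\u0171', "u");  // u with double acute
        REPLACEMENTS.put('\u00FA', "u");  // u with acute
        REPLACEMENTS.put('\u00E4', "ae"); // a with diaeresis, German-style expansion
    }

    static String replace(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c); // no mapping: keep the character as it is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("B\u00E4r")); // prints "Baer"
    }
}
```

The table has to be maintained by hand, but it gives full control over language-specific replacements that Normalizer cannot express.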

Tomboy answered 8/11, 2010 at 8:12 Comment(10)
+1 for you shouldn'tCacology
another +1 for shouldn't. A replacement for 'ä' in German would be "ae" (surprise: two chars...) and I bet there are a lot more examples in other languages.Woodworm
@Andreas true, I guess that would call for a locale-specific Normalizer function (good luck with that :-)).Cacology
There are plenty of reasons why you would do this, e.g. if you want to store a file on disk but the filename contains these characters. NTFS (like most other FS) won't allow that.Heptagon
Are you sure? I haven't had problems with special-symbol file names recentlyTomboy
@Tomboy are you sure? Or in other words, have you tried every possible Unicode character on all filesystems Java supports? I wouldn't take the risk...Heptagon
I think I wanted this for sorting a string collection.Corfam
Another use case (though probably ultimately futile): detecting profanity filter dodging.Astrix
If you want to put them in a decently looking url then you should.Doggett
Always ignore diacritics in search fields. Users get annoyed when they have to type a special character, because otherwise they can't find what they want.Nosography
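For the sorting use case raised in the comments, stripping the accents may not be necessary at all: java.text.Collator can compare strings while treating accent differences as insignificant. A rough sketch, assuming English collation rules are acceptable for the data; the class name AccentInsensitiveSort and the method sorted are made up for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical example: sort without modifying the strings themselves.
final class AccentInsensitiveSort {

    static List<String> sorted(List<String> words) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength compares base letters only; accent and case
        // differences are not treated as significant.
        collator.setStrength(Collator.PRIMARY);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // "\u00E9clair" starts with e-acute; plain String ordering would put it after "zebra".
        System.out.println(sorted(Arrays.asList("zebra", "\u00E9clair", "apple")));
    }
}
```

The original strings stay intact, which also sidesteps the objection that the accents carry meaning.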
