Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars
Asked Answered

12

91

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caron, umlaut, etc.) and their "simple" base characters.

For example:

ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.

  1. I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.

  2. Purpose: to allow easy searching for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also keep Bjorn_Borg, so I can find it if someone enters Bjorn and not Björn.

Wethington answered 21/9, 2009 at 7:3 Comment(6)
It depends on what environment you're programming in, though you'll probably have to maintain some sort of mapping table manually. So, which language are you using?Boult
Please beware that some letters like ñ (en.wikipedia.org/wiki/%C3%91) should not have their diacritics stripped for searching purposes. Google correctly differentiates between the Spanish "ano" (anus) and "año" (year). So if you really want a good search engine, you cannot rely on basic diacritical mark removal.Gardol
@Eduardo: In a given context that might not matter. Using the example the OP gave, searching for a person's name in a multi-national context you actually want the search not to be too accurate.Tactless
(Accidentally sent previous) There is room though for mapping diacritics to their phonetic equivalents to improve phonetic searching. i.e ñ => ni will yield better results if the underlying search engine supports phonetic-based (e.g soundex) searchingTactless
A use case where changing año to ano etc. is acceptable is stripping non-base64 chars for URLs, IDs, etc.Confutation
StringUtils from Apache commons-lang has the method stripAccents, and it works very well. commons.apache.org/proper/commons-lang/apidocs/org/apache/…Leonardaleonardi
87

I have done this recently in Java:

public static final Pattern DIACRITICS_AND_FRIENDS
    = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}

This will do as you specified:

stripDiacritics("Björn")  = Bjorn

but it will fail on, for example, Białystok, because the ł character is not a base letter plus a combining diacritic; it is a letter of its own.
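For reference, here is the snippet above wrapped into a self-contained class (the wrapper class name is mine), showing both the success and the failure case:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class DiacriticsStripper {
    // Combining marks, modifier letters/symbols, and Hebrew points U+0591..U+05C7
    public static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    public static String stripDiacritics(String str) {
        // NFD decomposition splits precomposed letters into base letter + combining marks
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("Björn"));     // Bjorn
        System.out.println(stripDiacritics("Białystok")); // Białystok (ł survives)
    }
}
```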

If you want to have a full-blown string simplifier, you will need a second cleanup round for some more special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it will give you an idea of how to extend it. The ImmutableMap is just a simple class from google-collections (now Guava).

public class StringSimplifier {
    public static final char DEFAULT_REPLACE_CHAR = '-';
    public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
    private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()

        //Remove crap strings with no semantics
        .put(".", "")
        .put("\"", "")
        .put("'", "")

        //Keep relevant characters as separation
        .put(" ", DEFAULT_REPLACE)
        .put("]", DEFAULT_REPLACE)
        .put("[", DEFAULT_REPLACE)
        .put(")", DEFAULT_REPLACE)
        .put("(", DEFAULT_REPLACE)
        .put("=", DEFAULT_REPLACE)
        .put("!", DEFAULT_REPLACE)
        .put("/", DEFAULT_REPLACE)
        .put("\\", DEFAULT_REPLACE)
        .put("&", DEFAULT_REPLACE)
        .put(",", DEFAULT_REPLACE)
        .put("?", DEFAULT_REPLACE)
        .put("°", DEFAULT_REPLACE) //Remove ?? is diacritic?
        .put("|", DEFAULT_REPLACE)
        .put("<", DEFAULT_REPLACE)
        .put(">", DEFAULT_REPLACE)
        .put(";", DEFAULT_REPLACE)
        .put(":", DEFAULT_REPLACE)
        .put("_", DEFAULT_REPLACE)
        .put("#", DEFAULT_REPLACE)
        .put("~", DEFAULT_REPLACE)
        .put("+", DEFAULT_REPLACE)
        .put("*", DEFAULT_REPLACE)

        //Replace non-diacritics with their equivalent characters
        .put("\u0141", "l") // Ł as in Białystok
        .put("\u0142", "l") // ł as in Białystok
        .put("ß", "ss")
        .put("æ", "ae")
        .put("ø", "o")
        .put("©", "c")
        .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
        .put("\u00F0", "d")
        .put("\u0110", "d")
        .put("\u0111", "d")
        .put("\u0189", "d")
        .put("\u0256", "d")
        .put("\u00DE", "th") // thorn Þ
        .put("\u00FE", "th") // thorn þ
        .build();


    public static String simplifiedString(String orig) {
        String str = orig;
        if (str == null) {
            return null;
        }
        str = stripDiacritics(str);
        str = stripNonDiacritics(str);
        if (str.length() == 0) {
            // Ugly special case to work around non-existing empty strings
            // in Oracle. Store original crapstring as simplified.
            // It would return an empty string if Oracle could store it.
            return orig;
        }
        return str.toLowerCase();
    }

    private static String stripNonDiacritics(String orig) {
        StringBuilder ret = new StringBuilder();
        String lastchar = null;
        for (int i = 0; i < orig.length(); i++) {
            String source = orig.substring(i, i + 1);
            String replace = NONDIACRITICS.get(source);
            String toReplace = replace == null ? String.valueOf(source) : replace;
            if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                toReplace = "";
            } else {
                lastchar = toReplace;
            }
            ret.append(toReplace);
        }
        if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
            ret.deleteCharAt(ret.length() - 1);
        }
        return ret.toString();
    }

    /*
        Special regular expression character ranges relevant for simplification:
        - InCombiningDiacriticalMarks: diacritic marks used in many languages
        - IsLm: Letter, Modifier (see http://www.fileformat.info/info/unicode/category/Lm/list.htm)
        - IsSk: Symbol, Modifier (see http://www.fileformat.info/info/unicode/category/Sk/list.htm)
        - U+0591 to U+05C7: range for Hebrew diacritics (niqqud)
          (see the official Unicode chart: https://www.unicode.org/charts/PDF/U0590.pdf)
    */
    public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile(
        "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+"
    );


    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}
Corrosion answered 21/9, 2009 at 7:43 Comment(9)
what about characters like ╨ ?Cathycathyleen
They will be passed through; likewise all Japanese characters etc.Corrosion
thanks Andreas. Is there a way to remove these? Characters like らがなを覚男 (or others) will be included in the generated string and these will basically break the output. I'm trying to use the simplifiedString output as a URL generator as StackOverflow does for its Questions' URLs.Cathycathyleen
As I said in the question comment. You cannot rely on basic diacritical mark removal if you want a good search engine.Gardol
Thanks Andreas, works like a charm! (tested on r̀r̂r̃r̈rʼŕřt̀t̂ẗţỳỹẙyʼy̎ýÿŷp̂p̈s̀s̃s̈s̊sʼs̸śŝŞşšd̂d̃d̈ďdʼḑf̈f̸g̀g̃g̈gʼģq́ĝǧḧĥj̈jʼḱk̂k̈k̸ǩl̂l̃l̈Łłẅẍc̃c̈c̊cʼc̸Çççćĉčv̂v̈vʼv̸b́b̧ǹn̂n̈n̊nʼńņňñm̀m̂m̃m̈m̊m̌ǵß) :-)Avifauna
Great, thanks, really useful, but for me worked only this way "("\\p{InCombiningDiacriticalMarks}+");" . Keeping the other brackets would crash!! But for me did the deal, thanks again.Catkin
Note that none of the Unicode normalization forms (NFC, NFKC, NFD, NFKD) will help transliterate "Bjørn", since the LATIN SMALL LETTER O WITH STROKE character (U+00F8) is not considered a combination. For that, you'll probably need a real transliterator, such as ICU.Kierkegaardian
This doesn't work for Hebrew: en.wikipedia.org/wiki/Diacritic#Hebrew Example: "בְּרֵאשִׁית" doesn't become "בראשית" . It just stays the exact same. Even if it did convert, I don't know how you would handle it, because here it's considered as more characters that are displayed (length is 11 instead of 6).Prepare
public static final Pattern ALL_DIACRITICAL_MARKS = Pattern.compile("[\\p{Mn}\\p{Mc}]"); is also a great, more generic solution that also covers Arabic, Devanagari, Cyrillic, Greek, Syriac, and Thaana diacriticsCorrosion
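To illustrate that last comment, a minimal sketch (class and method names are mine) using the more generic `\p{Mn}`/`\p{Mc}` character classes after NFD decomposition:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class MarkStripper {
    // Mn (nonspacing marks) and Mc (spacing combining marks) cover combining
    // diacritics for Arabic, Devanagari, Cyrillic, Greek, Hebrew, and more
    private static final Pattern ALL_MARKS = Pattern.compile("[\\p{Mn}\\p{Mc}]+");

    public static String stripMarks(String s) {
        // NFD splits precomposed characters into base letter + combining marks
        return ALL_MARKS.matcher(Normalizer.normalize(s, Normalizer.Form.NFD)).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("Björn")); // Bjorn
        System.out.println(stripMarks("año"));   // ano
    }
}
```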
25

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).

Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert it to a byte array.

These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.

Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);
Map<CollationKey, String> dictionary = new TreeMap<CollationKey, String>();
dictionary.put(c.getCollationKey("Björn"), "Björn");
...
CollationKey query = c.getCollationKey("bjorn");
System.out.println(dictionary.get(query)); // --> "Björn"

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
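A small sketch of the byte-array route mentioned above (class name is mine): `CollationKey.toByteArray()` produces a representation whose byte-wise order matches the collator's order, so keys can be stored and compared outside the JVM.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    static Collator primaryCollator() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY); // only base letters matter: case and accents ignored
        return c;
    }

    public static void main(String[] args) {
        Collator c = primaryCollator();
        CollationKey key = c.getCollationKey("Björn");
        // equal at PRIMARY strength despite the case and umlaut differences
        System.out.println(key.compareTo(c.getCollationKey("bjorn")) == 0); // true
        // byte-wise comparable form, suitable for storage in a database index
        byte[] stored = key.toByteArray();
        System.out.println(stored.length > 0); // true
    }
}
```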

Pinnati answered 21/9, 2009 at 14:32 Comment(4)
sounds interesting, but can you search your collation key in the database with select * from person where collated_name like 'bjo%' ??Corrosion
very nice, did not know about that. will try this out.Corrosion
On Android the CollationKeys can not be used as prefixes for database searches. A collation key of the string a turns into bytes 41, 1, 5, 1, 5, 0, yet the string ab turns into bytes 41, 43, 1, 6, 1, 6, 0. These byte sequences don't appear as is in full words (the byte array for collation key a does not appear in the byte array for collation key for ab)Gravois
@GrzegorzAdamHankiewicz After some testing, I see that the byte arrays can be compared, but don't form prefixes, as you noted. So, to do a prefix query like bjo%, you'd need to perform a range query where the collators are >= bjo and < bjp (or whatever the next symbol would be in that locale, and there's no programmatic way to determine that).Pinnati
18

It's part of Apache Commons Lang as of ver. 3.1.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

Hypogeal answered 14/10, 2012 at 10:22 Comment(2)
For Ø it gives again ØMicrofiche
Thanks Mike for pointing that out. The method only handles accents. The result of "ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ" is "n n n n n n n n n ɲ ƞ ᶇ ɳ ȵ"Hypogeal
11

You could use the Normalizer class from java.text:

System.out.println(new String(Normalizer.normalize("ń ǹ ň ñ ṅ ņ ṇ ṋ", Normalizer.Form.NFKD).getBytes("ascii"), "ascii"));

But there is still some work to do, since Java does strange things with unconvertable Unicode characters (it does not ignore them, and it does not throw an exception). But I think you could use that as a starting point.
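To make those "strange things" concrete: `String.getBytes` silently substitutes '?' for every character that US-ASCII cannot encode, so leftovers survive as question marks. A small sketch (class and method names are mine):

```java
import java.text.Normalizer;

public class AsciiRoundTrip {
    static String toAsciiLossy(String s) throws Exception {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // getBytes substitutes '?' (0x3F) for every character US-ASCII cannot
        // encode, rather than dropping it or throwing an exception
        return new String(decomposed.getBytes("ascii"), "ascii");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toAsciiLossy("ń")); // "n?": the combining acute becomes '?'
        System.out.println(toAsciiLossy("ł")); // "?": ł has no decomposition at all
    }
}
```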

Incorrigible answered 21/9, 2009 at 7:31 Comment(1)
This will not work for non-ASCII diacritics, such as in Russian (they have diacritics too), and it will furthermore butcher all Asian strings. Do not use. Instead of converting to ASCII, use the \\p{InCombiningDiacriticalMarks} regexp as in the answer at stackoverflow.com/questions/1453171/…Corrosion
10

There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".

Here's a discussion and implementation of diacritic marker removal using Perl.

These existing SO questions are related:

Gameto answered 21/9, 2009 at 7:13 Comment(0)
6

Please note that not all of these marks are just "marks" on some "normal" character that you can remove without changing the meaning.

In Swedish, å, ä and ö are true and proper first-class characters, not "variants" of some other character. They sound different from all other characters, they sort differently, and they change the meaning of words ("mätt" and "matt" are two different words).

Yokoyokohama answered 1/3, 2010 at 15:46 Comment(1)
Although correct, this is more of a comment than an answer to the question.Biodynamics
3

In German, you do not want to remove the diacritics from umlauts (ä, ö, ü). Instead, they are replaced by two-letter combinations (ae, oe, ue). For instance, Björn should be written as Bjoern (not Bjorn) to get the correct pronunciation.

For that, I would rather use a hardcoded mapping, where you can define the replacement rule individually for each special character group.
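A minimal sketch of such a hardcoded mapping (the table and names are illustrative, not from any library), to be applied before any generic diacritic stripping:

```java
import java.util.Map;

public class GermanTransliterator {
    // German-specific rules: umlauts become two-letter combinations, ß becomes ss
    private static final Map<Character, String> GERMAN = Map.of(
            'ä', "ae", 'ö', "oe", 'ü', "ue",
            'Ä', "Ae", 'Ö', "Oe", 'Ü', "Ue",
            'ß', "ss");

    public static String transliterate(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(GERMAN.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Björn heißt Björn")); // Bjoern heisst Bjoern
    }
}
```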

Durance answered 8/2, 2013 at 10:16 Comment(0)
2

Unicode has specific diacritic characters (and precomposed characters that include them), and a string can be decomposed so that the base characters and the diacritics are separated. Then, you can just remove the diacritics from the string and you're basically done.

For more information on normalization, decompositions and equivalence, see The Unicode Standard at the Unicode home page.

However, how you can actually achieve this depends on the framework/OS/... you're working on. If you're using .NET, you can use the String.Normalize method accepting the System.Text.NormalizationForm enumeration.

Farmann answered 21/9, 2009 at 7:10 Comment(2)
This is the method I use in .NET, though I still have to map some characters manually. They're not diacritics, but digraphs. Similar problem though.Boult
Convert to normalisation form "D" (i.e. decomposed) and take the base character.Hafler
2

The easiest way (to me) would be to maintain a sparse mapping array which simply changes your Unicode code points into displayable strings.

Such as:

start    = 0x00C0
size     = 23
mappings = {
    "A","A","A","A","A","A","AE","C",
    "E","E","E","E","I","I","I", "I",
    "D","N","O","O","O","O","O"
}
start    = 0x00D8
size     = 6
mappings = {
    "O","U","U","U","U","Y"
}
start    = 0x00E0
size     = 23
mappings = {
    "a","a","a","a","a","a","ae","c",
    "e","e","e","e","i","i","i", "i",
    "d","n","o","o","o","o","o"
}
start    = 0x00F8
size     = 6
mappings = {
    "o","u","u","u","u","y"
}
: : :

The use of a sparse array will allow you to efficiently represent replacements even when they lie in widely spaced sections of the Unicode table. String replacements will allow arbitrary sequences to replace your diacritics (such as the æ grapheme becoming ae).

This is a language-agnostic answer so, if you have a specific language in mind, there will be better ways (although they'll all likely come down to this at the lowest levels anyway).
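In Java, the sparse-range idea above could be sketched like this (names are mine; only the Latin-1 blocks from the answer are included):

```java
import java.util.ArrayList;
import java.util.List;

public class SparseMapper {
    // One contiguous run of replacements starting at a given code point
    private static final class Block {
        final int start;
        final String[] mappings;
        Block(int start, String[] mappings) { this.start = start; this.mappings = mappings; }
    }

    private static final List<Block> BLOCKS = new ArrayList<>();
    static {
        BLOCKS.add(new Block(0x00C0, new String[] {
                "A", "A", "A", "A", "A", "A", "AE", "C",
                "E", "E", "E", "E", "I", "I", "I", "I",
                "D", "N", "O", "O", "O", "O", "O"}));
        BLOCKS.add(new Block(0x00D8, new String[] {"O", "U", "U", "U", "U", "Y"}));
        BLOCKS.add(new Block(0x00E0, new String[] {
                "a", "a", "a", "a", "a", "a", "ae", "c",
                "e", "e", "e", "e", "i", "i", "i", "i",
                "d", "n", "o", "o", "o", "o", "o"}));
        BLOCKS.add(new Block(0x00F8, new String[] {"o", "u", "u", "u", "u", "y"}));
    }

    public static String map(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String replacement = null;
            for (Block b : BLOCKS) {
                if (cp >= b.start && cp < b.start + b.mappings.length) {
                    replacement = b.mappings[cp - b.start];
                    break;
                }
            }
            // Unmapped code points pass through unchanged
            sb.append(replacement != null ? replacement : new String(Character.toChars(cp)));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(map("Bjørn Æneas")); // Bjorn AEneas
    }
}
```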

Mcgurn answered 21/9, 2009 at 7:41 Comment(1)
Adding all the possible strange characters there is not an easy task. When doing this for only a few characters, it's a good solution.Biodynamics
2

In Windows and .NET, I just convert using string encoding. That way I avoid manual mapping and coding.

Try to play with string encoding.

Improbity answered 21/9, 2009 at 14:41 Comment(1)
Can you elaborate on string encoding? For instance, with a code example.Bakker
2

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.

For instance, in German, when replacing the eszett ("ß"), some people might use "ß" as-is, while others might use "ss". Or, replacing an umlauted o with "o" or "oe". Ideally, any solution you come up with should include both.

Perturbation answered 21/9, 2009 at 14:58 Comment(0)
0

For future reference, here is a C# extension method that removes accents.

public static class StringExtensions
{
    public static string RemoveDiacritics(this string str)
    {
        return new string(
            str.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != 
                            UnicodeCategory.NonSpacingMark)
                .ToArray());
    }
}
static void Main()
{
    var input = "ŃŅŇ ÀÁÂÃÄÅ ŢŤţť Ĥĥ àáâãäå ńņň";
    var output = input.RemoveDiacritics();
    Debug.Assert(output == "NNN AAAAAA TTtt Hh aaaaaa nnn");
}
Schuman answered 26/9, 2009 at 17:6 Comment(0)
