How do I remove diacritics (accents) from a string in .NET?

I'm trying to convert some strings that are in French Canadian, and basically I'd like to be able to take out the French accent marks while keeping the letters (e.g. convert é to e, so crème brûlée would become creme brulee).

What is the best method for achieving this?

Bedpost answered 30/10, 2008 at 2:7 Comment(4)
A warning: This approach might work in some specific cases, but in general you cannot just remove diacritical marks. In some cases and some languages this might change the meaning of the text. You don't say why you want to do this; if it is for the sake of comparing strings or searching you are most probably better off by using a unicode-aware library for this.Personalism
Since most of the techniques to achieve this rely on Unicode normalization, this document describing the standard may be useful to read: unicode.org/reports/tr15Typist
I think that Azure team fixed this issue, I tried to upload a file with this name "Mémo de la réunion.pdf" and the operation succeeded.Gasaway
In our case the limitation comes from ltree data type from the Postgres database. Where ltree only allows [a-zA-Z0-9_]. And for our case, indeed it is necessary to get a speedy search going.Scarlett
660

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

    for (int i = 0; i < normalizedString.Length; i++)
    {
        char c = normalizedString[i];
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder
        .ToString()
        .Normalize(NormalizationForm.FormC);
}

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into its constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's a little involved, but then you're dealing with a genuinely complicated problem.
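
If you're curious what that decomposition step actually produces, here is a small sketch (my own illustration, not from the blog post) that prints each code point and its Unicode category after Normalize; the combining accents show up as separate NonSpacingMark characters:

using System;
using System.Globalization;
using System.Text;

class NormalizeDemo
{
    static void Main()
    {
        string decomposed = "crème brûlée".Normalize(NormalizationForm.FormD);

        // "è" (U+00E8) decomposes to "e" (U+0065) followed by the combining
        // grave accent (U+0300), which is reported as NonSpacingMark.
        foreach (char c in decomposed)
        {
            Console.WriteLine($"U+{(int)c:X4} {CharUnicodeInfo.GetUnicodeCategory(c)}");
        }
    }
}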

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.

Cartercarteret answered 30/10, 2008 at 2:29 Comment(24)
Thanks for this, marked as the answer. Basically my application needed to take a title of a section for a website, and change that to a "viewname" in which our flash application could relate it to our navigation bar. this does precisely what i need.Bedpost
Perfect solution ... far better than Regular expression based clean up like this myownpercept.com/2009/06/replacing-special-chars-string-csharpMugger
Why is it necessary to perform canonical composition on the result: sb.ToString().Normalize(NormalizationForm.FormC) ? Having discarded the non-spacing marks, can you not just return sb.ToString() ? Or am I missing something here?Sponsor
@SimonTewsi one thing that immediately occurs to me is that it means you get Hangul syllables rather than conjoining Jamo, which, as well as being required for NFC, also have better font support in practice.Amaliaamalie
This is wrong for German: the characters ä, ö and ü should be latinized as ae, oe and ue, and not as a, o, u ...Atbara
Also, Polish letter ł is ignored.Cylindroid
Both of michkap's blog links are dead.Gradation
Also Norse ø is ignoredCelebrated
Does this depend on PC personalization settings? On my english desktop this method works. For input ä, ö, ü, ø I get an output a, o, u, o.Goatee
@a.farkas2508: Yes, but that's incorrect. ä should be ae, ö should be oe, and ü should be ue, but it actually is a, o, u ...Atbara
@StefanSteiger Please describe the reason why ä should be ae instead of a. Shouldn't æ get decomposed to ae?Footwork
@IllidanS4: It's a common, very old convention that when you 'latinize' the German characters ä or ö or ü, you use the character without ¨ and append an e; ä becomes ae, ö becomes oe, and ü becomes ue. Likewise, when you encounter these "combination vowels" in a German text, they're spoken as ä ö ü. æ probably should get decomposed to ae, too. On the other hand, whatever NormalizationForm.FormC is ;) Perhaps changing the form will help.Atbara
@StefanSteiger You know, in Czech there are letters like áčěů, which we usually "latinize" to aceu, even though it sounds different and may cause confusion in words like "hrábě" /hra:bje/, "hrabě" /hrabje/, and "hrabe" /hrabe/. To me, it seems that the deletion of diacritics is a purely graphical matter, independent of the phonetics or history of the letter. Letters like ä ö ü got created by adding a superscript "e" to the base letters, thus the "ae" decomposition makes sense historically. It depends on the goal - to remove the graphical marks, or to decompose the letter to ASCII characters.Footwork
This function is language-agnostic. It doesn't know whether the string is in German or in some other language. If we take into account that it's okay to replace ö with oe in a German text, but it doesn't make any sense to do it with Turkish, then we'll see that without detecting the language this problem isn't really solvable.Templas
So, this solution doesn't work for Norwegian ø, Polish ł, Turkish ı, Azeri ə, etc. Basically, for any letters without separable diacritics.Templas
@thorn: Yep, and it's wrong for German - unless you consider ä to a "graphically correct" ;) But it might be a good way to normalize a Turkish i.Atbara
Another non working example: Volaða (album name 'Volaða Land' from artist Draugsól). The ð doesn't get reduced to o.Wystand
don't forget to add using System; using System.Globalization; using System.Text; using System.Text.RegularExpressions;Hasty
I don't think the goal is to have a perfectly speakable way of writing it, but rather to have something that will get through most applications. I use this for automation, mainly for account creation and names with accents and weird characters. Active Directory and a lot of applications out there don't take those characters especially well, so when creating any account I normalize them to something that will be more "normal" for an account name, for example.Orinasal
For some reason, String.Normalize does not work on production builds 🤐Krystenkrystin
See my answer below for a string extension method that is much more explicit about how/what it converts: https://mcmap.net/q/48754/-how-do-i-remove-diacritics-accents-from-a-string-in-netKarolekarolina
@StefanSteiger this function is not meant to do any soft of "latinizing" or "transliterating" of characters. It is simply meant to do what it is named - "RemoveDiacritics". What you are expecting it to do is an entirely different problem. CIRCLE's answer below may help you out more.Momism
after normalizedString we can just return in one fluent syntax like this return normalizedString.Aggregate(new StringBuilder(), (c,e) => CharUnicodeInfo.GetUnicodeCategory(e) == UnicodeCategory.NonSpacingMark ? c : c.Append(e)) .ToString().Normalize(NormalizationForm.FormC);Teletype
@StefanSteiger's comment can be read multiple ways and initially threw me off. To clarify, the code fragment replaces ä, ö and ü with just a, o and u (which is what you may or may not need). And Stefan argues that for German, this subtitution is not correct, it should be ae, ue and oe.Evangelia
245

this did the trick for me...

string accentedStr;
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

quick&short!
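
A more complete sketch of the same idea, for reference (assumptions: accentedStr is just an example value, and on .NET Core / .NET 5+ the legacy code page has to be registered first via the System.Text.Encoding.CodePages package, as noted in the comments below):

using System.Text;

class EncodingTrickDemo
{
    static void Main()
    {
        // Required on .NET Core / .NET 5+ so that "ISO-8859-8" resolves.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string accentedStr = "crème brûlée";   // hypothetical input
        byte[] tempBytes = Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
        string asciiStr = Encoding.UTF8.GetString(tempBytes);

        System.Console.WriteLine(asciiStr);    // "creme brulee", per the answer above
    }
}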

Biblioclast answered 18/1, 2010 at 14:16 Comment(11)
why do you use ISO-8859-8 to encode the string and UTF8 to decode it? wouldn't it be better to also use ISO-8859-8 to decode the byte array?Piercy
@DominikvonWeber this is just small "issue", but the solution is much better than the one most voted. this one at least is working for chars like ł, ą, ó etc. :)Tauten
I do like this solution and it works well for Windows Store Apps. However, it doesn't work for Windows Phone Apps as encoding ISO-8859-8 doesn't seem to be available. Is there another encoding that can be used instead?Amalekite
I have seen this code as follows: tempBytes = Encoding.GetEncoding(1251).GetBytes(s); asciiStr = Encoding.ASCII.GetString(tempBytes); It seems to me the initial encoding is strongly tied to the default encoding the system is using (unfortunately, this may differ between dev and prod environments).Hewie
This will work for most common characters, but many special characters like « » and (as a single character) will get altered in the process which ain't the case with the accepted solution.Conidiophore
Note that this doesn't work on .NET Core on Linux: System.ArgumentException: 'ISO-8859-8' is not a supported encoding name.Rooster
If you're on .NET Core, install System.Text.Encoding.CodePages from nuget, then call this to register the provider: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); - once you've done this, you can make use of ISO-8859-8Tito
For UTF8 this works, but how do we handle an ANSI-encoded source?Choli
Great, can someone advise me how to use it on a StringBuilder? Tried to use as follow but don't seems to work. StringBuilder sb = new StringBuilder(); byte[] tempBytes; tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(sb); string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes); sb.AppendLine(MyDataHeader); sb.AppendLine(MyDataString);Sulphate
Most characters work, but german "ß" becomes ?.Karmakarmadharaya
Why would anyone sensible encode a string using ISO-8859-8 and then use UTF-8 to decode the same binary data? Are you trying to introduce a mojibake?Footwork
51

I needed something that converts all major Unicode characters, and the voted answer left a few out, so I've ported CodeIgniter's convert_accented_characters($str) to C# in a way that is easily customisable:

using System;
using System.Text;
using System.Collections.Generic;

public static class Strings
{
    static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
    {
        { "äæǽ", "ae" },
        { "öœ", "oe" },
        { "ü", "ue" },
        { "Ä", "Ae" },
        { "Ü", "Ue" },
        { "Ö", "Oe" },
        { "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
        { "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
        { "Б", "B" },
        { "б", "b" },
        { "ÇĆĈĊČ", "C" },
        { "çćĉċč", "c" },
        { "Д", "D" },
        { "д", "d" },
        { "ÐĎĐΔ", "Dj" },
        { "ðďđδ", "dj" },
        { "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
        { "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
        { "Ф", "F" },
        { "ф", "f" },
        { "ĜĞĠĢΓГҐ", "G" },
        { "ĝğġģγгґ", "g" },
        { "ĤĦ", "H" },
        { "ĥħ", "h" },
        { "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
        { "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
        { "Ĵ", "J" },
        { "ĵ", "j" },
        { "ĶΚК", "K" },
        { "ķκк", "k" },
        { "ĹĻĽĿŁΛЛ", "L" },
        { "ĺļľŀłλл", "l" },
        { "М", "M" },
        { "м", "m" },
        { "ÑŃŅŇΝН", "N" },
        { "ñńņňʼnνн", "n" },
        { "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
        { "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
        { "П", "P" },
        { "п", "p" },
        { "ŔŖŘΡР", "R" },
        { "ŕŗřρр", "r" },
        { "ŚŜŞȘŠΣС", "S" },
        { "śŝşșšſσςс", "s" },
        { "ȚŢŤŦτТ", "T" },
        { "țţťŧт", "t" },
        { "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
        { "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
        { "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" },
        { "ýÿŷỳỹỷỵй", "y" },
        { "В", "V" },
        { "в", "v" },
        { "Ŵ", "W" },
        { "ŵ", "w" },
        { "ŹŻŽΖЗ", "Z" },
        { "źżžζз", "z" },
        { "ÆǼ", "AE" },
        { "ß", "ss" },
        { "IJ", "IJ" },
        { "ij", "ij" },
        { "Œ", "OE" },
        { "ƒ", "f" },
        { "ξ", "ks" },
        { "π", "p" },
        { "β", "v" },
        { "μ", "m" },
        { "ψ", "ps" },
        { "Ё", "Yo" },
        { "ё", "yo" },
        { "Є", "Ye" },
        { "є", "ye" },
        { "Ї", "Yi" },
        { "Ж", "Zh" },
        { "ж", "zh" },
        { "Х", "Kh" },
        { "х", "kh" },
        { "Ц", "Ts" },
        { "ц", "ts" },
        { "Ч", "Ch" },
        { "ч", "ch" },
        { "Ш", "Sh" },
        { "ш", "sh" },
        { "Щ", "Shch" },
        { "щ", "shch" },
        { "ЪъЬь", "" },
        { "Ю", "Yu" },
        { "ю", "yu" },
        { "Я", "Ya" },
        { "я", "ya" },
    };

    public static char RemoveDiacritics(this char c){
        foreach(KeyValuePair<string, string> entry in foreign_characters)
        {
            if(entry.Key.IndexOf (c) != -1)
            {
                return entry.Value[0];
            }
        }
        return c;
    }

    public static string RemoveDiacritics(this string s) 
    {
        //StringBuilder sb = new StringBuilder ();
        string text = "";


        foreach (char c in s)
        {
            int len = text.Length;

            foreach(KeyValuePair<string, string> entry in foreign_characters)
            {
                if(entry.Key.IndexOf (c) != -1)
                {
                    text += entry.Value;
                    break;
                }
            }

            if (len == text.Length) {
                text += c;  
            }
        }
        return text;
    }
}
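
If lookup speed matters, the comments below suggest reshaping the table into a per-character dictionary with TryGetValue. A rough sketch of that idea (RemoveDiacriticsFast and BuildCharacterMap are hypothetical names; they would live inside the same Strings class, declared after the foreign_characters field):

static readonly Dictionary<char, string> characterMap = BuildCharacterMap();

static Dictionary<char, string> BuildCharacterMap()
{
    // Expand the multi-character keys above into one entry per character, keeping the
    // first mapping when a character occurs in more than one key (this mirrors the
    // "first match wins" behaviour of the original loops).
    var map = new Dictionary<char, string>();
    foreach (var entry in foreign_characters)
        foreach (char ch in entry.Key)
            if (!map.ContainsKey(ch))
                map[ch] = entry.Value;
    return map;
}

public static string RemoveDiacriticsFast(this string s)
{
    var sb = new StringBuilder(s.Length);
    foreach (char c in s)
        sb.Append(characterMap.TryGetValue(c, out var replacement) ? replacement : c.ToString());
    return sb.ToString();
}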

Usage

// for strings
"crème brûlée".RemoveDiacritics (); // creme brulee

// for chars
"Ã"[0].RemoveDiacritics (); // A
Chesterfieldian answered 14/12, 2015 at 16:46 Comment(13)
Your implementation does the job, but should be improved before being used in production code.Katerinekates
github.com/apache/lucenenet/blob/master/src/…Krystenkrystin
why not simply replace this if (entry.Key.IndexOf(c) != -1) into if (entry.Key.Contains(c))Jonas
Why not reuse RemoveDiacritics(char c) in the loop, why not to use StringBuilder. I upvote for complex dictionary and working solution, but the code could be much simplerJonas
This solution assumes that you know there is a diacritic to handle... Assuming you are running this against a list of many names, you will require additional logic to identify those names with a diacritic to throw into this solution.Groundling
I used @Alexander's link to make an answer below: https://mcmap.net/q/48754/-how-do-i-remove-diacritics-accents-from-a-string-in-netKarolekarolina
I don't understand why there's so much hoop jumping to use { "äæǽ", "ae" } instead of { "ä", "ae" }, { "æ", "ae" }, { "ǽ", "ae" } and just calling if (foreign_characters.TryGetValue(...)) .... You've completely defeated the purpose of the index that the dictionary already has.Finstad
"foreign"? How rude! Maybe foreign for you! Also different languages have different rules.Meade
Can I use this in my personal project?Hyrcania
@Hyrcania sure ;)Chesterfieldian
@BaconBits Even better would be to use char, rather than string, as the key type.Nette
Best solution for me. Thanks.Achondrite
@PierreArnaud Improved how?Vincevincelette
40

The accepted answer is totally correct, but nowadays it should be updated to use the Rune class instead of CharUnicodeInfo, as C# and .NET have changed the recommended way to analyse strings in recent versions (the Rune class was added in .NET Core 3.0).

The following code for .NET 5+ is now recommended, as it goes further for non-Latin characters:

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString.EnumerateRunes())
    {
        var unicodeCategory = Rune.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
Consume answered 17/5, 2021 at 12:40 Comment(3)
this works surprisingly well. Thanks.Flaunch
This one works perfectly with accents and ~Kassia
Note that this works differently in different environments. å was correctly changed to a on my Windows dev machine. But once deployed to docker, å wasn't changed by this function, and remained as å instead of a. I suspect the same problem will apply to the accepted answer, but the answer from azrafe7 worked perfectly in both environments.Mucor
37

In case someone is interested, I was looking for something similar and ended up writing the following:

public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }
    string result = stringBuilder.ToString();
    return String.Join("_", result.Split(new char[] { '_' }
        , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
}
Chimerical answered 23/4, 2009 at 8:32 Comment(4)
You should preallocate the StringBuilder buffer to the name.Length to minimize memory allocation overhead. That last Split/Join call to remove sequential duplicate _ is interesting. Perhaps we should just avoid adding them in the loop. Set a flag for the previous character being an _ and not emit one if true.Milreis
2 really good points, I'll rewrite it if I ever get the time to go back to this portion of code :)Chimerical
Nice. In addition to IDisposable's comment, we should probably check for c < 128 to ensure that we don't pick up any UTF chars, see here.Cart
Or probably more efficiently c < 123. See ASCII.Cart
16

I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

  • Normalizing to form D splits characters like è into an e and a nonspacing `
  • From this, the nonspacing characters are removed
  • The result is normalized back to form C (I'm not sure if this is necessary)

Code:

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }

    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}
Liechtenstein answered 31/10, 2012 at 10:5 Comment(0)
15

The Greek (ISO) code page can do it.

The information about this code page is in System.Text.Encoding.GetEncodings(). Learn more at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has codepage 28597 and name iso-8859-7.

Go to the code... \o/

string text = "Você está numa situação lamentável";

string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"

string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"

So, write this function...

public string RemoveAcentuation(string text)
{
    return
        System.Web.HttpUtility.UrlDecode(
            System.Web.HttpUtility.UrlEncode(
                text, Encoding.GetEncoding("iso-8859-7")));
}

Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597), because the first is the name and the second the code page of the encoding.
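
On .NET Core / .NET 5+, a usage sketch might look like this (assumption: like other legacy code pages, iso-8859-7 only resolves after registering the provider from the System.Text.Encoding.CodePages package):

using System.Text;
using System.Web;

class GreekCodePageDemo
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = "Você está numa situação lamentável";
        string withoutAccents = HttpUtility.UrlDecode(
            HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7")));

        // "Voce esta numa situacao lamentavel", per the example above
        System.Console.WriteLine(withoutAccents);
    }
}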

Thailand answered 5/8, 2016 at 1:46 Comment(2)
That's brilliant! Short and efficient!Ploy
Great stuff. Almost all characters I tested passed (äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔÓŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ ţŢşŞçÇ øı). The problems were found only with ß and ə, which are converted to ?, but such exceptions can always be handled in a separate way. Before putting this into production, the test should ideally be made against all Unicode areas containing letters with diacritics.Patriarchate
13

TL;DR - C# string extension method

I think the best solution to preserve the meaning of the string is to convert the characters instead of stripping them, which is well illustrated in the example crème brûlée to crme brle vs. creme brulee.

I checked out Alexander's comment above and saw the Lucene.Net code is Apache 2.0 licensed, so I've modified the class into a simple string extension method. You can use it like this:

var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength); 
// "creme brulee"

The function is too long to post in a StackOverflow answer (~139k characters of 30k allowed lol) so I made a gist and attributed the authors:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
/// 
/// <ul>
///   <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
///   <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
///   <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
///   <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
///   <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
///   <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
///   <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
///   <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
///   <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
///   <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
///   <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
///   <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
///   <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
///   <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
///   <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
///   <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&amp;agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// </summary>
    /// <param name="input">     The string of characters to fold </param>
    /// <param name="length">    The length of the folded return string </param>
    /// <returns> length of output </returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}

Hope that helps someone else, this is the most robust solution I've found!

Karolekarolina answered 27/6, 2019 at 19:33 Comment(3)
Caveats: 1) The concept is locale-dependent. For example, "ä" could be "a" or "aa". 2) Misnamed/misdescribed: The result is not necessarily only from the C0 Controls and Basic Latin block. It only converts Latin letters and some symbol variants to "equivalents". (Of course, one could take another pass to replace or remove non-C0 Controls and Basic Latin block characters afterward.) But this will do what it does well.Lohengrin
Thanks for putting this out there. I believe you have a trailing } bracket at the end of the file.Shore
Thanks for this, I suggest adding an early return for string.IsNullOrWhiteSpace(input) and initializing the capacity of the StringBuilder to input.LengthAlisun
10

Same as the accepted answer but faster, using Span instead of StringBuilder.
Requires .NET Core 3.1 or a newer .NET.

static string RemoveDiacritics(string text) 
{
    ReadOnlySpan<char> normalizedString = text.Normalize(NormalizationForm.FormD);
    int i = 0;
    Span<char> span = normalizedString.Length < 1000
        ? stackalloc char[normalizedString.Length]
        : new char[normalizedString.Length];

    foreach (char c in normalizedString)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            span[i++] = c;
    }

    // Use only the filled part of the buffer; the decomposed form can be longer than the input.
    return new string(span.Slice(0, i)).Normalize(NormalizationForm.FormC);
}

Also, this is extensible for additional character replacements, e.g. for the Polish Ł:

span[i++] = c switch
{
    'Ł' => 'L',
    'ł' => 'l',
    _ => c
};

A small note: stack allocation (stackalloc) is faster than heap allocation (new), and it creates less work for the garbage collector. 1000 is a threshold to avoid allocating large buffers on the stack, which could cause a StackOverflowException. While 1000 is a pretty safe value, in most cases 10000 or even 100000 would also work (100k chars allocates up to 200 kB on the stack, while the default stack size is 1 MB). However, 100k looks a bit dangerous to me.

Luminescence answered 21/4, 2021 at 6:9 Comment(0)
8

It's funny that such a question can get so many answers and yet none fit my requirements :) There are so many languages around that a fully language-agnostic solution is, AFAIK, not really possible; as others have mentioned, FormC and FormD have their issues.

Since the original question was related to French, the simplest working answer is indeed

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

1251 should be replaced by the code page of the input language.

This however replaces only one character with one character. Since I am also working with German as input, I did a manual conversion:

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a no-go here: it is much slower than any char/string manipulation.

I also have a very simple method to remove spaces:

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }

Finally, I use a combination of all 3 extensions above:

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();            
        return keepSpace ? str : str.RemoveSpace();
    }

And a small (not exhaustive) unit test for it, which passes successfully.

    [TestMethod()]
    public void LatinizeAndConvertToASCIITest()
    {
        string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
        string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
        string actual = europeanStr.LatinizeAndConvertToASCII();
        Assert.AreEqual(expected, actual);
    }
Herodotus answered 6/2, 2017 at 13:19 Comment(0)
6

For simply removing French Canadian accent marks as the original question asked, here's an alternate method that uses a regular expression instead of hardcoded conversions and For/Next loops. Depending on your needs, it could be condensed into a single line of code; however, I added it to an extensions class for easier reusability.

Visual Basic

Imports System.Text
Imports System.Text.RegularExpressions

Public MustInherit Class StringExtension
    Public Shared Function RemoveDiacritics(Text As String) As String
        Return New Regex("\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), String.Empty)
    End Function
End Class

Implementation

    Private Shared Sub DoStuff()
        MsgBox(StringExtension.RemoveDiacritics(inputString))
    End Sub

c#

using System.Text;
using System.Text.RegularExpressions;

namespace YourApplication
{
    public abstract class StringExtension
    {
        public static string RemoveDiacritics(string Text)
        {
            return new Regex(@"\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), string.Empty);
        }
    }
}

Implementation

        private static void DoStuff()
        {
            MessageBox.Show(StringExtension.RemoveDiacritics(inputString));
        }

Input:    äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔÓŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ ţŢşŞçÇ øı

Output: aacdeeillnoorrstuuyz AACDEEILLNOORRSTUUYZ OUE łŁđĐ tTsScC øı

I included characters that wouldn't be converted to help visualize what happens when unexpected input is received.

If you need it to also convert other types of characters such as the Polish ł and Ł, then depending on your needs, consider incorporating this answer (.NET Core friendly) that uses CodePagesEncodingProvider into your solution.

Lim answered 21/6, 2022 at 18:48 Comment(1)
I really wonder why I'm the first to vote for this. It comes down to just one line of code. I didn't need to add those namespaces as they are probably available by default.Outstanding
5

For anyone who finds Lucene.Net overkill for removing diacritics, I managed to find this small library that does ASCII transliteration for you.

https://github.com/anyascii/anyascii
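
A minimal usage sketch; I'm assuming from the project README that the .NET package (AnyAscii on NuGet) exposes a Transliterate() string extension, so verify against the current package before relying on it:

using AnyAscii; // assumed namespace, per the project README

class AnyAsciiDemo
{
    static void Main()
    {
        // Transliterate() is the extension method documented in the README (assumption).
        System.Console.WriteLine("crème brûlée".Transliterate()); // expected: "creme brulee"
    }
}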

Periodicity answered 19/8, 2022 at 8:54 Comment(1)
ok, this answer is way under-appreciated. the best one if you need to support many different scripts - awesome!Bearing
4

This is how I replace diacritic characters with non-diacritic ones in all my .NET programs.

C#:

//Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter 'é' is substituted by an 'e'
public string RemoveDiacritics(string s)
{
    string normalizedString = null;
    StringBuilder stringBuilder = new StringBuilder();
    normalizedString = s.Normalize(NormalizationForm.FormD);
    int i = 0;
    char c = '\0';

    for (i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().ToLower();
}

VB .NET:

'Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter "é" is substituted by an "e"'
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char

    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString().ToLower()
End Function
Riffe answered 1/8, 2013 at 18:55 Comment(0)
3

THIS IS THE VB VERSION (works with Greek):

Imports System.Text

Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function
Dewar answered 28/7, 2010 at 13:21 Comment(1)
Might be an old answer, but why are you using separate lines for variable declaration and first assignment?Facies
3

Popping this library here in case you haven't already considered it. It looks like there is a full range of unit tests with it.

https://github.com/thomasgalliker/Diacritics.NET

Perron answered 21/5, 2017 at 21:10 Comment(0)
2

This code worked for me:

var updatedText = new string(text.Normalize(NormalizationForm.FormD)
     .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
     .ToArray());

However, please don't do this with names. It's not only an insult to people with umlauts/accents in their name, it can also be dangerously wrong in certain situations (see below). There are alternative spellings for such names, rather than just removing the accent.

Furthermore, it's simply wrong and dangerous, e.g. if the user has to provide their name exactly as it appears on their passport.

For example, my name is written Zuberbühler, and in the machine-readable part of my passport you will find Zuberbuehler. By removing the umlaut, the name will not match either part. This can lead to issues for the users.

You should rather disallow umlauts/accents in a name input form, so that users can write their name without the umlaut or accent in the correct way themselves.

A practical example: if the web service to apply for ESTA (https://www.application-esta.co.uk/special-characters-and) used the above code instead of transforming umlauts correctly, the ESTA application would either be refused or the traveller would have problems with American border control when entering the States.

Another example would be flight tickets. Assuming you have a flight ticket booking web application, the user provides their name with an accent, and your implementation just removes the accents and then uses the airline's web service to book the ticket! Your customer may not be allowed to board since the name does not match any part of his/her passport.

Millner answered 3/9, 2020 at 17:45 Comment(1)
This wouldn't work for Korean, FormC is needed.Luminescence
1

Try the HelperSharp package.

There is a method RemoveAccents:

 public static string RemoveAccents(this string source)
 {
     //8 bit characters 
     byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

     // 7 bit characters
     string t = Encoding.ASCII.GetString(b);
     Regex re = new Regex("[^a-zA-Z0-9]=-_/");
     string c = re.Replace(t, " ");
     return c;
 }
Piddock answered 3/5, 2013 at 2:25 Comment(0)
1

You can use the string extension from the MMLib.Extensions NuGet package:

using MMLib.RapidPrototyping.Generators;
public void ExtensionsExample()
{
  string target = "aácčeéií";
  Assert.AreEqual("aacceeii", target.RemoveDiacritics());
} 

NuGet page: https://www.nuget.org/packages/MMLib.Extensions/ CodePlex project site: https://mmlib.codeplex.com/

Whitmer answered 30/12, 2013 at 10:25 Comment(1)
No source code; the NuGet package points to an invalid URL, hosted at CodePlex.Postglacial
1
Imports System.Text
Imports System.Globalization

 Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)  
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function
Blameless answered 22/6, 2015 at 13:34 Comment(1)
Using NFD instead of NFC would cause changes far beyond those requested.Amaliaamalie
1

What this person said:

Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

It actually splits the likes of å, which is one character (code point 00E5, not 0061 plus the combining mark 030A, which would look the same), into an a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving only the a.

Necromancy answered 11/12, 2015 at 17:9 Comment(0)
1

I really like the concise and functional code provided by azrafe7. So, I have changed it a little bit to convert it to an extension method:

public static class StringExtensions
{
    public static string RemoveDiacritics(this string text)
    {
        const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";

        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        return Encoding.ASCII.GetString(
            Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
    }
}
Pelias answered 14/2, 2017 at 18:54 Comment(1)
This is the only method that works with all Polish diacritics. The accepted answer doesn't work with the Ł and ł characters.Spadiceous
-3

Not having enough reputation, apparently I cannot comment on Alexander's excellent link. Lucene appears to be the only solution that works in reasonably generic cases.

For those wanting a simple copy-paste solution, here it is, leveraging code in Lucene:

string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüāăčĐęğıŁłńŌōřŞşšźžșțệủ";

Console.WriteLine(Lucene.latinizeLucene(testbed));

AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

//////////

public static class Lucene
{
    // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
    // idea: https://mcmap.net/q/48754/-how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
    public static string latinizeLucene(string arg)
    {
        char[] argChar = arg.ToCharArray();

        // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
        char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();

        int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);

        string ret = new string(resultChar);
        ret = ret.Substring(0, outputPos);

        return ret;
    }

    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// <para/>
    /// @lucene.internal
    /// </summary>
    /// <param name="input">     The characters to fold </param>
    /// <param name="inputPos">  Index of the first character to fold </param>
    /// <param name="output">    The result of the folding. Should be of size >= <c>length * 4</c>. </param>
    /// <param name="outputPos"> Index of output where to put the result of the folding </param>
    /// <param name="length">    The number of characters to fold </param>
    /// <returns> length of output </returns>
    private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
    {
        int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos)
        {
            char c = input[pos];

            // Quick test: if it's not in range then just keep current character
            if (c < '\u0080')
            {
                output[outputPos++] = c;
            }
            else
            {
                switch (c)
                {
                    case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                    case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                    case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                    case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                    case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                    case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                    case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
                    case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
                    case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
                    case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                    case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
                    case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                    case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                    case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                    case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
                    case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
                    case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
                    case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                    case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                    case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
                    case '\uFF21': // A  [FULLWIDTH LATIN CAPITAL LETTER A]
                        output[outputPos++] = 'A';
                        break;
                    case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                    case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                    case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                    case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                    case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                    case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                    case '\u0101': // ā  [LATIN SMALL LETTER A WITH MACRON]
                    case '\u0103': // ă  [LATIN SMALL LETTER A WITH BREVE]
                    case '\u0105': // ą  [LATIN SMALL LETTER A WITH OGONEK]
                    case '\u01CE': // ǎ  [LATIN SMALL LETTER A WITH CARON]
                    case '\u01DF': // ǟ  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E1': // ǡ  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FB': // ǻ  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0201': // ȁ  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                    case '\u0203': // ȃ  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                    case '\u0227': // ȧ  [LATIN SMALL LETTER A WITH DOT ABOVE]
                    case '\u0250': // ɐ  [LATIN SMALL LETTER TURNED A]
                    case '\u0259': // ə  [LATIN SMALL LETTER SCHWA]
                    case '\u025A': // ɚ  [LATIN SMALL LETTER SCHWA WITH HOOK]
                    case '\u1D8F': // ᶏ  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                    case '\u1D95': // ᶕ  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                    case '\u1E01': // ạ  [LATIN SMALL LETTER A WITH RING BELOW]
                    case '\u1E9A': // ả  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                    case '\u1EA1': // ạ  [LATIN SMALL LETTER A WITH DOT BELOW]
                    case '\u1EA3': // ả  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                    case '\u1EA5': // ấ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA7': // ầ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA9': // ẩ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAB': // ẫ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAD': // ậ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAF': // ắ  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB1': // ằ  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB3': // ẳ  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB5': // ẵ  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB7': // ặ  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u2090': // ₐ  [LATIN SUBSCRIPT SMALL LETTER A]
                    case '\u2094': // ₔ  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                    case '\u24D0': // ⓐ  [CIRCLED LATIN SMALL LETTER A]
                    case '\u2C65': // ⱥ  [LATIN SMALL LETTER A WITH STROKE]
                    case '\u2C6F': // Ɐ  [LATIN CAPITAL LETTER TURNED A]
                    case '\uFF41': // a  [FULLWIDTH LATIN SMALL LETTER A]
                        output[outputPos++] = 'a';
                        break;
                    case '\uA732': // Ꜳ  [LATIN CAPITAL LETTER AA]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'A';
                        break;
                    case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                    case '\u01E2': // Ǣ  [LATIN CAPITAL LETTER AE WITH MACRON]
                    case '\u01FC': // Ǽ  [LATIN CAPITAL LETTER AE WITH ACUTE]
                    case '\u1D01': // ᴁ  [LATIN LETTER SMALL CAPITAL AE]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'E';
                        break;
                    case '\uA734': // Ꜵ  [LATIN CAPITAL LETTER AO]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'O';
                        break;
                    case '\uA736': // Ꜷ  [LATIN CAPITAL LETTER AU]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'U';
                        break;

        // etc. etc. etc.
        // see link above for complete source code
        // 
        // unfortunately, postings are limited, as in
        // "Body is limited to 30000 characters; you entered 136098."

                    [...]

                    case '\u2053': // ⁓  [SWUNG DASH]
                    case '\uFF5E': // ~  [FULLWIDTH TILDE]
                        output[outputPos++] = '~';
                        break;
                    default:
                        output[outputPos++] = c;
                        break;
                }
            }
        }
        return outputPos;
    }
}
Valvular answered 13/4, 2019 at 21:19 Comment(0)
