Ignoring accented letters in string comparison

6

170

I need to compare 2 strings in C# and treat accented letters the same as non-accented letters. For example:

string s1 = "hello";
string s2 = "héllo";

s1.Equals(s2, StringComparison.InvariantCultureIgnoreCase);
s1.Equals(s2, StringComparison.OrdinalIgnoreCase);

These 2 strings need to be the same (as far as my application is concerned), but both of these statements evaluate to false. Is there a way in C# to do this?

Gifford answered 11/12, 2008 at 15:57 Comment(0)
282

FWIW, knightfor's answer below (as of this writing) should be the accepted answer.

Here's a function that strips diacritics from a string:

static string RemoveDiacritics(string text)
{
  string formD = text.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  foreach (char ch in formD)
  {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
    if (uc != UnicodeCategory.NonSpacingMark)
    {
      sb.Append(ch);
    }
  }

  return sb.ToString().Normalize(NormalizationForm.FormC);
}

More details on MichKap's blog (RIP...).

The principle is that it turns 'é' into two successive chars: 'e' followed by a combining acute accent. It then iterates through the chars and skips the diacritics.

"héllo" becomes "he<acute>llo", which in turn becomes "hello".

Debug.Assert("hello"==RemoveDiacritics("héllo"));
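To see the decomposition step on its own, here is a minimal sketch that prints the UTF-16 code units of a string after FormD normalization. The class and method names (DecompositionDemo, DescribeFormD) are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Text;

static class DecompositionDemo
{
    // Normalizes the input to FormD (canonical decomposition) and lists
    // each resulting UTF-16 code unit in U+XXXX notation.
    public static string DescribeFormD(string text)
    {
        string formD = text.Normalize(NormalizationForm.FormD);
        return string.Join(" ", formD.Select(ch => "U+" + ((int)ch).ToString("X4")));
    }
}
```

Calling DecompositionDemo.DescribeFormD("\u00E9") (that is, "é") gives "U+0065 U+0301": the base letter 'e' followed by the combining acute accent, which GetUnicodeCategory reports as NonSpacingMark.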

Note: Here's a more compact .NET4+ friendly version of the same function:

static string RemoveDiacritics(string text)
{
  return string.Concat( 
      text.Normalize(NormalizationForm.FormD)
      .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch)!=
                                    UnicodeCategory.NonSpacingMark)
    ).Normalize(NormalizationForm.FormC);
}
Blucher answered 15/12, 2008 at 16:6 Comment(3)
How to do it in .NET Core, since it does not have string.Normalize? - Albertina
Thanks for this, I wish I could upvote more than once! However, it doesn't handle all accented letters; for example, ð, ħ and ø are not converted to o, h and o respectively. Is there any way to handle these as well? - Austinaustina
@AvrohomYisroel The "ð" is "Latin Small Letter Eth", which is a separate letter, not an "o-with-accent" or "d-with-accent". The others are "Latin Small Letter H With Stroke" and "Latin Small Letter O With Stroke", which may also be considered separate letters. - Gaussmeter
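For letters like these, which have no canonical decomposition, one possible workaround is an explicit substitution table applied alongside the diacritic stripping. The sketch below is only an illustration: the class name, the method name and especially the mapping choices (for example ð to "d") are assumptions, not a standard, and should be adjusted to what your application considers equivalent.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class Folding
{
    // Hypothetical substitution table; every entry is an assumption
    // about what the application wants, not a Unicode rule.
    private static readonly Dictionary<char, string> Substitutions =
        new Dictionary<char, string>
        {
            { '\u00F8', "o" }, { '\u00D8', "O" }, // ø, Ø
            { '\u00F0', "d" }, { '\u00D0', "D" }, // ð, Ð
            { '\u0127', "h" }, { '\u0126', "H" }  // ħ, Ħ
        };

    public static string RemoveDiacriticsWithFallback(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text.Normalize(NormalizationForm.FormD))
        {
            // Drop combining marks, as in RemoveDiacritics.
            if (CharUnicodeInfo.GetUnicodeCategory(ch) == UnicodeCategory.NonSpacingMark)
                continue;

            // Replace letters that do not decompose, if the table lists them.
            string replacement;
            if (Substitutions.TryGetValue(ch, out replacement))
                sb.Append(replacement);
            else
                sb.Append(ch);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```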
164

If you don't need to convert the string and you just want to check for equality you can use

string s1 = "hello";
string s2 = "héllo";

if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0)
{
    // both strings are equal
}

or if you want the comparison to be case insensitive as well

string s1 = "HEllO";
string s2 = "héLLo";

if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0)
{
    // both strings are equal
}
Treadmill answered 11/10, 2011 at 2:48 Comment(7)
If anyone else is curious about this IgnoreNonSpace option, you might want to read this discussion on it: pcreview.co.uk/forums/accent-insensitive-t3924592.html TL;DR: it's OK :) - Spoofery
On MSDN: "The Unicode Standard defines combining characters as characters that are combined with base characters to produce a new character. Nonspacing combining characters do not occupy a spacing position by themselves when rendered." - Omasum
OK, this method failed for these 2 strings: tarafli / TARAFLİ. However, SQL Server says they are equal, as it is supposed to. - Trussing
That is because SQL Server is generally configured to be case insensitive, but by default comparisons in .NET are case sensitive. I've updated the answer to show how to make this case insensitive. - Treadmill
I'm trying to create an IEqualityComparer. It needs to provide GetHashCode... How do you get that (it needs to be the same if the strings are equal)? - Waxwing
In case someone is interested in the hash code: CultureInfo.CurrentCulture.CompareInfo.GetHashCode(obj, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) - Waxwing
Even better, with .NET Core we can get a StringComparer: StringComparer.Create(CultureInfo.CurrentCulture, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace). (Not available for .NET Framework, unless going through reflection.) - Lariat
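Putting the last few comments together, an IEqualityComparer could look roughly like the sketch below. The class name AccentInsensitiveComparer is made up for illustration, and the sketch assumes a runtime where the CompareInfo.GetHashCode(string, CompareOptions) overload mentioned above is available.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

sealed class AccentInsensitiveComparer : IEqualityComparer<string>
{
    private const CompareOptions Options =
        CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

    private readonly CompareInfo _compareInfo;

    public AccentInsensitiveComparer(CultureInfo culture)
    {
        _compareInfo = culture.CompareInfo;
    }

    // Two strings are equal when the culture-aware comparison,
    // ignoring accents and case, reports no difference.
    public bool Equals(string x, string y)
    {
        return _compareInfo.Compare(x, y, Options) == 0;
    }

    // The hash code is computed with the same options so that strings
    // that compare as equal also hash to the same value.
    public int GetHashCode(string s)
    {
        return s == null ? 0 : _compareInfo.GetHashCode(s, Options);
    }
}
```

It can then be passed to collections, for example new HashSet<string>(new AccentInsensitiveComparer(CultureInfo.InvariantCulture)). Using the same CompareOptions in both methods is the important design point; mixing options would break the Equals/GetHashCode contract.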
6

I had to do something similar but with a StartsWith method. Here is a simple solution derived from @Serge - appTranslator.

Here is an extension method:

    public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
    {
        if (str.Length >= value.Length)
            return string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
        else
            return false;            
    }

And for one-liner freaks ;)

    public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
    {
        return str.Length >= value.Length && string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
    }

An accent-insensitive and case-insensitive StartsWith can be called like this:

value.ToString().StartsWith(str, CultureInfo.InvariantCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase)
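As a possible alternative to the Substring approach, the framework's CompareInfo.IsPrefix method accepts CompareOptions directly. This avoids assuming that the matching prefix has exactly value.Length chars, which may not hold when one string uses decomposed characters (for example 'e' followed by a combining accent). The sketch below is a variation, not the method from this answer; the class and method names are made up for illustration.

```csharp
using System.Globalization;

static class StringPrefixExtensions
{
    // Culture-aware prefix test that ignores accents and case.
    public static bool StartsWithIgnoringAccents(
        this string str, string value, CultureInfo culture)
    {
        return culture.CompareInfo.IsPrefix(
            str, value, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);
    }
}
```

For example, "héllo world".StartsWithIgnoringAccents("HEL", CultureInfo.InvariantCulture) should evaluate to true.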
Shoulders answered 19/12, 2013 at 16:15 Comment(0)
5

The following method CompareIgnoreAccents(...) works on your example data. Here is the article where I got my background information: http://www.codeproject.com/KB/cs/EncodingAccents.aspx

private static bool CompareIgnoreAccents(string s1, string s2)
{
    return string.Compare(
        RemoveAccents(s1), RemoveAccents(s2), StringComparison.InvariantCultureIgnoreCase) == 0;
}

private static string RemoveAccents(string s)
{
    Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");

    return destEncoding.GetString(
        Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
}

I think an extension method would be better:

public static string RemoveAccents(this string s)
{
    Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");

    return destEncoding.GetString(
        Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
}

Then the use would be this:

if(string.Compare(s1.RemoveAccents(), s2.RemoveAccents(), true) == 0) {
   ...
Sevenfold answered 11/12, 2008 at 16:57 Comment(3)
This turns accented letters into '?'. - Viticulture
This is a destructive comparison, where for instance ā and ē will be treated as equal. You lose any characters above 0xFF and there's no guarantee that the strings are equal-ignoring-accents. - Paprika
You also lose things like ñ. Not a solution if you ask me. - Desiree
0

A simpler way to remove accents:

    Dim source As String = "áéíóúç"
    Dim result As String

    Dim bytes As Byte() = Encoding.GetEncoding("Cyrillic").GetBytes(source)
    result = Encoding.ASCII.GetString(bytes)
Extensor answered 1/9, 2014 at 13:5 Comment(1)
This is a destructive comparison, where you lose lots of characters. For example, ā and ē might be treated as equal. Not a good solution for me. Instead, I recommend looking at System.Text.NormalizationForm, which is for Unicode normalization. - Amal
-4

Try this overload of the String.Compare method:

String.Compare Method (String, String, Boolean, CultureInfo)

It produces an int value based on the comparison, taking the CultureInfo into account. The example on the page compares "change" with "dollar" in en-US and cs-CZ. In cs-CZ, "ch" is a single "letter".

Example from the link:

using System;
using System.Globalization;

class Sample {
    public static void Main() {
    String str1 = "change";
    String str2 = "dollar";
    String relation = null;

    relation = symbol( String.Compare(str1, str2, false, new CultureInfo("en-US")) );
    Console.WriteLine("For en-US: {0} {1} {2}", str1, relation, str2);

    relation = symbol( String.Compare(str1, str2, false, new CultureInfo("cs-CZ")) );
    Console.WriteLine("For cs-CZ: {0} {1} {2}", str1, relation, str2);
    }

    private static String symbol(int r) {
    String s = "=";
    if      (r < 0) s = "<";
    else if (r > 0) s = ">";
    return s;
    }
}
/*
This example produces the following results.
For en-US: change < dollar
For cs-CZ: change > dollar
*/

Therefore, for accented languages you will need to get the culture and then compare the strings based on it.

http://msdn.microsoft.com/en-us/library/hyxc48dt.aspx

Consent answered 11/12, 2008 at 16:7 Comment(1)
This is a better approach than directly comparing the strings, but it still considers the base letter and its accented version different. Therefore it doesn't answer the original question, which wanted accents to be ignored. - Emptor
