Can I obtain the result string used for comparisons with CompareOptions?

I have a custom IComparer<string> which I use to compare strings while ignoring their case and symbols, like this:

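using System.Collections.Generic;
using System.Globalization;

public class LiberalStringComparer : IComparer<string>
{
    private readonly CompareInfo _compareInfo = CultureInfo.InvariantCulture.CompareInfo;
    // IgnoreCase rather than OrdinalIgnoreCase: OrdinalIgnoreCase must be used alone
    // and cannot be combined with other CompareOptions flags.
    private const CompareOptions COMPARE_OPTIONS = CompareOptions.IgnoreSymbols | CompareOptions.IgnoreCase;

    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0; // same instance, or both null
        if (x == null) return -1;
        if (y == null) return 1;

        return this._compareInfo.Compare(x, y, COMPARE_OPTIONS);
    }
}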

Can I obtain the output string which is, ultimately, used for the comparison?

My final goal is to produce an IEqualityComparer<string> which ignores symbols and casing in the same way as this comparer.

I could write a regex to do this, but there's no guarantee that my regex would follow the same logic as the built-in comparison options.

Breakable answered 16/4, 2014 at 19:18 Comment(5)
If you're interested in just Equals, you can do yourComparer.Compare(x, y) == 0. - Well
@SriramSakthivel Yes, but that doesn't fulfill all the requirements of IEqualityComparer. I will still need GetHashCode. - Breakable
Is it possible that the strings should be parsed into a new, consistent structure (or converted to a consistent sort of string) in a certain way before comparing them? It'd make all of this much simpler, conceptually. - Epicurean
@TimS. I believe that your answer relies upon that very approach! The reality in that case would be that I wouldn't be able to use the built-in rules for the parsing but would use my own well-defined rules. - Breakable
In a sense, but if you can move the logic from comparing with your options to parsing in a way you know, and then represent it in a simpler fashion, then understanding and comparing your data could become much simpler. I don't know what your data represents, but e.g. if it were dollar/currency amounts, you might parse them as decimal first. - Epicurean
E
1

There is probably no such "output string". I'd implement your Equals this way:

return liberalStringComparer.Compare(x, y) == 0;

GetHashCode is more complicated.

Some approaches:

  1. Use a poor implementation like return 0; (which means you always have to run a Compare to know if they're equal).
  2. Since your comparison is relatively simple (invariant culture, ignoring case and symbols), you should be able to write a hash that works for most input. Without extensive study of Unicode and thorough testing, however, I wouldn't recommend assuming that it stays consistent with Compare for any valid Unicode string from any culture.

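    A rough sketch:

    public int GetHashCode(string value)
    {
        if (value == null) return 0;
        int hash = 17;
        for (int i = 0; i < value.Length; i++)
        {
            // Approximation only: IgnoreSymbols also ignores whitespace and
            // punctuation, which char.IsSymbol does not report as symbols.
            if (!char.IsSymbol(value, i))
                hash = unchecked(hash * 31 + char.ToUpperInvariant(value[i]));
        }
        // Combining scheme as in https://mcmap.net/q/30334/-what-is-the-best-algorithm-for-overriding-gethashcode
        return hash;
    }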
    
  3. Form a string by removing every character for which char.IsSymbol is true, then call StringComparer.InvariantCultureIgnoreCase.GetHashCode on it.
  4. CompareInfo.GetSortKey's hash code should be a suitable value.

    public int GetHashCode(string value)
    {
        return _compareInfo.GetSortKey(value, COMPARE_OPTIONS).GetHashCode();
    }
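
Combining approach 4 with the Equals shown at the top of this answer, a minimal sketch of the equality comparer might look like the following. The class name LiberalStringEqualityComparer is hypothetical, and the sketch assumes CompareOptions.IgnoreCase in place of OrdinalIgnoreCase, because the documentation states that OrdinalIgnoreCase cannot be combined with other flags. It relies on the expectation that strings which compare equal under the given options also produce equal sort keys.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical name; not part of the original question.
public sealed class LiberalStringEqualityComparer : IEqualityComparer<string>
{
    private static readonly CompareInfo Info = CultureInfo.InvariantCulture.CompareInfo;

    // Assumption: IgnoreCase stands in for OrdinalIgnoreCase, which must be used alone.
    private const CompareOptions Options =
        CompareOptions.IgnoreSymbols | CompareOptions.IgnoreCase;

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;    // same instance, or both null
        if (x == null || y == null) return false;  // exactly one is null
        return Info.Compare(x, y, Options) == 0;
    }

    public int GetHashCode(string value)
    {
        if (value == null) return 0;
        // The sort key is built with the same options as Compare, so its hash
        // should agree with Equals for strings that compare as equal.
        return Info.GetSortKey(value, Options).GetHashCode();
    }
}
```

With this in place, new HashSet<string>(new LiberalStringEqualityComparer()) should treat strings that differ only in case or in ignored symbols as the same element. Building a sort key allocates for every call, so this favors correctness over speed; on newer runtimes, CompareInfo.GetHashCode(string, CompareOptions) may be a more direct alternative.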
    
Epicurean answered 16/4, 2014 at 19:37 Comment(3)
I suppose I could remove all chars which return true for char.IsSymbol || char.IsWhiteSpace and then call StringComparer.InvariantCultureIgnoreCase.GetHashCode on the resulting strings... Alternatively, I could use the GetUnicodeCategory method and explicitly exclude categories. - Breakable
It appears that char.IsSymbol doesn't return true for whitespace. I think, instead, I want to explicitly include only characters for which char.IsLetterOrDigit is true. - Breakable
I selected this answer because I had to create my own reliable string processor to remove spaces and symbols, and then I used the built-in StringComparer.InvariantCultureIgnoreCase on the result. - Breakable
T
2

An interesting question. Internally, CompareInfo.Compare calls the InternalCompareString method, which is imported from COMNlsInfo::InternalCompareString in clr.dll:

// Compare a string using the native API calls -- COMNlsInfo::InternalCompareString   
...
private static extern int InternalCompareString(IntPtr handle, 
             IntPtr handleOrigin, String localeName, String string1, int offset1, 
             int length1, String string2, int offset2, int length2, int flags);

In other words, as you can't be sure about the logic of the built-in function, maybe you should write your own and reuse it in both IEqualityComparer and IComparer implementations.
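
One way to follow this advice is to normalize each string with explicit rules and build both comparers on the normalized form. The sketch below keeps only letters and digits and folds case with char.ToUpperInvariant; the names NormalizedStringComparer and Normalize are hypothetical, and the rules are an illustration of a well-defined custom scheme, not a replica of the built-in CompareOptions behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: one explicit normalization shared by both interfaces.
public sealed class NormalizedStringComparer : IComparer<string>, IEqualityComparer<string>
{
    // Keeps letters and digits only, upper-cased with invariant rules.
    // Characters outside the Basic Multilingual Plane (surrogate pairs) are dropped.
    public static string Normalize(string value)
    {
        if (value == null) return null;
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (char.IsLetterOrDigit(c))
                sb.Append(char.ToUpperInvariant(c));
        }
        return sb.ToString();
    }

    public int Compare(string x, string y)
    {
        // string.CompareOrdinal orders null before any non-null string.
        return string.CompareOrdinal(Normalize(x), Normalize(y));
    }

    public bool Equals(string x, string y)
    {
        return string.Equals(Normalize(x), Normalize(y), StringComparison.Ordinal);
    }

    public int GetHashCode(string value)
    {
        string normalized = Normalize(value);
        return normalized == null ? 0 : StringComparer.Ordinal.GetHashCode(normalized);
    }
}
```

Because Compare, Equals and GetHashCode all go through the same Normalize method, they cannot disagree with each other, which is the property the built-in options make hard to verify. The trade-off is that the ordering is ordinal rather than linguistic.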

Townsfolk answered 16/4, 2014 at 19:35 Comment(1)
+1 for demonstrating that I can't reliably replicate the built-in function, and should use a custom and well-defined IComparer<string> and IEqualityComparer<string>. - Breakable