Folding case to speed up comparisons

Asked 5/4, 2018 at 20:1 Answered 15/5 at 6:28

"strasse".Equals("STRAße",StringComparison.InvariantCultureIgnoreCase)

This returns true, which is correct. Unfortunately, when I store these values in Postgres, it treats them as different when doing a case-insensitive match (for example, with ~*). I've also tested with citext, with the same result.

So one solution would be to pre-fold the case, thus storing strasse for either of these values, in another column. I could then index and search on that for matches.

I've been looking for a way to fold case in C# for a while and haven't been able to find one. The knowledge is obviously there somewhere, because the framework compares these strings properly; I just can't find where to get at it.

One workaround would be to spawn a perl process, perl -E "binmode STDOUT, ':utf8'; binmode STDIN, ':utf8'; while (<>) { print fc }", set the C# side of those pipes to UTF-8 as well, and send the text through perl to fold the case. But there has to be a better way than that.

Schizont answered 5/4, 2018 at 20:1 Comment(6)

Related 1 – Homopolar 5/4, 2018 at 20:6

Related 2 – Homopolar 5/4, 2018 at 20:6

Library UnidecodeSharp could be helpful for this. – Cadency 5/4, 2018 at 21:2

Ah the good old curse of different implementation of collation :-) – Statute 24/12, 2020 at 22:50

What about string.Equals(str1,str2,StringComparison.CurrentCulture) ? – Icing 27/12, 2020 at 17:31

@Icing how does that help do this case-insensitive comparison in postgres ? – Schizont 28/12, 2020 at 2:25

Looking through the sources, I eventually found that most of this implementation lives in a set of classes called CompareInfo.

You can find these at github.com/dotnet/runtime

That led me to this page, which explains the inner workings of the .NET culture handling: .NET globalization and ICU

It seems that .NET relies entirely on native libraries for everything except ordinal operations.

I would assume from this that the .NET Framework uses NLS from Win32. For that, there is the FoldStringW function, which looks promising.

For ICU, there is documentation on Case Mappings, and I found the u_strFoldCase function.

Outlander answered 24/12, 2020 at 23:39 Comment(0)

There is string.Normalize(), which takes a NormalizationForm parameter. Michael Kaplan goes into detail on this. He claims it does a better job than FoldStringW.

It does not, however, normalize the case to upper or lower; it only folds to the canonical form. I would suggest you just apply ToUpper or ToLower afterwards.

Determine answered 29/12, 2020 at 23:14 Comment(5)

The entire original point was to normalise case, although normalising the rest of unicode with combining characters and all that would likely also play a role in matching stuff later. – Schizont 31/12, 2020 at 4:14

When I said "normalize case" I meant specifically to upper or lower case, rather than folding to canonical forms, which was also part of the question. – Determine 31/12, 2020 at 4:27

ToUpper / ToLower don't work for case-insensitive matches in all languages, that's part of the problem. – Schizont 31/12, 2020 at 14:23

Even after string.Normalize? – Determine 31/12, 2020 at 14:32

Did you try it with the above strings? I just pushed it through net core 3.1, and, unsurprisingly, it doesn't do what is required. – Schizont 2/1, 2021 at 20:33
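The result described in the last comment can be reproduced with a small sketch. The helper name NaiveFold.Fold is hypothetical, and FormKC is just one choice of normalization form; the point is only that Normalize followed by ToLowerInvariant leaves the ß in place, because the case conversion maps characters one to one and never expands ß to ss.

```csharp
using System;
using System.Text;

public static class NaiveFold
{
    // Hypothetical helper: compatibility normalization, then invariant lowercasing.
    // This is NOT Unicode case folding; it only approximates it.
    public static string Fold(string s) =>
        s.Normalize(NormalizationForm.FormKC).ToLowerInvariant();

    public static void Main()
    {
        string a = Fold("strasse");
        string b = Fold("STRAße");

        // Ordinal comparison of the two "folded" values, as an index lookup would do.
        // The ß survives both steps, so the values still differ.
        Console.WriteLine(string.Equals(a, b, StringComparison.Ordinal)); // False
    }
}
```

So a column filled this way would still hold two different values for the two spellings.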

This could be more than you need, but from Unicode Technical Report #36:

The Unicode property [NFKC_Casefold] can be used to get a combined casefolding, normalization, and removal of default-ignorable code points.

This is implemented in the ICU library wrapper for .NET. A call would look like this:

Icu.Normalization.Normalizer2.GetNFKCCasefoldInstance().Normalize(mystring)
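As a rough sketch of how that call might be used to produce the value for the extra column: this assumes the ICU wrapper package is referenced and that native ICU libraries are available at runtime. The Icu.Wrapper.Init() and Icu.Wrapper.Cleanup() calls and the IcuFold.Fold helper name are assumptions for illustration, not something taken from the question.

```csharp
using System;
using Icu.Normalization;

public static class IcuFold
{
    // Hypothetical helper wrapping the call shown above.
    public static string Fold(string s) =>
        Normalizer2.GetNFKCCasefoldInstance().Normalize(s);

    public static void Main()
    {
        Icu.Wrapper.Init(); // assumption: the wrapper must be initialized before use
        try
        {
            string a = Fold("strasse");
            string b = Fold("STRAße");

            // Under Unicode full case folding ß folds to "ss",
            // so the two values are expected to compare equal ordinally.
            Console.WriteLine(string.Equals(a, b, StringComparison.Ordinal));
        }
        finally
        {
            Icu.Wrapper.Cleanup(); // assumption: releases the native ICU handles
        }
    }
}
```

The folded string can then be stored and indexed in Postgres and compared with a plain equality match.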

A good overview: Truths programmers should know about case

Lattonia answered 15/5 at 6:28 Comment(0)
