Normalization of Strings With String.ToUpperInvariant()
Asked Answered
T

4

12

I am currently storing normalized versions of strings in my SQL Server database in lower case. For example, in my Users table, I have a UserName and a LoweredUserName field. Depending on the context, I either use T-SQL's LOWER() function or C#'s String.ToLower() method to generate the lower case version of the user name to fill the LoweredUserName field. According to Microsoft's guidelines and Visual Studio's code analysis rule CA1308, I should be using C#'s String.ToUpperInvariant() instead of ToLower(). According to Microsoft, this is both a performance and globalization issue: converting to upper case is safe, while converting to lower case can cause a loss of information (for example, the Turkish 'I' problem).

If I move to using ToUpperInvariant for string normalization, I will have to change my database schema as well, since my schema is based on Microsoft's ASP.NET Membership framework (see this related question), which normalizes strings to lower case.

Isn't Microsoft contradicting itself by telling us to use upper case normalization in C#, while it's own code in the Membership tables and procedures is using lower case normalization? Should I switch everything to upper case normalization, or just continue using lower case normalization?

Trott answered 21/4, 2009 at 17:32 Comment(1)
Thank you for mentioning the Turkish 'I' problem -- a little note: As long as we're talking InvariantCulture, neither lowercasing nor uppercasing is safe for the Turkish locale. As it is now, CA1308 is a silly rule.Ceramal
H
4

To answer your first question, yes Microsoft is a bit inconsistent. To answer your second question, no do not switch anything until you have confirmed that this is causing a bottleneck in your application.

Think how much forward progress you can make on you project instead of wasting time switching everything. Your development time is much more valuable than the savings you would get from such a change.

Remember:

Premature optimization is the root of all evil (or at least most of it) in programming. - Donald Knuth

Horsefly answered 21/4, 2009 at 17:38 Comment(2)
This isn't just a performance issue, it is also a globalization issue. According to Microsoft, converting to upper case is safe, while converting to lower case can cause a loss of information (for example, in Turkish 'I' problem).Trott
@Kevin, the Turkish/Azeri dotless I problem remains a special case whichever approach is used (they uppercase i to İ and ı to I), though lowercasing is ambiguous for SS (should it be ss or ß) but that's imperfect too (some orthographies still uppercase ß to SZ). It is still better though. Better still is to use Unicode case-folding rules with a Turkic switch for i and ı, but it still won't be perfect, that can only be per locale :(Saguache
Q
9

According to CA1308, the reason to do this is that some characters cannot be roundtrip converted from upper to lower case. The important thing is that you always move in one direction, so if your standard is to always move to lower case then there is no reason to change it.

Quillan answered 21/4, 2009 at 17:44 Comment(3)
I like this approach. If starting from scratch, following the recommended standard is always the best practise in light of no other motivation to do otherwise, but when working on existing maintenance, it's often folly just to switch because it says so. You need compelling evidence that your project will benefit from the change before embarking on such an overhaul - maybe when you start processing Turkish and hit a problem?Hydrostatic
I totally agree, Jeff, there are some guidance which you should follow and I would say might be worth upgrading existing code to follow (making sure you dispose your data reader for example). This however isn't one of those rules nor is it one that is even close.Quillan
CA1308 makes no sense to me because some characters cannot make the roundtrip when normalized to uppercase either (see: Turkish İ problem). There is nothing wrong with normalizing as long as you know what you are doing. If not, a better warning would be to avoid normalization altogether. In this case, MS is recommending opting for a hand grenade instead of a land mine.Ceramal
H
4

To answer your first question, yes Microsoft is a bit inconsistent. To answer your second question, no do not switch anything until you have confirmed that this is causing a bottleneck in your application.

Think how much forward progress you can make on you project instead of wasting time switching everything. Your development time is much more valuable than the savings you would get from such a change.

Remember:

Premature optimization is the root of all evil (or at least most of it) in programming. - Donald Knuth

Horsefly answered 21/4, 2009 at 17:38 Comment(2)
This isn't just a performance issue, it is also a globalization issue. According to Microsoft, converting to upper case is safe, while converting to lower case can cause a loss of information (for example, in Turkish 'I' problem).Trott
@Kevin, the Turkish/Azeri dotless I problem remains a special case whichever approach is used (they uppercase i to İ and ı to I), though lowercasing is ambiguous for SS (should it be ss or ß) but that's imperfect too (some orthographies still uppercase ß to SZ). It is still better though. Better still is to use Unicode case-folding rules with a Turkic switch for i and ı, but it still won't be perfect, that can only be per locale :(Saguache
C
0

For anyone else who lands here wondering why "uppercase" is recommended, here is a demonstration of the so called "Turkish I Problem":

using System;
                    
public class Program
{
    public static void Main()
    {
        System.Threading.Thread.CurrentThread.CurrentCulture = new System.Globalization.CultureInfo("tr-TR");
        var input = "\u0131"; // LATIN SMALL LETTER DOTLESS I
        Console.WriteLine("Lower then upper equals original: " + (input.ToLower().ToUpper() == input));
        Console.WriteLine("Upper then lower equals original: " + (input.ToUpper().ToLower() == input));
    }
}

The above code produces the following output:

Lower then upper equals original: False
Upper then lower equals original: True

This shows that when using the Turkish culture (i.e. tr-TR), if you normalize strings by converting them all to uppercase, you will get the original string if you later convert those uppercase strings to lowercase. If instead you normalize to lowercase, you won't be able to get back to the original string (i.e. you can't roundtrip).

I am not sure how all this plays out with other languages (Unicode is messy business) but at least for Turkish one can see why uppercase is recommended over lowercase.

Condescendence answered 17/9, 2023 at 14:46 Comment(0)
A
-2

Continue using lower case normalization. Only change to conform to Microsoft standards if a large issue develops.

This is unfortunate, but worthwhile. Sadly, Microsoft "standards" tend to be poorly considered and somewhat less than consistent; experience with them has shown that unless there is a compelling reason, it's best to simply stick with what works while it works. Note that this is generally NOT true of non-Microsoft technologies; but the arbitrariness of the Microsoft "standards" makes them worth avoiding.

Edit: I should clarify here; my opinion of Microsoft is very low, from long experience with their standards. As was pointed out in the comments, I don't have particular references to point out about "everybody else other than Microsoft"; this just comes from my personal experience. Your Mileage May Vary widely. This answer should be considered really just my opinion. Sorry for not making that more clear earlier.

Allin answered 21/4, 2009 at 17:44 Comment(3)
I think you need to cite some sources before making claims of "everyone but Microsoft" when it comes to standards. In recent years, Microsoft seem to take great care in researching the motivations behind their standards and while their implementation of web standards in IE has been far from ideal, the standards they define for us to work within their products are often excellent. Please, back up your statements, lest they be interpreted as bitter opinion.Hydrostatic
I agree Jeff, their standards are very consistent their adoption of their standards are less so but this is expected, code that was written prior to a standard being adopted won't be updated just to bring it into adherance, Imagine if they changed all their name spaces to reflect their new approach for picking namespaces and all the developers who would scream bloody murder.Quillan
Your points are both good; my position does in fact come from pretty much bitter opinion and lots and lots of crappy experience with Microsoft. I'll update to reflect that.Allin

© 2022 - 2024 — McMap. All rights reserved.