In C# what is the difference between ToUpper() and ToUpperInvariant()?

6

160

In C#, what is the difference between ToUpper() and ToUpperInvariant()?

Can you give an example where the results might be different?

Luu answered 23/8, 2010 at 17:49 Comment(0)
179

ToUpper uses the current culture. ToUpperInvariant uses the invariant culture.

The canonical example is Turkish, where the upper case of "i" isn't "I" but the dotted "İ" (U+0130).

Sample code showing the difference:

using System;
using System.Drawing;
using System.Globalization;
using System.Threading;
using System.Windows.Forms;

public class Test
{
    [STAThread]
    static void Main()
    {
        string invariant = "iii".ToUpperInvariant();
        CultureInfo turkey = new CultureInfo("tr-TR");
        Thread.CurrentThread.CurrentCulture = turkey;
        string cultured = "iii".ToUpper();

        Font bigFont = new Font("Arial", 40);
        Form f = new Form {
            Controls = {
                new Label { Text = invariant, Location = new Point(20, 20),
                            Font = bigFont, AutoSize = true},
                new Label { Text = cultured, Location = new Point(20, 100),
                            Font = bigFont, AutoSize = true }
            }
        };        
        Application.Run(f);
    }
}

For more on Turkish, see this Turkey Test blog post.

I wouldn't be surprised to hear that there are various other capitalization issues around elided characters etc. This is just one example I know off the top of my head... partly because it bit me years ago in Java, where I was upper-casing a string and comparing it with "MAIL". That didn't work so well in Turkey...
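
The "MAIL" comparison described above can be reproduced with a short sketch. The names here (`MailCheck`, `IsMailCultureSensitive`, `IsMailOrdinal`) are made up for illustration, and the Turkish result assumes a runtime with full culture data rather than invariant-globalization mode. An ordinal, case-insensitive comparison is one common way to sidestep the problem; it is not necessarily how the original Java code was fixed.

```csharp
using System;
using System.Globalization;

public static class MailCheck
{
    // Fragile: the result depends on the culture used for upper-casing.
    public static bool IsMailCultureSensitive(string input, CultureInfo culture)
    {
        return input.ToUpper(culture) == "MAIL";
    }

    // Robust: an ordinal comparison ignores culture-specific casing rules.
    public static bool IsMailOrdinal(string input)
    {
        return string.Equals(input, "MAIL", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Assumption: the tr-TR culture is available on this machine.
        var turkish = new CultureInfo("tr-TR");

        Console.WriteLine(IsMailCultureSensitive("mail", CultureInfo.InvariantCulture)); // True
        Console.WriteLine(IsMailCultureSensitive("mail", turkish)); // False under Turkish casing rules ("MAİL")
        Console.WriteLine(IsMailOrdinal("mail")); // True
    }
}
```

For protocol keywords, file extensions, and other non-linguistic text, comparing with `StringComparison.OrdinalIgnoreCase` avoids the need to upper-case at all.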

Tratner answered 23/8, 2010 at 17:51 Comment(2)
haha I read that thinking... "'Turkey' doesn't have a letter 'i' in it" - Tali
It's almost 2019 and Visual Studio is suggesting ımage as a field name for Image, and Unity 3D is spamming the console with an internal error, Unable to find key name that matches 'rıght', on an "English" Windows with Turkey regional settings for date and time. Looks like sometimes even Microsoft fails the Turkey test, and the PC's language isn't even Turkish, just lol. - Rooke
30

Jon's answer is perfect. I just wanted to add that ToUpperInvariant is the same as calling ToUpper(CultureInfo.InvariantCulture).

That makes Jon's example a little simpler:

using System;
using System.Drawing;
using System.Globalization;
using System.Threading;
using System.Windows.Forms;

public class Test
{
    [STAThread]
    static void Main()
    {
        string invariant = "iii".ToUpper(CultureInfo.InvariantCulture);
        string cultured = "iii".ToUpper(new CultureInfo("tr-TR"));

        Application.Run(new Form {
            Font = new Font("Times New Roman", 40),
            Controls = { 
                new Label { Text = invariant, Location = new Point(20, 20), AutoSize = true }, 
                new Label { Text = cultured, Location = new Point(20, 100), AutoSize = true }, 
            }
        });
    }
}

I also used Times New Roman because it's a cooler font.

I also set the Form's Font property instead of the two Label controls because the Font property is inherited.

And I reduced a few other lines just because I like compact (example, not production) code.

I really had nothing better to do at the moment.

Semantic answered 23/8, 2010 at 19:13 Comment(2)
The ToUpper method doesn't have any overload with a parameter for me. Did an older version have one? I don't get it. - Treehopper
I don't know, it's documented here: msdn.microsoft.com/en-us/library/system.string.toupper.aspx - Semantic
27

String.ToUpper and String.ToLower can give different results for different cultures. The best-known example is Turkish, where converting the lowercase Latin "i" to uppercase doesn't result in the capital Latin "I", but in the Turkish dotted "İ".

[Image: Capitalization of I depending on culture. Upper row: lower case letters. Lower row: upper case letters.]

Since I found it confusing even with the above picture (source), I wrote a program (see source code below) to see the exact output for the Turkish example:

# Lowercase letters
Character              | UpperInvariant | UpperTurkish | LowerInvariant | LowerTurkish
English i - i (\u0069) | I (\u0049)     | İ (\u0130)   | i (\u0069)     | i (\u0069)
Turkish i - ı (\u0131) | ı (\u0131)     | I (\u0049)   | ı (\u0131)     | ı (\u0131)

# Uppercase letters
Character              | UpperInvariant | UpperTurkish | LowerInvariant | LowerTurkish
English i - I (\u0049) | I (\u0049)     | I (\u0049)   | i (\u0069)     | ı (\u0131)
Turkish i - İ (\u0130) | İ (\u0130)     | İ (\u0130)   | İ (\u0130)     | i (\u0069)

As you can see:

  1. Uppercasing lower case letters and lowercasing upper case letters give different results for the invariant culture and the Turkish culture.
  2. Uppercasing upper case letters and lowercasing lower case letters have no effect, no matter what the culture is.
  3. CultureInfo.InvariantCulture leaves the Turkish characters (ı and İ) as they are.
  4. ToUpper and ToLower are reversible for these characters: lowercasing a character after uppercasing it brings it back to its original form, as long as the same culture is used for both operations.

According to MSDN, for Char.ToUpper and Char.ToLower Turkish and Azeri are the only affected cultures because they are the only ones with single-character casing differences. For strings, there might be more cultures affected.
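
To see which cultures are affected on a given machine, one option is to enumerate the installed cultures and compare their upper-casing of "i" with the invariant result. This is only a sketch: `CasingSurvey` and `CulturesWithDifferentUpperI` are hypothetical names, and the list printed depends on the runtime and on the culture data available, so no particular output is claimed here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class CasingSurvey
{
    // Returns the names of cultures whose upper-casing of "i"
    // differs from the invariant culture's result.
    public static List<string> CulturesWithDifferentUpperI()
    {
        string invariantUpper = "i".ToUpperInvariant();

        return CultureInfo.GetCultures(CultureTypes.AllCultures)
            .Where(culture => "i".ToUpper(culture) != invariantUpper)
            .Select(culture => culture.Name)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        // The exact list varies with the installed culture data.
        foreach (string name in CulturesWithDifferentUpperI())
        {
            Console.WriteLine(name);
        }
    }
}
```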


Source code of a console application used to generate the output:

using System;
using System.Globalization;
using System.Linq;
using System.Text;

namespace TurkishI
{
    class Program
    {
        static void Main(string[] args)
        {
            // Without this, some consoles render İ and ı as plain I and i.
            Console.OutputEncoding = Encoding.UTF8;

            const string header = "Character              | UpperInvariant | UpperTurkish | LowerInvariant | LowerTurkish";

            var englishI = new UnicodeCharacter('\u0069', "English i");
            var turkishI = new UnicodeCharacter('\u0131', "Turkish i");

            Console.WriteLine("# Lowercase letters");
            Console.WriteLine(header);
            WriteToConsole(englishI);
            WriteToConsole(turkishI);

            Console.WriteLine("\n# Uppercase letters");
            var uppercaseEnglishI = new UnicodeCharacter('\u0049', "English i");
            var uppercaseTurkishI = new UnicodeCharacter('\u0130', "Turkish i");
            Console.WriteLine(header);
            WriteToConsole(uppercaseEnglishI);
            WriteToConsole(uppercaseTurkishI);

            Console.ReadKey();
        }

        // Prints one table row: the character followed by its four case conversions.
        static void WriteToConsole(UnicodeCharacter character)
        {
            Console.WriteLine("{0,-9} - {1,10} | {2,-14} | {3,-12} | {4,-14} | {5,-12}",
                character.Description,
                character,
                character.UpperInvariant,
                character.UpperTurkish,
                character.LowerInvariant,
                character.LowerTurkish
            );
        }
    }


    class UnicodeCharacter
    {
        public static readonly CultureInfo TurkishCulture = new CultureInfo("tr-TR");

        public char Character { get; }

        public string Description { get; }

        public UnicodeCharacter(char character) : this(character, string.Empty) {  }

        public UnicodeCharacter(char character, string description)
        {
            if (description == null) {
                throw new ArgumentNullException(nameof(description));
            }

            Character = character;
            Description = description;
        }

        public string EscapeSequence => ToUnicodeEscapeSequence(Character);

        public UnicodeCharacter LowerInvariant => new UnicodeCharacter(Char.ToLowerInvariant(Character));

        public UnicodeCharacter UpperInvariant => new UnicodeCharacter(Char.ToUpperInvariant(Character));

        public UnicodeCharacter LowerTurkish => new UnicodeCharacter(Char.ToLower(Character, TurkishCulture));

        public UnicodeCharacter UpperTurkish => new UnicodeCharacter(Char.ToUpper(Character, TurkishCulture));


        private static string ToUnicodeEscapeSequence(char character)
        {
            var bytes = Encoding.Unicode.GetBytes(new[] {character});
            var prefix = bytes.Length == 4 ? @"\U" : @"\u";
            var hex = BitConverter.ToString(bytes.Reverse().ToArray()).Replace("-", string.Empty);
            return $"{prefix}{hex}";
        }

        public override string ToString()
        {
            return $"{Character} ({EscapeSequence})";
        }
    }
}
Dispersoid answered 28/4, 2016 at 12:2 Comment(2)
The table of cases was very helpful. Thanks! - Meredeth
I would clearly say that this is a total misdesign by Microsoft. If I make an English "i" uppercase, an English "I" should ALWAYS come out. If I make a Turkish "ı" uppercase, a Turkish "İ" should come out. Anything else does not make sense and produces a lot of problems. When I have a 100% English text and make it uppercase, an English text should ALWAYS come out, without any Turkish letters inside. I cannot understand how Microsoft made such a big design error. - Otranto
16

Start with MSDN

http://msdn.microsoft.com/en-us/library/system.string.toupperinvariant.aspx

The ToUpperInvariant method is equivalent to ToUpper(CultureInfo.InvariantCulture)

Just because a capital i is 'I' in English, doesn't always make it so.
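
The equivalence quoted from MSDN can be spot-checked with a few sample strings. This is a minimal sketch, not a proof; `InvariantEquivalence` and `SameAsExplicitInvariant` are illustrative names that do not come from the documentation.

```csharp
using System;
using System.Globalization;

public static class InvariantEquivalence
{
    // True when ToUpperInvariant() and ToUpper(CultureInfo.InvariantCulture) agree for s.
    public static bool SameAsExplicitInvariant(string s)
    {
        return s.ToUpperInvariant() == s.ToUpper(CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Samples include the dotless i (U+0131) and the dotted capital I (U+0130).
        string[] samples = { "iii", "mail", "\u0131i\u0130I" };

        foreach (string sample in samples)
        {
            Console.WriteLine(SameAsExplicitInvariant(sample)); // True for each sample
        }
    }
}
```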

Inbreathe answered 23/8, 2010 at 17:51 Comment(0)
3

ToUpperInvariant uses the rules from the invariant culture

Stubbs answered 23/8, 2010 at 17:52 Comment(0)
0

There is no difference in English. A difference can be found only in the Turkish culture.

Enneahedron answered 23/8, 2010 at 17:53 Comment(9)
And you're sure that Turkish is the only culture in the world that has different rules for upper-case than English? I find that hard to believe. - Kenji
Turkish is the most often used example, but not the only one. And it's the language, not the culture, that has four different I's. Still, +1 for Turkish. - Albeit
Sure, there must be some others. Most people will never meet those languages in programming anyway. - Enneahedron
Sure they will. Web applications are open to the globe and it's good to set your parameters. What if you're operating on a legacy database that doesn't do Unicode? What characters will you accept as a username? What if you have to put customer names into a legacy ERP built on COBOL? There are lots of cases where the culture is important. Not to mention dates and numbers: 4.54 is written 4,54 in some languages. Pretending those other languages don't exist won't get you very far in the long run. - Albeit
Obviously cultures are important for dates and numbers. I'm just saying that most people will never meet the languages which have a different result in ToUpper and ToUpperInvariant. - Enneahedron
@Enneahedron FWIW I just ran into this issue converting musical degrees read in from "i" to "I". Not saying it happens often, but it does happen :) - Historic
Spanish, the second native language in the world (after Chinese), has different upper-case, like África. - Patrickpatrilateral
@LeandroBardelli It seems that "á".ToUpperInvariant() == "Á" and "ñ".ToUpperInvariant() == "Ñ", so could you please explain what you mean when you say that Spanish has different uppercase? - Simplex
According to the link in @Dispersoid's answer, the Azeri language is also affected. - Ballistics
