Upper vs Lower Case

9

107

When doing case-insensitive comparisons, is it more efficient to convert the string to upper case or lower case? Does it even matter?

It is suggested in this SO post that C# is more efficient with ToUpper because "Microsoft optimized it that way." But I've also read this argument that converting ToLower vs. ToUpper depends on what your strings contain more of, and that typically strings contain more lower case characters which makes ToLower more efficient.

In particular, I would like to know:

  • Is there a way to optimize ToUpper or ToLower such that one is faster than the other?
  • Is it faster to do a case-insensitive comparison between upper or lower case strings, and why?
  • Are there any programming environments (e.g. C, C#, Python, whatever) where one case is clearly better than the other, and why?
Phenazine answered 24/10, 2008 at 17:47 Comment(0)
102

Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.

MSDN has some great guidelines on string handling. You might also want to check that your code passes the Turkey test.

EDIT: Note Neil's comment around ordinal case-insensitive comparisons. This whole realm is pretty murky :(
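
A minimal sketch of the StringComparer approach, assuming an ordinal (culture-independent) comparison is what you want. The class and method names (ComparerSketch, IsMailCommand, CountWords) and the sample strings are hypothetical; StringComparer.OrdinalIgnoreCase and the Dictionary constructor that takes a comparer are standard .NET APIs.

```csharp
using System;
using System.Collections.Generic;

static class ComparerSketch
{
    // Hypothetical helper: case-insensitive test without calling ToUpper/ToLower.
    public static bool IsMailCommand(string input) =>
        StringComparer.OrdinalIgnoreCase.Equals(input, "mail");

    // Hypothetical helper: a dictionary whose keys are matched case-insensitively.
    public static Dictionary<string, int> CountWords(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in words)
        {
            counts.TryGetValue(word, out int n); // n is 0 when the key is missing
            counts[word] = n + 1;
        }
        return counts;
    }

    static void Main()
    {
        Console.WriteLine(IsMailCommand("MAIL"));   // True
        var counts = CountWords(new[] { "Mail", "mail", "MAIL", "post" });
        Console.WriteLine(counts["mail"]);          // 3
        Console.WriteLine(counts.Count);            // 2
    }
}
```

For culture-sensitive text you would pick a different comparer (for example StringComparer.CurrentCultureIgnoreCase); which one is appropriate depends on what the strings represent.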

Huge answered 24/10, 2008 at 18:28 Comment(21)
Yes StringComparer is great, but the question wasn't answered... In situations where you can't use StringComparer such as a switch statement against a string; should I ToUpper or ToLower in the switch?Axiology
Use a StringComparer and "if"/"else" instead of using either ToUpper or ToLower.Huge
John, I know that converting to lower case is incorrect, but I had not heard that converting to uppercase is incorrect. Can you offer an example or a reference? The MSDN article you linked to says this: "Comparisons made using OrdinalIgnoreCase are behaviorally the composition of two calls: calling ToUpperInvariant on both string arguments, and doing an Ordinal comparison." In the section titled "Ordinal String Operations", it restates this in code.Lambdacism
That said, I almost always prefer the StringComparer options for performance reasons.Lambdacism
@Neil: Interesting, I hadn't seen that bit. For an ordinal case-insensitive comparison, I guess that's fair enough. It's got to pick something, after all. For culturally-sensitive case-insensitive comparisons, I think there'd still be room for some odd behaviour. Will point out your comment in the answer...Huge
While this answer gives the author a solution, it also dodges the question. The author is seeking information to make an informed choice between ToUpper and ToLower for performance concerns. While offering an alternative 3rd choice (StringComparer) is legitimate, it should at least be framed in context of its performance (not just its correctness) relative to the other two choices. There's no mention of performance in this answer. It would be better if it included something like "StringComparer.Compare is significantly faster (and more correct), even when comparing exclusively ASCII text."Liaotung
@Triynko: I think it's important to concentrate primarily on correctness, with the point that getting the wrong answer fast is usually no better (and is sometimes worse) than getting the wrong answer slowly.Huge
If implementing a case-insensitive hash table, you'll need to choose either upper case or lower case.Martainn
@IanBoyd: Not necessarily. For example, in .NET you'd just create a Dictionary<string, ...> with something like StringComparer.OrdinalIgnoreCase. You only need to be able to test for case-insensitive-equality, and get an appropriate hash code which is consistent with that.Huge
But if you're creating a hash list (as i had to because the language provided none), you have to Hash a case-neutral version (i.e. uppercase)Martainn
@IanBoyd You don't have to convert the case of your keys if you use a hash algorithm that gives the same result when two strings only differ in their casing. Notice that the StringComparer class includes a GetHashCode() method.Lambdacism
@NeilWhitaker But you forget, i was talking about creating a case-insensitive table. For example, the original question is language agnostic. i happen to mainly develop in a language without Dictionary<string, ...> and StringComparer, because those are in a language different than the language that i, or the original poster, are talking about. If you were implementing a hash table, in assembly, what algorithm would you use to create case-insensitive hash codes? If you were down to choosing between uppercasing and lowercasing, the correct answer is uppercasing.Martainn
One year late, I still would like to add this piece of information for any late reader like me: One would not turn the key itself to upper case, but a COPY of it during calculation of the hash. So you keep your key, and hash values will be the same if just cases differ.Aviculture
A long time later... it is difficult in some contexts to use a StringComparer e.g. a LINQ to Objects GroupBy with an anonymous (multi-field) key.Mccaffrey
@NetMage: I agree, that makes it tricky. That doesn't make it correct to just upper-case or lower-case though :(Huge
@Lambdacism his name is Jon, not John.Agnate
@DavidKlempfner: Thanks for the correction. I guess I'm so used to typing "John", I didn't even think about it :)Lambdacism
Tell me, what is so “interesting” about Turkish that you find the most widespread way of doing case-insensitive comparison incorrect?Strother
@Константин Ван: The Turkish "i" problem - see moserware.com/2008/02/does-your-code-pass-turkey-test.html (For example, a couple of decades ago I had code that failed due to "mail".toUpperCase() in Java not being "MAIL" when in Turkey.)Huge
@JonSkeet I’ve just read the article. Well, that’s, rather interestingly nightmarish, I’d say. Didn’t know that; thank you.Strother
@КонстантинВан: Any reason you didn't read the article from the existing link that's been in the answer for over 12 years? Please don't edit an answer just to add another copy of a link that's already there.Huge
39

From Microsoft on MSDN:

Best Practices for Using Strings in the .NET Framework

Recommendations for String Usage

Why? From Microsoft:

Normalize strings to uppercase

There is a small group of characters that when converted to lowercase cannot make a round trip.

What is an example of a character that cannot make a round trip?

  • Start: Greek Rho Symbol (U+03f1) ϱ
  • Uppercase: Capital Greek Rho (U+03a1) Ρ
  • Lowercase: Small Greek Rho (U+03c1) ρ

ϱ , Ρ , ρ

.NET Fiddle

Original: ϱ
ToUpper: Ρ
ToLower: ρ

That is why, if you want to do case-insensitive comparisons, you convert the strings to uppercase, and not lowercase.

So if you have to choose one, choose Uppercase.
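
The code behind the .NET Fiddle link is not reproduced here, so the following is only a sketch of how such a check might look. RoundTripSketch, RoundTrips and Describe are hypothetical names; char.ToUpperInvariant and char.ToLowerInvariant are the real APIs. No expected output is shown for ϱ because the casing tables differ between .NET versions and platforms.

```csharp
using System;
using System.Text;

static class RoundTripSketch
{
    // True when upper-casing and then lower-casing returns the original character.
    public static bool RoundTrips(char c) =>
        char.ToLowerInvariant(char.ToUpperInvariant(c)) == c;

    // Hypothetical helper that formats the three forms for inspection.
    public static string Describe(char c)
    {
        char upper = char.ToUpperInvariant(c);
        char lower = char.ToLowerInvariant(upper);
        return $"Original: {c}  ToUpper: {upper}  ToLower: {lower}";
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;    // so Greek letters print correctly
        Console.WriteLine(Describe('\u03F1'));     // Greek rho symbol; result depends on the runtime
        Console.WriteLine(RoundTrips('\u03F1'));
        Console.WriteLine(RoundTrips('a'));        // True
    }
}
```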

Martainn answered 2/1, 2013 at 20:42 Comment(8)
Back to answering the original question: There are languages with more than one lower case variant for one upper case variant. Unless you know the rules for when to use which representation (another example in Greek: the small sigma letter, where you use σ at the start or in the middle of a word and ς at the word's end; see en.wikipedia.org/wiki/Sigma), you can't safely convert back to the lower case variant.Aviculture
Actually what about German 'ß', if you call ToUpper() it will turn into 'SS' on many systems. So this is actually not round-trip-able either.Bezique
If Microsoft has optimized the code for uppercase comparisons, is it because the ASCII codes for uppercase letters (65-90) have only two digits, while the ASCII codes for lowercase letters (97-122) run to three digits and so need more processing?Blind
It should be noted that both "ϱ" and "ς" return themselves from ToUpperInvariant(), so it would still be nice to see real examples why uppercase is better than lowercaseEttieettinger
This answer does not appear to be relevant. According to the Microsoft link, this only matters when changing the locale of a string: "To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters." But the question does not involve converting to a different locale.Poetess
@Poetess Which is why we have the best practice to use uppercase, and not lowercase - to avoid the exact problems you mentioned. Also it is relevant, because it matters even without changing the locale of a string.Martainn
ϱ != ρ, but if you use upper case, then wouldn't it essentially change both to Ρ, and then compare Ρ == Ρ which would be true, even though ϱ != ρ?Agnate
@MedoMedo No because any number 0-128 takes the same number of bits in any modern computer - computers store numbers in binary, and mostly operate on fixed-width pieces of memory. The number of digits only matters to us humans, and in the surface area between computers and humans (like when a calculator program interprets 99 vs 100, that's one more character to parse, but after the parsing is done, it's going to be the same size of integer internally, so all operations after that are the same speed).Viperish
20

According to MSDN it is more efficient to pass in the strings and tell the comparison to ignore case:

String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase) is equivalent to (but faster than) calling

String.Compare(ToUpperInvariant(strA), ToUpperInvariant(strB), StringComparison.Ordinal).

These comparisons are still very fast.

Of course, if you are comparing one string over and over again then this may not hold.
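
A small sketch of the two forms from the MSDN quote, written as ordinary C#. OrdinalIgnoreCaseSketch, SameIgnoringCase and SameViaUpper are hypothetical names; String.Compare, StringComparison and ToUpperInvariant are the real APIs.

```csharp
using System;

static class OrdinalIgnoreCaseSketch
{
    // Lets the comparison ignore case itself; no upper-cased copies are needed.
    public static bool SameIgnoringCase(string a, string b) =>
        String.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0;

    // The composition from the quote: two upper-cased copies, then an ordinal compare.
    public static bool SameViaUpper(string a, string b) =>
        String.Compare(a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal) == 0;

    static void Main()
    {
        Console.WriteLine(SameIgnoringCase("Hello", "hELLO")); // True
        Console.WriteLine(SameViaUpper("Hello", "hELLO"));     // True
        Console.WriteLine(SameIgnoringCase("Hello", "Help"));  // False
    }
}
```

If one string really is compared over and over, normalizing it once up front is the case where the second form might win; measuring is the only way to know.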

Quinine answered 24/10, 2008 at 17:54 Comment(0)
12

Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).

In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.

If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.

Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)

Reasonless answered 24/10, 2008 at 17:51 Comment(5)
To answer the question why I need to know which is faster: I don't need to know, I merely want to know. :) It's simply a case of seeing somebody make a claim (such as "comparing upper case strings is faster!") and wanting to know whether it is really true and/or why they made that claim.Phenazine
that makes sense - I'm eternally curious on stuff like this, too :)Reasonless
With C strings, to convert s and t to arrays of longs such that the strings are equal iff the arrays are equal you have to walk down s and t until you find the terminating '\0' character (or else you might compare garbage past the end of the strings, which may be an illegal memory access that invokes undefined behavior). But then why not just do the comparisons while walking over the characters one by one? With C++ strings, you can probably get the length and .c_str(), cast to a long * and compare a prefix of length .size() - .size() % sizeof(long). Looks a bit fishy to me, tho.Seasonseasonable
@JonasKölker - loading the string into an array of longs just for comparison's sake would be foolish. But if you're doing it "a lot" - I could see a possible argument for it to be done.Reasonless
Please don't try to “fix” the grammar - especially by removing the apostrophe on “STL’s” that's not a plural: it's a possessiveReasonless
5

Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that the invariant version uses the invariant culture's casing rules rather than the current culture's, so its result does not change with the user's locale. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant; otherwise the performance of invariant conversion shouldn't matter.

I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.

Diary answered 24/10, 2008 at 17:56 Comment(2)
If Microsoft has optimized the code for uppercase comparisons, is it because the ASCII codes for uppercase letters (65-90) have only two digits, while the ASCII codes for lowercase letters (97-122) run to three digits and so need more processing?Blind
@Medo I don't remember the exact reasons for optimization, but 2 vs 3 digits is almost certainly not the reason since all letters are stored as binary numbers, so decimal digits don't really have meaning based on the way they are stored.Diary
4

If you are doing string comparison in C# it is significantly faster to use .Equals() with a case-insensitive StringComparison option instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that no extra memory is allocated for the 2 new upper/lower case strings.

Vestiary answered 24/10, 2008 at 17:56 Comment(3)
And as a bonus, if you pick the right options it will actually give you the correct results :)Huge
@JonSkeet In your answer, you suggested using StringComparer. Is that superior, in terms of performance, to using Equals(...)?Nullipore
@Abdul: I haven't measured performance of them, and wouldn't want to guess.Huge
3

I wanted some actual data on this, so I pulled the full list of two-byte characters that fail with ToLower or ToUpper. I then ran the test below:

using System;

class Program {
   static void Main() {
      // Character pairs that fail to compare equal under ToLower or ToUpper.
      char[][] pairs = {
new[]{'\u00E5','\u212B'},new[]{'\u00C5','\u212B'},new[]{'\u0399','\u1FBE'},
new[]{'\u03B9','\u1FBE'},new[]{'\u03B2','\u03D0'},new[]{'\u03B5','\u03F5'},
new[]{'\u03B8','\u03D1'},new[]{'\u03B8','\u03F4'},new[]{'\u03D1','\u03F4'},
new[]{'\u03B9','\u1FBE'},new[]{'\u0345','\u03B9'},new[]{'\u0345','\u1FBE'},
new[]{'\u03BA','\u03F0'},new[]{'\u00B5','\u03BC'},new[]{'\u03C0','\u03D6'},
new[]{'\u03C1','\u03F1'},new[]{'\u03C2','\u03C3'},new[]{'\u03C6','\u03D5'},
new[]{'\u03C9','\u2126'},new[]{'\u0392','\u03D0'},new[]{'\u0395','\u03F5'},
new[]{'\u03D1','\u03F4'},new[]{'\u0398','\u03D1'},new[]{'\u0398','\u03F4'},
new[]{'\u0345','\u1FBE'},new[]{'\u0345','\u0399'},new[]{'\u0399','\u1FBE'},
new[]{'\u039A','\u03F0'},new[]{'\u00B5','\u039C'},new[]{'\u03A0','\u03D6'},
new[]{'\u03A1','\u03F1'},new[]{'\u03A3','\u03C2'},new[]{'\u03A6','\u03D5'},
new[]{'\u03A9','\u2126'},new[]{'\u0398','\u03F4'},new[]{'\u03B8','\u03F4'},
new[]{'\u03B8','\u03D1'},new[]{'\u0398','\u03D1'},new[]{'\u0432','\u1C80'},
new[]{'\u0434','\u1C81'},new[]{'\u043E','\u1C82'},new[]{'\u0441','\u1C83'},
new[]{'\u0442','\u1C84'},new[]{'\u0442','\u1C85'},new[]{'\u1C84','\u1C85'},
new[]{'\u044A','\u1C86'},new[]{'\u0412','\u1C80'},new[]{'\u0414','\u1C81'},
new[]{'\u041E','\u1C82'},new[]{'\u0421','\u1C83'},new[]{'\u1C84','\u1C85'},
new[]{'\u0422','\u1C84'},new[]{'\u0422','\u1C85'},new[]{'\u042A','\u1C86'},
new[]{'\u0463','\u1C87'},new[]{'\u0462','\u1C87'}
      };
      int upper = 0, lower = 0;
      foreach (char[] pair in pairs) {
         Console.Write(
            "U+{0:X4} U+{1:X4} pass: ",
            Convert.ToInt32(pair[0]),
            Convert.ToInt32(pair[1])
         );
         // The pair "passes" ToUpper if both characters upper-case to the same value.
         if (Char.ToUpper(pair[0]) == Char.ToUpper(pair[1])) {
            Console.Write("ToUpper ");
            upper++;
         } else {
            Console.Write("        ");
         }
         // Likewise for ToLower.
         if (Char.ToLower(pair[0]) == Char.ToLower(pair[1])) {
            Console.Write("ToLower");
            lower++;
         }
         Console.WriteLine();
      }
      Console.WriteLine("upper pass: {0}, lower pass: {1}", upper, lower);
   }
}

Result below. Note I also tested with the Invariant versions, and the result was exactly the same. Interestingly, one of the pairs (U+03D1 and U+03F4, which appears twice in the list) fails with both. But based on this, ToUpper is the best option.

U+00E5 U+212B pass:         ToLower
U+00C5 U+212B pass:         ToLower
U+0399 U+1FBE pass: ToUpper
U+03B9 U+1FBE pass: ToUpper
U+03B2 U+03D0 pass: ToUpper
U+03B5 U+03F5 pass: ToUpper
U+03B8 U+03D1 pass: ToUpper
U+03B8 U+03F4 pass:         ToLower
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: ToUpper
U+0345 U+03B9 pass: ToUpper
U+0345 U+1FBE pass: ToUpper
U+03BA U+03F0 pass: ToUpper
U+00B5 U+03BC pass: ToUpper
U+03C0 U+03D6 pass: ToUpper
U+03C1 U+03F1 pass: ToUpper
U+03C2 U+03C3 pass: ToUpper
U+03C6 U+03D5 pass: ToUpper
U+03C9 U+2126 pass:         ToLower
U+0392 U+03D0 pass: ToUpper
U+0395 U+03F5 pass: ToUpper
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: ToUpper
U+0398 U+03F4 pass:         ToLower
U+0345 U+1FBE pass: ToUpper
U+0345 U+0399 pass: ToUpper
U+0399 U+1FBE pass: ToUpper
U+039A U+03F0 pass: ToUpper
U+00B5 U+039C pass: ToUpper
U+03A0 U+03D6 pass: ToUpper
U+03A1 U+03F1 pass: ToUpper
U+03A3 U+03C2 pass: ToUpper
U+03A6 U+03D5 pass: ToUpper
U+03A9 U+2126 pass:         ToLower
U+0398 U+03F4 pass:         ToLower
U+03B8 U+03F4 pass:         ToLower
U+03B8 U+03D1 pass: ToUpper
U+0398 U+03D1 pass: ToUpper
U+0432 U+1C80 pass: ToUpper
U+0434 U+1C81 pass: ToUpper
U+043E U+1C82 pass: ToUpper
U+0441 U+1C83 pass: ToUpper
U+0442 U+1C84 pass: ToUpper
U+0442 U+1C85 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+044A U+1C86 pass: ToUpper
U+0412 U+1C80 pass: ToUpper
U+0414 U+1C81 pass: ToUpper
U+041E U+1C82 pass: ToUpper
U+0421 U+1C83 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+0422 U+1C84 pass: ToUpper
U+0422 U+1C85 pass: ToUpper
U+042A U+1C86 pass: ToUpper
U+0463 U+1C87 pass: ToUpper
U+0462 U+1C87 pass: ToUpper
upper pass: 46, lower pass: 8
Forecourt answered 25/12, 2020 at 19:31 Comment(0)
0

It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.
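
To illustrate the ASCII case: upper- and lower-case letters differ only in bit 0x20, so either direction is a range check plus one bit operation. AsciiCaseSketch and its method names are hypothetical, and this sketch deliberately handles only A-Z and a-z.

```csharp
using System;

static class AsciiCaseSketch
{
    // 'A'..'Z' are 65..90 and 'a'..'z' are 97..122; the difference is bit 0x20.
    public static char ToLowerAscii(char c) =>
        (c >= 'A' && c <= 'Z') ? (char)(c | 0x20) : c;   // set the bit

    public static char ToUpperAscii(char c) =>
        (c >= 'a' && c <= 'z') ? (char)(c & ~0x20) : c;  // clear the bit

    static void Main()
    {
        Console.WriteLine(ToLowerAscii('Q')); // q
        Console.WriteLine(ToUpperAscii('q')); // Q
        Console.WriteLine(ToUpperAscii('7')); // 7 (non-letters are left alone)
    }
}
```

Both directions do the same amount of work per character, which is why neither is inherently faster for ASCII text.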

Limen answered 24/10, 2008 at 17:52 Comment(0)
0

Doing it right, there should be a small, insignificant speed advantage if you convert to lower case, but this is, as many have hinted, culture dependent and is not inherent in the function but in the strings you convert (lots of lower case letters means few assignments to memory) -- converting to upper case is faster if you have a string with lots of upper case letters.

Selhorst answered 4/6, 2010 at 15:48 Comment(0)
