ToUpperInvariant() – is MSDN wrong on its recommendation?

In Best Practices for Using Strings in the .NET Framework, StringComparison.OrdinalIgnoreCase is recommended for case-insensitive file paths. (Let's call it Statement A.)

I can agree with that, because I can create two files in the same directory:

é.txt
é.txt

Their file names are not the same: the second one is composed of e plus a combining accent, so its "é" is actually two characters. (You can try it yourself using copy-paste.)

If an invariant-culture comparison (rather than an ordinal one) were in effect, NTFS wouldn't allow both files, because the same article explains that in the invariant culture a + ̊ = å.
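A minimal sketch of that difference, assuming the two names differ only in how "é" is encoded (U+00E9 versus e followed by U+0301). The class and member names are hypothetical and exist only for illustration:

```csharp
using System;

public static class CompositionDemo
{
    // Precomposed U+00E9 versus 'e' followed by U+0301 (combining acute accent).
    public const string Precomposed = "\u00E9.txt";
    public const string Decomposed = "e\u0301.txt";

    // Ordinal comparison looks at the UTF-16 code units, so the names differ.
    public static bool OrdinalEqual() =>
        string.Equals(Precomposed, Decomposed, StringComparison.Ordinal);

    // Ignoring case does not help: the difference is composition, not case.
    public static bool OrdinalIgnoreCaseEqual() =>
        string.Equals(Precomposed, Decomposed, StringComparison.OrdinalIgnoreCase);

    // Linguistic comparison under the invariant culture.
    public static bool InvariantEqual() =>
        string.Equals(Precomposed, Decomposed, StringComparison.InvariantCulture);
}
```

Both ordinal methods return false. InvariantEqual() is expected to return true on a default configuration, which is the behavior the article describes. If the runtime is switched to invariant globalization mode, culture-sensitive comparisons fall back to ordinal behavior, so the result may differ there.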

But the article on String.ToUpperInvariant() gives a different recommendation (call it Statement B):

If you need the lowercase or uppercase version of an operating system identifier, such as a file name, named pipe, or registry key, use the ToLowerInvariant or ToUpperInvariant methods.

I need to build a collection of file paths (in fact a HashSet) to detect duplicates. If I follow Statement B when building the set, I could end up with false positives, because the above-mentioned file names é.txt and é.txt would be considered the same. Am I understanding correctly that Statement B in MSDN is misleading? Or am I missing something?

I'm about to build a library, preferably without known bugs from the start, so I don't want to neglect this.

Update:

Statement B seems to have one more issue: ToLowerInvariant() should not actually be used. The Best Practices article says: "DO: Use ToUpperInvariant rather than ToLowerInvariant when normalizing strings for comparison." The underlying reason: there is a small range of characters that do not roundtrip, and going to lowercase will make these characters unavailable. (source)

Huneycutt answered 23/9, 2015 at 13:4 Comment(3)
I am not entirely sure "the lowercase or uppercase version of an operating system identifier" is meant to be the same as "an unambiguous mapping of an operating system identifier to a lowercase or uppercase version". It could also mean "a mapping of an operating system identifier to a non-unique lowercase or uppercase version that will work the same way regardless of the system's locale". – Manpower
OT, but who knows what your library does: NTFS also allows :, * or ? in file names. It's just Windows that doesn't support them. It's quite easy to create such files on NTFS under Linux. – Werner
@O.R.Mapper – a good way of reading that statement... In this context it looks logical. On the other hand, they could either leave out the mention of file names or add a short note on (non-)uniqueness. – Huneycutt

Neither uppercasing nor lowercasing is correct when you want to compare strings for equality case-insensitively. In both variants there are characters that mess this up.

The correct way to compare strings case-insensitively is to use one of the case-insensitive StringComparison options (you know that).

The right way to use a data structure case-insensitively is to use one of StringComparer.*IgnoreCase. For example:

new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)

Do not uppercase strings before adding them to a data structure. I would fail that in any code review.
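Which comparer to pass depends on what should count as a duplicate. The following is a rough sketch, not a definitive recipe: the sample names and the PathSetDemo helper are hypothetical, and it only counts how many names remain distinct under a given comparer.

```csharp
using System;
using System.Collections.Generic;

public static class PathSetDemo
{
    // Hypothetical sample names: the first two differ only in ASCII case;
    // the last two spell "é" as U+00E9 and as 'e' + U+0301 respectively.
    public static readonly string[] Names =
    {
        "Readme.txt",
        "README.TXT",
        "\u00E9.txt",
        "e\u0301.txt",
    };

    // Counts how many names survive as distinct keys under the given comparer.
    public static int CountDistinct(IEqualityComparer<string> comparer)
    {
        var set = new HashSet<string>(comparer);
        foreach (var name in Names)
        {
            set.Add(name); // Add returns false for a duplicate; ignored here
        }
        return set.Count;
    }
}
```

With these names, CountDistinct(StringComparer.Ordinal) yields 4 and CountDistinct(StringComparer.OrdinalIgnoreCase) yields 3, because only the case variants collapse. StringComparer.InvariantCultureIgnoreCase is expected to additionally merge the two "é" spellings on a default configuration, which matters if the set is meant to mirror what the file system treats as separate files.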

"If you need the lowercase or uppercase version of an operating system identifier"

You do not need such a thing. This statement does not apply to your case.

Helbona answered 23/9, 2015 at 13:31 Comment(5)
So in the case of NTFS file names, this means new HashSet<string>(StringComparer.OrdinalIgnoreCase) (or just StringComparer.Ordinal, depending on how NTFS case sensitivity is configured in the specific case). – Huneycutt
I don't know what kind of comparison NTFS uses. It can be configured. There is a hidden file on each NTFS volume that stores the Unicode case mapping table. I guess it could be arbitrary. Not sure what it is in practice. – Helbona
Yes, I know that... It means we might actually need something like an NtfsIgnoreCase comparison, based on the content of that hidden $UpCase file :) – Huneycutt
See this answer of mine (in short: use OrdinalIgnoreCase for file names). – Melburn
@LucasTrzesniewski – I've actually seen it :) and also this and this noteworthy Q&A. In the end I used Dictionary(Of T1, T2)(StringComparer.OrdinalIgnoreCase) for my specific need. – Huneycutt
