What does it mean that string and character comparisons in Swift are not locale-sensitive?

I've started learning Swift and I'm curious: what does it mean that string and character comparisons in Swift are not locale-sensitive? Does it mean that all characters are stored in Swift as UTF-8?

Reld answered 7/9, 2014 at 19:30 Comment(0)
U
29

(All code examples updated for Swift 3 now.)

Comparing Swift strings with < does a lexicographical comparison based on the so-called "Unicode Normalization Form D" (which can be computed with decomposedStringWithCanonicalMapping).

For example, the decomposition of

"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS

is the sequence of two Unicode code points

U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
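As a quick check of this decomposition, here is a minimal sketch, assuming Foundation is available (it provides decomposedStringWithCanonicalMapping). The variable names are only illustrative.

```swift
import Foundation

// Precomposed "ä" (U+00E4) and its canonical decomposition (NFD).
let precomposed = "\u{E4}"
let decomposed = precomposed.decomposedStringWithCanonicalMapping

// Collect the Unicode scalar values of each form.
let before = precomposed.unicodeScalars.map { $0.value }  // one scalar: 0x00E4
let after = decomposed.unicodeScalars.map { $0.value }    // two scalars: 0x0061, 0x0308

print(before.count, after.count)

// Swift treats canonically equivalent strings as equal,
// even though their scalar sequences differ.
print(precomposed == decomposed)
```

The two forms hold different scalar sequences but compare equal with ==, which is why the decomposed form is a convenient way to reason about comparisons.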

For demonstration purposes, I have written a small String extension which dumps the contents of the String as an array of Unicode code points:

extension String {
    var unicodeData : String {
        return self.unicodeScalars.map {
            String(format: "%04X", $0.value)
            }.joined(separator: ",")
    }
}

Now let's take some strings and sort them with <:

let someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"].sorted()
print(someStrings)
// ["a", "ã", "ă", "ä", "ǟ", "b"]

and dump the Unicode code points of each string (in original and decomposed form) in the sorted array:

for str in someStrings {
    print("\(str)  \(str.unicodeData)  \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}

The output

a  0061  0061
ã  00E3  0061,0303
ă  0103  0061,0306
ä  00E4  0061,0308
ǟ  01DF  0061,0308,0304
b  0062  0062

nicely shows that the comparison is done by a lexicographic ordering of the Unicode code points in the decomposed form.

This is also true for strings of more than one character, as the following example shows. With

let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()

the output of the above loop is

äx  00E4,0078  0061,0308,0078
ǟx  01DF,0078  0061,0308,0304,0078
ǟψ  01DF,03C8  0061,0308,0304,03C8
äψ  00E4,03C8  0061,0308,03C8

which means that

"äx" < "ǟx", but "äψ" > "ǟψ"

(which was at least unexpected for me).
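To make the rule explicit, here is a small sketch that models the ordering described above by hand. The helper precedesByDecomposedScalars is hypothetical (it is not a standard library API) and it does not call Swift's own <; it only decomposes both strings and compares their scalar values lexicographically.

```swift
import Foundation

// Hypothetical helper, not part of the standard library: returns true if
// `lhs` sorts before `rhs` when both are decomposed (NFD) and their
// Unicode scalar values are compared lexicographically.
func precedesByDecomposedScalars(_ lhs: String, _ rhs: String) -> Bool {
    let l = lhs.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
    let r = rhs.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
    return l.lexicographicallyPrecedes(r)
}

// "äx" vs "ǟx": 0061,0308,0078 vs 0061,0308,0304,0078
// First difference at the third scalar: 0078 < 0304.
print(precedesByDecomposedScalars("\u{E4}x", "\u{1DF}x"))              // true

// "ǟψ" vs "äψ": 0061,0308,0304,03C8 vs 0061,0308,03C8
// First difference at the third scalar: 0304 < 03C8.
print(precedesByDecomposedScalars("\u{1DF}\u{3C8}", "\u{E4}\u{3C8}"))  // true
```

The flip between the two pairs comes from the character that follows: "x" (0078) is smaller than the combining macron (0304), while "ψ" (03C8) is larger.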

Finally, let's compare this with a locale-sensitive ordering, for example Swedish:

let locale = Locale(identifier: "sv") // svenska
var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
someStrings.sort {
    $0.compare($1, locale: locale) == .orderedAscending
}

print(someStrings)
// ["a", "ă", "ã", "b", "ä", "ǟ"]

As you see, the result is different from the Swift < sorting.

Ungenerous answered 10/9, 2014 at 21:7 Comment(8)
Addition/details ("official" ref.: from open source): from the String.swift source code we can see that e.g. the < operator for String is defined as lhs._compareString(rhs) < 0 (which itself uses _swift_stdlib_unicode_compare_utf8_utf8), which we can track via github.com/apple/swift/blob/master/stdlib/public/stubs/… to ucol_strcollIter (see MakeRootCollator for collator settings) from the ICU library; i.e., it uses the Unicode collation algorithm.Chemotherapy
... (link to relevant ICU lib)Chemotherapy
@dfri: Thanks for providing the links, much appreciated. I think it was also mentioned somewhere in Apple's documentation, but I cannot find it anymore.Ungenerous
Happy to help. I also recall I've seen some mention of this in the docs, but had no success when I tried to find it earlier today. Seems like the Swift reference docs change quicker than the stdlib itself, and only by ninja-edits :)Chemotherapy
Your last example confused me as I am Swedish and the order should be [a, b, å, ä] (not sure about ă and ã but I guess they should be between a and b) since å and ä are separate letters that come after z in the Swedish alphabet. After some time I realised that you entered locale identifier "se", which is the country code for Sweden but the language code for Northern Sami. The correct language code for Swedish is "sv" :)Zebadiah
@LoPoBo: I apologize to all Swedish speaking people! With language code "sv" the result is ["a", "ă", "ã", "b", "ä", "ǟ"] – does that look correct? I will update the answer tonight (and update it for Swift 3 as well). Thanks for your feedback, much appreciated!Ungenerous
Yes that seems correct. When I wrote [a, b, å, ä] I had mistaken ǟ for å (they look quite similar in the code font). Only a, b and ä are regular letters of the Swedish alphabet, but the order of the other ones seems logical.Zebadiah
Does the < operator guarantee transitivity (a < b and b < c implies a < c)? I ran into an example where this breaks, and I posted the example here #46230971 . From your explanation I think it should guarantee transitivity but obviously this is not the case.Sartorial
M
1

Changing the locale can change the alphabetical order: for example, a case-sensitive comparison can appear case-insensitive because of the locale, or, more generally, two strings can sort in a different order.

Mayence answered 7/9, 2014 at 19:36 Comment(2)
Does it mean that Swift stores its own table of all possible characters, or does it use a standard like Unicode?Reld
No, it doesn't mean that. It means the same as setting LC_ALL=C which means that we're comparing pure byte-values.Mayence
C
1

Lexicographical ordering and locale-sensitive ordering can be different. You can see an example of it in this question: Sorting scala list equivalent to C# without changing C# order

In that specific case the locale-sensitive ordering placed _ before 1, whereas in a lexicographical ordering it's the opposite.

Swift comparison uses lexicographical ordering.
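A minimal sketch of that difference, using the "_" and "1" pair from the linked question. The locale identifier "en_US" is an assumption chosen only for illustration, and the locale-sensitive result depends on that locale's collation rules, so no output is claimed for it here.

```swift
import Foundation

let underscore = "_"   // U+005F
let one = "1"          // U+0031

// Swift's < on these ASCII strings follows code point order: 0x31 < 0x5F.
print(one < underscore)   // true

// Locale-sensitive comparison via Foundation. The outcome depends on the
// collation rules of the chosen locale ("en_US" is an assumed example).
let locale = Locale(identifier: "en_US")
let result = underscore.compare(one, locale: locale)
print(result == .orderedAscending ? "_ sorts before 1" : "_ does not sort before 1")
```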

Curley answered 7/9, 2014 at 19:49 Comment(2)
Does lexicographical ordering mean alphabetical order? What about characters from different alphabets (each country has its own alphabet): how does it know which characters come first?Reld
@GabrielePetronella: That's what I thought as well, but all the expressions "a" < "ä", "ä" < "b" and ClosedInterval("a", "b").contains("ä") return true in my test project.Ungenerous
