String StartsWith() issue with Danish text

Can anyone explain this behaviour?

var culture = new CultureInfo("da-DK");
Thread.CurrentThread.CurrentCulture = culture;
"daab".StartsWith("da"); //false

I know that it can be fixed by specifying StringComparison.InvariantCulture. But I'm just confused by the behavior.

I also know that "aA" and "AA" are not considered the same in a Danish case-insensitive comparison (see http://msdn.microsoft.com/en-us/library/xk2wykcz.aspx), which explains this:

String.Compare("aA", "AA", new CultureInfo("da-DK"), CompareOptions.IgnoreCase) // -1 (not equal)

Is this linked to the behavior of the first code snippet?

Maximo answered 30/6, 2011 at 13:57 Comment(5)
It seems that the second a gives the first one a different context, so aa is basically treated as one entity. But I can't tell whether it's a bug or a feature, because I don't know the Danish language. - Nondisjunction
Right. See the Wikipedia article about the Danish/Norwegian alphabet, especially the "History" part: en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet - Utopia
I agree. "aa" in Danish ("å" in modern Danish) is a different letter from "a", therefore "daab" doesn't start with "da", just as "dåb" doesn't start with "da". (You'll have to check whether "å" is the same as "aa"; in theory it should be.) - Bismarck
"daab".StartsWith("då") also returns false... apparently the Danish language works in mysterious ways, unless it's the .NET Framework ;) - Christianson
In Danish, ae = æ, oe = ø, aa = å. Æ Ø Å (written here in alphabetical order) are the only three special characters in Danish. ae, oe and aa are remnants from the past and are never used in everyday language, only in proper nouns. More importantly, the letter pairs can also form a word of their own, e.g. 'ae' means 'stroke/pat'. And I'm pretty sure they also occur inside words where they do not represent æ, ø or å, but I can't remember one of these words right now. I'll try to look for one. - Andorra

As Nappy said, it's a feature of the Danish language, where "aa" and "å" are still treated as the same letter. Danish has two more letters, æ and ø, but I am not sure whether they can be written using two letters as well.

I think that in the second example "aA" is left unchanged while "AA" is converted to "Å". To confuse things even more, "Aa" is considered equal to "AA" and "aa" only in a case-insensitive comparison.

Jokester answered 30/6, 2011 at 14:28 Comment(3)
What I personally don't understand is why this is relevant in string comparisons; for example, oe is not considered the same as ö in the German de-DE culture. So why is it so important in Danish? - Nondisjunction
I'm not an expert on Danish or German, but I think aa is still very common in Danish. I know that in Norwegian aa is used only rarely, in old family and place names. - Jokester
I was recently stumped by this exact curiosity ([#15548163), and Martin Brendan is correct: many surnames, and even some of the largest cities, still use "aa" (e.g. Aarhus, Aalborg). Without it, a search using "aa" wouldn't return my last name :) - Enchase

Here is a test that illustrates the problem. daab and dåb (the same word in the old and the modern spelling, respectively) mean baptism/christening.

using System.Globalization;
using System.Threading;
using Xunit;

public class can_handle_remnant_of_danish_language
{
    [Fact]
    public void daab_start_with_då()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("då")); // Fails
    }

    [Fact]
    public void daab_start_with_da()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("da")); // Fails
    }

    [Fact]
    public void daab_start_with_daa()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("daa")); // Succeeds
    }

    [Fact]
    public void dåb_start_with_daa()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("daa")); // Fails
    }

    [Fact]
    public void dåb_start_with_da()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("da")); // Fails
    }

    [Fact]
    public void dåb_start_with_då()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("då")); // Succeeds
    }
}

All of the above tests should succeed according to my understanding of the language, and I'm Danish! I don't have a degree in grammar, though. :-)

Seems like a bug to me.

Andorra answered 1/7, 2011 at 7:57 Comment(1)
Even more inconsistent: 'foer' means both 'lining' and, as an old spelling of 'før', 'before'. Given the same setup as above, the test 'foer_start_with_fo' does not fail in the way that 'daab_start_with_da' does. - Andorra

The modern spelling of "baptism" in Danish, namely dåb, is certainly not considered to start with da by a Danish speaker. If daab is supposed to be an old-fashioned spelling of dåb, it is a bit philosophical whether it starts with da or not. But for (modern) collation purposes it does not: alphabetically, daab sorts after disk, not before.
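To see the collation point for yourself, here is a minimal sketch that sorts the same words once by raw code units and once with a culture's rules. The class and method names (CollationSketch, SortOrdinal, SortForCulture) are made up for illustration. The culture-aware order depends on the globalization data of your runtime (NLS or ICU), so no output is claimed for it here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of .NET.
public static class CollationSketch
{
    // Sorts by UTF-16 code unit values; no language rules are applied.
    public static string[] SortOrdinal(IEnumerable<string> words)
    {
        var list = new List<string>(words);
        list.Sort(StringComparer.Ordinal);
        return list.ToArray();
    }

    // Sorts with the collation rules of the named culture, e.g. "da-DK".
    // The result depends on the runtime's globalization data.
    public static string[] SortForCulture(IEnumerable<string> words, string cultureName)
    {
        var comparer = StringComparer.Create(new CultureInfo(cultureName), ignoreCase: false);
        var list = new List<string>(words);
        list.Sort(comparer);
        return list.ToArray();
    }
}
```

Calling CollationSketch.SortForCulture(new[] { "disk", "daab" }, "da-DK") and comparing it with SortOrdinal on the same input is a quick way to check whether your runtime treats aa as å when ordering.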

However, if your string is not supposed to represent natural language, but is instead some kind of technical code, like hexadecimal digits, surely you do not want to use any culture-specific rules. The solution here is not to use the invariant culture. The invariant culture has (English) rules itself!

Instead, you want to use ordinal comparison.

Ordinal comparison simply compares the strings char by char, without any assumptions of what sequences are "equivalent" in some sense. (Technical remark: Each char is a UTF-16 code unit, not a "character". Ordinal comparison is ignorant of the rules of Unicode normalization.)
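A small sketch of what that looks like in practice, assuming the strings are technical codes rather than natural language. OrdinalPrefixSketch and its two methods are hypothetical names; the only framework calls used are String.StartsWith(String, StringComparison) with StringComparison.Ordinal and StringComparison.OrdinalIgnoreCase.

```csharp
using System;

// Hypothetical helper, not part of .NET.
public static class OrdinalPrefixSketch
{
    // True when 'text' begins with exactly the UTF-16 code units of 'prefix'.
    // The current culture is never consulted.
    public static bool StartsWithOrdinal(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal);
    }

    // Example for a technical code: accepts "0x" or "0X" in front of hex digits.
    // OrdinalIgnoreCase ignores case but still applies no culture rules.
    public static bool HasHexPrefix(string text)
    {
        return text.StartsWith("0x", StringComparison.OrdinalIgnoreCase);
    }
}
```

With this, OrdinalPrefixSketch.StartsWithOrdinal("daab", "da") is true whatever Thread.CurrentThread.CurrentCulture is set to, while StartsWithOrdinal("ne\u0301e", "n\u00E9e") is false because the two strings contain different code units even though they render alike.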

I think the confusion arises because, by default, some string methods use a culture-aware comparison, and other string methods use the ordinal comparison.

The following examples all use a culture-aware comparison:

"Straße".StartsWith("Strasse", StringComparison.CurrentCulture)
"Straße".Equals("Strasse", StringComparison.CurrentCulture)
"ne\u0301e".StartsWith("née", StringComparison.CurrentCulture)
"ne\u0301e".Equals("née", StringComparison.CurrentCulture)

"Straße".StartsWith("Strasse")  // CurrentCulture is default for 'StartsWith'!
"ne\u0301e".StartsWith("née")   // CurrentCulture is default for 'StartsWith'!

Each of the above may depend on the .NET version as well! (As an example, the first one gives true if the current culture is the invariant culture and you are under .NET Framework 4.8; but it gives false if the current culture is the invariant culture and you use .NET 6.)

But these examples use ordinal comparison:

"Straße".StartsWith("Strasse", StringComparison.Ordinal)
"Straße".Equals("Strasse", StringComparison.Ordinal)
"ne\u0301e".StartsWith("née", StringComparison.Ordinal)
"ne\u0301e".Equals("née", StringComparison.Ordinal)

"Straße".Equals("Strasse")  // Ordinal is default for 'Equals'!
"ne\u0301e".Equals("née")   // Ordinal is default for 'Equals'!

So remember to check what the default comparison is for the string method you use, and specify the opposite one if needed. (Or always specify the comparison, even when redundant, if you prefer.)

Antonelli answered 27/4, 2022 at 14:39 Comment(0)
