STRING_AGG not behaving as expected

I have the following query:

WITH cteCountryLanguageMapping AS (
    SELECT * FROM (
        VALUES
            ('Spain', 'English'),
            ('Spain', 'Spanish'),
            ('Sweden', 'English'),
            ('Switzerland', 'English'),
            ('Switzerland', 'French'),
            ('Switzerland', 'German'),
            ('Switzerland', 'Italian')
    ) x ([Country], [Language])
)
SELECT
    [Country],
    CASE COUNT([Language])
        WHEN 1 THEN MAX([Language])
        WHEN 2 THEN STRING_AGG([Language], ' and ')
        ELSE STRING_AGG([Language], ', ')
    END AS [Languages],
    COUNT([Language]) AS [LanguageCount]
FROM cteCountryLanguageMapping
GROUP BY [Country]

I was expecting the value inside Languages column for Switzerland to be comma separated i.e.:

  | Country     | Languages                                 | LanguageCount
--+-------------+-------------------------------------------+--------------
1 | Spain       | Spanish and English                       | 2
2 | Sweden      | English                                   | 1
3 | Switzerland | French, German, Italian, English          | 4

Instead I am getting the below output (the 4 values are separated by and):

  | Country     | Languages                                 | LanguageCount
--+-------------+-------------------------------------------+--------------
1 | Spain       | Spanish and English                       | 2
2 | Sweden      | English                                   | 1
3 | Switzerland | French and German and Italian and English | 4

What am I missing?

Here is another example:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG(z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP by y

  | y | STRING_AGG_PLUS | STRING_AGG_MINUS
--+---+-----------------+-----------------
1 | 1 | a+b             | a+b

Is this a bug in SQL Server?

Yes, this is a Bug (tm), present in all versions of SQL Server 2017 (as of writing). It's fixed in Azure SQL Server and 2019 RC1. Specifically, the part in the optimizer that performs common subexpression elimination (ensuring that we don't calculate expressions more than necessary) improperly considers all expressions of the form STRING_AGG(x, <separator>) identical as long as x matches, no matter what <separator> is, and unifies these with the first calculated expression in the query.

One workaround is to make sure x does not match by performing some sort of (near-)identity transformation on it. Since we're dealing with strings, concatenating an empty one will do:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG('' + z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP by y

Recommended topics

Hot tags