STRING_AGG not behaving as expected
Asked Answered
M

1

18

I have the following query:

WITH cteCountryLanguageMapping AS (
    SELECT * FROM (
        VALUES
            ('Spain', 'English'),
            ('Spain', 'Spanish'),
            ('Sweden', 'English'),
            ('Switzerland', 'English'),
            ('Switzerland', 'French'),
            ('Switzerland', 'German'),
            ('Switzerland', 'Italian')
    ) x ([Country], [Language])
)
SELECT
    [Country],
    CASE COUNT([Language])
        WHEN 1 THEN MAX([Language])
        WHEN 2 THEN STRING_AGG([Language], ' and ')
        ELSE STRING_AGG([Language], ', ')
    END AS [Languages],
    COUNT([Language]) AS [LanguageCount]
FROM cteCountryLanguageMapping
GROUP BY [Country]

I was expecting the value inside Languages column for Switzerland to be comma separated i.e.:

  | Country     | Languages                                 | LanguageCount
--+-------------+-------------------------------------------+--------------
1 | Spain       | Spanish and English                       | 2
2 | Sweden      | English                                   | 1
3 | Switzerland | French, German, Italian, English          | 4

Instead I am getting the below output (the 4 values are separated by and):

  | Country     | Languages                                 | LanguageCount
--+-------------+-------------------------------------------+--------------
1 | Spain       | Spanish and English                       | 2
2 | Sweden      | English                                   | 1
3 | Switzerland | French and German and Italian and English | 4

What am I missing?


Here is another example:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG(z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP by y

  | y | STRING_AGG_PLUS | STRING_AGG_MINUS
--+---+-----------------+-----------------
1 | 1 | a+b             | a+b

Is this a bug in SQL Server?

Mitchelmitchell answered 27/9, 2018 at 9:14 Comment(7)
Seems to be a bug, it's always returning the 1st STRING_AGG no matter how you write the CASEChisolm
This is a beauty of an optimizer bug. Simpler and more striking repro: CASE COUNT([Language]) WHEN 1234567 THEN STRING_AGG([Language], ' and ') ELSE STRING_AGG([Language], ', ') END AS [Languages] (uses the 1234567 case) and CASE COUNT([Language]) WHEN 1234567 THEN STRING_AGG([Language], ' and ') END AS [Languages] (leave out ELSE -- now the match fails and the expression becomes NULL). No matter what the "correct" result should be, surely that's not it.Contemn
Not sure this is appropriate, but....LOL! Since this is not a lack of knowledge on your part, rather than a real thing that many would come across, I strongly recommend changing the question title to be more accurate than a generic "not working as intended", in order to maximize benefit for all.Strait
In the resulting execution plan, the second STRING_AGG is entirely missing and instead the expression is rebound to the first one, as if the CASE had said STRING_AGG([Language], ' and ') twice. Any subsequent CASEs are absorbed as well. Looks like something very weird is going on with subexpression elimination.Contemn
This bug seems to be specifically tuned to STRING_AGG. If the ELSE is changed to 'blargh' + STRING_AGG(...), you'll get 'blarghFrench and German..., so it improperly unifies the second STRING_AGG with the first. The simplest workaround is to change the ELSE expression to STRING_AGG([Language] + '', ', ') -- this defeats CSE, suggesting there's a bug where CSE ignores the second argument to STRING_AGG.Contemn
@JeroenMostert That workaround sounds like it would make a good answer.Parallelism
I'd like to add that the query works as expected in Azure SQL Database so hopefully a fix for the on-prem is in the works.Stationer
C
20

Yes, this is a Bug (tm), present in all versions of SQL Server 2017 (as of writing). It's fixed in Azure SQL Server and 2019 RC1. Specifically, the part in the optimizer that performs common subexpression elimination (ensuring that we don't calculate expressions more than necessary) improperly considers all expressions of the form STRING_AGG(x, <separator>) identical as long as x matches, no matter what <separator> is, and unifies these with the first calculated expression in the query.

One workaround is to make sure x does not match by performing some sort of (near-)identity transformation on it. Since we're dealing with strings, concatenating an empty one will do:

SELECT y, STRING_AGG(z, '+') AS STRING_AGG_PLUS, STRING_AGG('' + z, '-') AS STRING_AGG_MINUS
FROM (
    VALUES
        (1, 'a'),
        (1, 'b')
) x (y, z)
GROUP by y
Contemn answered 27/9, 2018 at 9:52 Comment(3)
I allowed myself to post this workaround on Azure FeedbackNecker
dba.stackexchange.com/a/257819/23463 indicates it is still broken, even in CU17.Frum
@RossPresser: retested it and sure enough, CU17 doesn't actually fix this. Amended the answer.Contemn

© 2022 - 2024 — McMap. All rights reserved.