Optimizing SUM OVER PARTITION BY for several hierarchical groups

Asked 24/5, 2018 at 19:29 Answered 24/5, 2018 at 21:4

Solved sql sql-server window-functions partition-by

I have a table like below:

Region    Country    Manufacturer    Brand    Period    Spend
R1        C1         M1              B1       2016      5
R1        C1         M1              B1       2017      10
R1        C1         M1              B1       2017      20
R1        C1         M1              B2       2016      15
R1        C1         M1              B3       2017      20
R1        C2         M1              B1       2017      5
R1        C2         M2              B4       2017      25
R1        C2         M2              B5       2017      30
R2        C3         M1              B1       2017      35
R2        C3         M2              B4       2017      40
R2        C3         M2              B5       2017      45

I need to find SUM([Spend] over different groups as follow:

Total Spend over all the rows in the whole table
Total Spend for each Region
Total Spend for each Region and Country group
Total Spend for each Region, Country and Advertiser group

So I wrote this query below:

SELECT 
    [Period]
    ,[Region]
    ,[Country]
    ,[Manufacturer]
    ,[Brand]
    ,SUM([Spend]) OVER (PARTITION BY [Period]) AS [SumOfSpendWorld]
    ,SUM([Spend]) OVER (PARTITION BY [Period], [Region]) AS [SumOfSpendRegion]
    ,SUM([Spend]) OVER (PARTITION BY [Period], [Region], [Country]) AS [SumOfSpendCountry]
    ,SUM([Spend]) OVER (PARTITION BY [Period], [Region], [Country], [Manufacturer]) AS [SumOfSpendManufacturer]
FROM myTable

But that query takes >15 minutes for a table of just 450K rows. I'd like to know if there is any way to optimize this performance. Thank you in advanced for your answers/suggestions!

Rodenhouse answered 24/5, 2018 at 19:29 Comment(8)

Have you tried 4 separate queries and joining their results? I'd imagine that would be faster. Also Year(period) is killing you here. Make that a separate column and then aggregate over that. – Fireweed 24/5, 2018 at 19:30

That doesn't seem right . . . unless some of those key columns are really, really, really big. – Granivorous 24/5, 2018 at 19:31

Why you are using year() as it is already as numeric values ? – Webb 24/5, 2018 at 19:32

@Daniel Marcus why do you belive that reading the whole table 4 times would be faster than reading it only once ? – Frymire 24/5, 2018 at 19:35

@YogeshSharma . . . The OP probably took a shortcut in describing the data in the question. I doubt the other columns have real data either. – Granivorous 24/5, 2018 at 19:36

@krokodilko, figured the window functions are expensive so it might make sense to split them up – Fireweed 24/5, 2018 at 19:42

@YogeshSharma I removed the YEAR([Period]) by pre-processing it in myTable, but it's still running for longer than 4 minutes now. – Rodenhouse 24/5, 2018 at 19:43

@user1330974, I think Yogesh's point was that period is already a numeric year value so no need to do anything to it – Fireweed 24/5, 2018 at 19:44

Your description of the problem suggests grouping sets to me:

SELECT YEAR([Period]) AS [Period], [Region], [Country], [Manufacturer], 
       SUM([Spend])
GROUP BY GROUPING SETS ( (YEAR([Period]),
                         (YEAR([Period]), [Region]),
                         (YEAR([Period]), [Region], [Country]), 
                         (YEAR([Period]), [Region], [Country], [Manufacturer])
                        );

I don't know if this will be faster, but it certainly seems more aligned with your question.

Granivorous answered 24/5, 2018 at 19:34 Comment(1)

This works! The query finishes in less than 10 secs!! :) Thank you very much! – Rodenhouse 24/5, 2018 at 20:54

Use cross apply here to speed the query up:

 SELECT 
     periodyear
    ,[Region]
    ,[Country]
    ,[Manufacturer]
    ,[Brand]
    ,SUM([Spend]) OVER (PARTITION BY  periodyear AS [SumOfSpendWorld]
    ,SUM([Spend]) OVER (PARTITION BY  periodyear, [Region]) AS [SumOfSpendRegion]
    ,SUM([Spend]) OVER (PARTITION BY  periodyear, [Region], [Country]) AS [SumOfSpendCountry]
    ,SUM([Spend]) OVER (PARTITION BY  periodyear, [Region], [Country], [Manufacturer]) AS [SumOfSpendManufacturer]
FROM myTable
  cross apply (select YEAR([Period]) periodyear) a

Fireweed answered 24/5, 2018 at 19:33 Comment(5)

@DanielMarcus . . . Can you give any insight whatsoever on why you think this might have any effect on performance? – Granivorous 24/5, 2018 at 19:35

@GordonLinoff, I would certainly defer to you on this - Do you disagree? My thought was using cross apply we would only have to parse year(period) once as opposed to doing it multiple times in each partition. – Fireweed 24/5, 2018 at 19:37

@DanielMarcus . . . And calling year() even many times on 480k rows is not going to result in 15 minutes for the query. That is just trivial, relative to the other work going one for the window functions. – Granivorous 24/5, 2018 at 19:39

@GordonLinoff good to know thanks. So why do you think the OP query is taking 15 min? – Fireweed 24/5, 2018 at 19:40

. . To be honest, I do not know. Given the constraints in the problem, it seems much too long. – Granivorous 25/5, 2018 at 3:6

Old school of SUM() OVER():

SELECT 
      [Period]
    , [Region]
    , [Country]
    , [Manufacturer]
    , [Brand]
    , (SELECT SUM([Spend]) FROM myTable t WHERE e.[Period] = t.[Period] GROUP BY [Period]) AS [SumOfSpendWorld]
    , (SELECT SUM([Spend]) FROM myTable t WHERE e.[Period] = t.[Period] AND e.Region = t.Region GROUP BY [Period], [Region] ) AS [SumOfSpendRegion]
    , (SELECT SUM([Spend]) FROM myTable t WHERE e.[Period] = t.[Period] AND e.Region = t.Region AND e.Country = t.Country GROUP BY [Period], [Region], [Country] ) AS [SumOfSpendCountry]
    , (SELECT SUM([Spend]) FROM myTable t WHERE e.[Period] = t.[Period] AND e.Region = t.Region AND e.Country = t.Country AND e.Manufacturer = t.Manufacturer GROUP BY [Period], [Region], [Country], [Manufacturer] ) AS [SumOfSpendManufacturer]
FROM myTable e

While this is not the elegant way to do it, but it gets the job done. I would highly recommend looking over the table and analyze it to see which alternative approaches would be best for your situation. If you feel it's a dead-end, then I would suggest using temp tables to make things faster. For instance, you could select the rows based on period and use bulk copy to insert them directly to the temp table, then do your magic. I've seen tables that forced me to use temp tables instead of a simple select query. Others forced me to extend the table into two tables.

So, it's not always going to be nice and clean !

I hope this would give you another insight that would help you in your journey.

Ginter answered 24/5, 2018 at 21:4 Comment(1)

Thank you. :) I'm now starting to realize that I cannot run this query LIVE (meaning on-demand by the user). – Rodenhouse 24/5, 2018 at 21:11

Recommended topics

Hot tags