Conditional aggregation performance
Let us have the following data

 IF OBJECT_ID('dbo.LogTable', 'U') IS NOT NULL  DROP TABLE dbo.LogTable

 SELECT TOP 100000 DATEADD(day, ( ABS(CHECKSUM(NEWID())) % 65530 ), 0) datesent 
 INTO [LogTable]
 FROM    sys.sysobjects
 CROSS JOIN sys.all_columns

I want to count the total number of rows, the number of last-year rows, and the number of last-ten-years rows. This can be achieved with a conditional aggregation query or with subqueries, as follows:

-- conditional aggregation query
SELECT
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,-1,GETDATE())
             THEN 1 ELSE 0 END) AS last_year_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,-10,GETDATE())
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable


-- subqueries
SELECT
(
    SELECT count(*) FROM LogTable 
) all_cnt, 
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-1,GETDATE())
) last_year_cnt,
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-10,GETDATE())
) last_ten_year_cnt

If you run the queries and look at the query plans, you see something like this:

(screenshot of the two query plans)

Clearly, the first solution has a much nicer query plan and cost estimate, and the SQL even looks more concise. However, if you measure the CPU time of the queries using SET STATISTICS TIME ON, I get the following results (measured several times with approximately the same outcome):

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 47 ms,  elapsed time = 41 ms.

(1 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 26 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Therefore, the second solution has slightly better (or at least equal) performance compared to the solution using conditional aggregation. The difference becomes more evident if we create an index on the datesent attribute.

CREATE INDEX ix_logtable_datesent ON dbo.LogTable(DateSent)

Then the second solution starts to use an Index Seek instead of a Table Scan, and its CPU time drops to 16 ms on my computer.

My questions are two: (1) why does the conditional aggregation solution not outperform the subquery solution, at least in the case without an index, and (2) is it possible to create an 'index' for the conditional aggregation solution (or rewrite the conditional aggregation query) in order to avoid the scan, or is conditional aggregation generally unsuitable when performance is a concern?

Sidenote: this scenario is actually quite favorable for conditional aggregation, since we also select the total number of rows, which always leads to a plan with a scan. If the total count is not needed, the indexed subquery solution involves no scan at all, whereas the conditional aggregation solution has to perform the scan anyway.
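To illustrate the sidenote, here is a sketch of the subquery variant without the total count; with the ix_logtable_datesent index created above, both subqueries should be answerable by Index Seeks alone, with no Table Scan:

```sql
-- Sketch (assumes ix_logtable_datesent exists): without COUNT(*) over the
-- whole table, each predicate can be satisfied by an Index Seek.
SELECT
(
    SELECT COUNT(*) FROM LogTable WHERE datesent > DATEADD(year, -1, GETDATE())
) AS last_year_cnt,
(
    SELECT COUNT(*) FROM LogTable WHERE datesent > DATEADD(year, -10, GETDATE())
) AS last_ten_year_cnt;
```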

EDIT

Vladimir Baranov has basically answered the first question (thank you very much). However, the second question remains. I see answers using conditional aggregation on StackOverflow quite often; they attract a lot of attention and are accepted as the most elegant and clear solution (and are sometimes proposed as the most efficient one). Therefore, I will slightly generalize the question:

Could you give me an example, where conditional aggregation notably outperforms the subquery solution?

For simplicity, let us assume that there are no physical reads (the data are in the buffer cache), since today's database servers keep most of their data in memory anyway.

Tern answered 21/8, 2017 at 11:25 Comment(4)
Here is a great article which delves into your problem in depth. It seems that with an index on the WHERE column, the separate subqueries outperform the conditional aggregations.Hardworking
But this doesn't mean that conditional aggregations are evil; there are some scenarios where they are needed to get an answer you need. Also, your performance would depend on the size of the tables, whether you can tolerate adding a certain index, etc.Hardworking
@TimBiegeleisen it is a nice blog post; however, it does not explain the difference. It raises even more questions for me, since he investigated the IO as well and found that the subquery solution has several times more IOs (which is also the case here). So where does the slowdown of conditional aggregation come from?Castro
It may be due to the difference between COUNT(*), where SQL Server just counts the rows, and SUM, where SQL Server adds either 1 or 0 for every row.Lathery

Short summary

  • Performance of subqueries method depends on the data distribution.
  • Performance of conditional aggregation does not depend on the data distribution.

Subqueries method can be faster or slower than conditional aggregation, it depends on the data distribution.

Naturally, if the table has a suitable index, then subqueries are likely to benefit from it, because the index allows scanning only the relevant part of the table instead of performing a full scan. A suitable index is unlikely to significantly benefit the conditional aggregation method, because it will scan the full index anyway. The only benefit would be if the index is narrower than the table, so that the engine has to read fewer pages into memory.

Knowing this you can decide which method to choose.


First test

I made a larger test table, with 5M rows. There were no indexes on the table. I measured the IO and CPU stats using SQL Sentry Plan Explorer. I used SQL Server 2014 SP1-CU7 (12.0.4459.0) Express 64-bit for these tests.

Indeed, your original queries behaved as you described, i.e. subqueries were faster even though the reads were 3 times higher.

After a few tries on a table without an index, I rewrote your conditional aggregate and added variables to hold the values of the DATEADD expressions.

Overall time became significantly faster.

Then I replaced SUM with COUNT and it became a little bit faster again.

After all, conditional aggregation became pretty much as fast as subqueries.

Warm the cache (CPU=375)

SELECT -- warm cache
    COUNT(*) AS all_cnt
FROM LogTable
OPTION (RECOMPILE);

Subqueries (CPU=1031)

SELECT -- subqueries
(
    SELECT count(*) FROM LogTable 
) all_cnt, 
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-1,GETDATE())
) last_year_cnt,
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,-10,GETDATE())
) last_ten_year_cnt
OPTION (RECOMPILE);

Original conditional aggregation (CPU=1641)

SELECT -- conditional original
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,-1,GETDATE())
             THEN 1 ELSE 0 END) AS last_year_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,-10,GETDATE())
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables (CPU=1078)

DECLARE @VarYear1 datetime = DATEADD(year,-1,GETDATE());
DECLARE @VarYear10 datetime = DATEADD(year,-10,GETDATE());

SELECT -- conditional variables
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > @VarYear1
             THEN 1 ELSE 0 END) AS last_year_cnt,
    SUM(CASE WHEN datesent > @VarYear10
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables and COUNT instead of SUM (CPU=1062)

DECLARE @VarYear1 datetime = DATEADD(year,-1,GETDATE());
DECLARE @VarYear10 datetime = DATEADD(year,-10,GETDATE());

SELECT -- conditional variable, count, not sum
    COUNT(*) AS all_cnt,
    COUNT(CASE WHEN datesent > @VarYear1
             THEN 1 ELSE NULL END) AS last_year_cnt,
    COUNT(CASE WHEN datesent > @VarYear10
             THEN 1 ELSE NULL END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

(screenshot of the timings)

Based on these results, my guess is that CASE invoked DATEADD for each row, while the WHERE variant was smart enough to calculate it once. In addition, COUNT is a tiny bit more efficient than SUM.

In the end, conditional aggregation is only slightly slower than subqueries (1062 vs 1031), maybe because WHERE is a bit more efficient than CASE in itself, and besides, WHERE filters out quite a few rows, so COUNT has fewer rows to process.


In practice I would use conditional aggregation, because I think the number of reads is more important. If your table is small enough to fit and stay in the buffer pool, then either query will be fast for the end user. But if the table is larger than available memory, I expect that reading from disk would slow the subqueries down significantly.


Second test

On the other hand, filtering the rows out as early as possible is also important.

Here is a slight variation of the test, which demonstrates it. Here I set the threshold to be GETDATE() + 100 years, to make sure that no rows satisfy the filter criteria.

Warm the cache (CPU=344)

SELECT -- warm cache
    COUNT(*) AS all_cnt
FROM LogTable
OPTION (RECOMPILE);

Subqueries (CPU=500)

SELECT -- subqueries
(
    SELECT count(*) FROM LogTable 
) all_cnt, 
(
    SELECT count(*) FROM LogTable WHERE datesent > DATEADD(year,100,GETDATE())
) last_year_cnt
OPTION (RECOMPILE);

Original conditional aggregation (CPU=937)

SELECT -- conditional original
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > DATEADD(year,100,GETDATE())
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables (CPU=750)

DECLARE @VarYear100 datetime = DATEADD(year,100,GETDATE());

SELECT -- conditional variables
    COUNT(*) AS all_cnt,
    SUM(CASE WHEN datesent > @VarYear100
             THEN 1 ELSE 0 END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

Conditional aggregation with variables and COUNT instead of SUM (CPU=750)

DECLARE @VarYear100 datetime = DATEADD(year,100,GETDATE());

SELECT -- conditional variable, count, not sum
    COUNT(*) AS all_cnt,
    COUNT(CASE WHEN datesent > @VarYear100
             THEN 1 ELSE NULL END) AS last_ten_year_cnt
FROM LogTable
OPTION (RECOMPILE);

(screenshot of the test-2 timings)

Below is a plan with subqueries. You can see that 0 rows went into the Stream Aggregate in the second subquery, all of them were filtered out at the Table Scan step.

(screenshot of the subquery plan)

As a result, subqueries are again faster.

Third test

Here I changed the filtering criteria of the previous test: all > comparisons were replaced with <. As a result, the conditional COUNT counted all rows instead of none. Surprise, surprise! The conditional aggregation query took the same 750 ms, while the subqueries went up to 813 ms instead of 500 ms.
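For reference, a sketch of what the test-3 subquery presumably looked like; only the comparison operator differs from test 2:

```sql
-- Test 3 sketch: with '<', the filter keeps every row, so the second
-- subquery now counts all 5M rows instead of none.
SELECT
(
    SELECT COUNT(*) FROM LogTable
) AS all_cnt,
(
    SELECT COUNT(*) FROM LogTable WHERE datesent < DATEADD(year, 100, GETDATE())
) AS last_year_cnt
OPTION (RECOMPILE);
```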

(screenshot of the test-3 timings)

Here is the plan for subqueries:

(screenshot of the test-3 subquery plan)

Could you give me an example, where conditional aggregation notably outperforms the subquery solution?

Here it is. Performance of subqueries method depends on the data distribution. Performance of conditional aggregation does not depend on the data distribution.

Subqueries method can be faster or slower than conditional aggregation, it depends on the data distribution.

Knowing this you can decide which method to choose.


Bonus details

If you hover the mouse over the Table Scan operator you can see the Actual Data Size in different variants.

  1. Simple COUNT(*):

(screenshot: Actual Data Size for COUNT(*))

  2. Conditional aggregation:

(screenshot: Actual Data Size for conditional aggregation)

  3. Subquery in test 2:

(screenshot: Actual Data Size for the test-2 subquery)

  4. Subquery in test 3:

(screenshot: Actual Data Size for the test-3 subquery)

Now it becomes clear that the difference in performance is likely caused by the difference in the amount of data that flows through the plan.

In the case of the simple COUNT(*) there is no Output List (no column values are needed) and the data size is the smallest (43MB).

In the case of conditional aggregation, this amount doesn't change between tests 2 and 3; it is always 72MB, and the Output List has one column, datesent.

In the case of subqueries, this amount does change depending on the data distribution.

Pomfrey answered 21/8, 2017 at 12:30 Comment(6)
Nice insight. Have you tried indexing datesent, or are your results purely without it?Castro
In the test described there was no index.Pomfrey
Ok, so it seems that if you pay enough attention to how you use conditional aggregation, you can achieve performance similar to the subquery solution. Nevertheless, if an index is involved, the subquery solution dominates.Castro
@RadimBača, your tests and mine used a pretty small table that fits into memory. If your table is so large that the engine has to read pages from disk, an extra scan or seek of the table may be very expensive. It also depends on the data distribution. If you have an index and the WHERE condition filters out most of the rows, the query will be very efficient. If the WHERE condition keeps most of the rows for processing, reading them again from disk could be expensive.Pomfrey
thank you for answering the first question. I have generalized the second one; however, let's stick to a solution that fits in memory.Castro
Really excellent answer. I would just slightly alter the conclusion: "Performance of the subqueries method depends on the data distribution and the existence of indexes." Similarly, indexes can be mentioned for conditional aggregation. I consider index usage important for the conclusion, since conditional aggregation cannot really exploit one.Castro

Here's my example, where subqueries on large tables were extremely slow (around 40-50 seconds) and I was advised to rewrite the query with FILTER (conditional aggregation), which sped it up to about 1 second. I was amazed.

Now I always use FILTER-based conditional aggregation, because you join the large tables only once and all the retrieval is done with FILTER. It's a bad idea to run correlated sub-selects against large tables.

Thread: SQL Performance Issues with Inner Selects in Postgres for tabulated report

I needed a tabulated report, as follows,

Example (easy flat stuff first, then the complicated tabulated stuff):

RecallID | RecallDate | Event |..| WalkAlone | WalkWithPartner |..| ExerciseAtGym
256      | 10-01-19   | Exrcs |..| NULL      | NULL            |..| yes
256      | 10-01-19   | Walk  |..| yes       | NULL            |..| NULL
256      | 10-01-19   | Eat   |..| NULL      | NULL            |..| NULL
257      | 10-01-19   | Exrcs |..| NULL      | NULL            |..| yes

My SQL had Inner Selects for the tabulated answer-based columns, and looked like this:

select 
-- Easy flat stuff first
r.id as recallid, r.recall_date as recalldate, ... ,

-- Example of Tabulated Columns:
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l 
where l.id=aq.answer_choice_id and aq.question_id=13 
and aq.id=ans.activity_question_id and aq.activity_id=27 and ans.event_id=e.id) 
     as transportationotherintensity,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=66 and l.id=aq.answer_choice_id and aq.question_id=14
and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as commutework,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=67 and l.id=aq.answer_choice_id and aq.question_id=14 and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as commuteschool,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=95 and l.id=aq.answer_choice_id and aq.question_id=14 and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as dropoffpickup,

The performance was horrible. Gordon Linoff recommended a single join on the large table ANSWERS_T, with FILTER applied to each of the tabulated selects. That sped it up to about 1 second.

select ans.event_id,
       max(l.description) filter (where aq.question_id = 13 and aq.activity_id = 27) as transportationotherintensity,
       max(l.description) filter (where l.id = 66 and aq.question_id = 14 and aq.activity_id = 67) as commutework,
       . . .
from activity_questions_t aq join
     lookup_t l 
     on l.id = aq.answer_choice_id join
     answers_t ans
     on aq.id = ans.activity_question_id
group by ans.event_id
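As a side note, FILTER is PostgreSQL syntax. In SQL Server, the same shape would presumably be written with MAX over a CASE expression; here is a sketch of the first two columns, using the same (hypothetical) schema as above:

```sql
-- Sketch: SQL Server has no FILTER clause; MAX(CASE ...) is the usual
-- equivalent. NULLs from non-matching rows are ignored by MAX.
SELECT ans.event_id,
       MAX(CASE WHEN aq.question_id = 13 AND aq.activity_id = 27
                THEN l.description END) AS transportationotherintensity,
       MAX(CASE WHEN l.id = 66 AND aq.question_id = 14
                THEN l.description END) AS commutework
FROM activity_questions_t aq
JOIN lookup_t l    ON l.id = aq.answer_choice_id
JOIN answers_t ans ON aq.id = ans.activity_question_id
GROUP BY ans.event_id;
```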
Wiliness answered 4/5, 2020 at 15:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.