Selecting a continuous aggregate is slower than selecting raw data in TimescaleDB
In my database (PostgreSQL 12; TimescaleDB 1.7.0) there are multiple metrics tables containing one row per minute and device. Each row contains a deviceId, the time, four doubles and an enum value.

There are different time-based queries to analyze the data, such as drawing a graph over 12h slices of it or selecting the aggregated status of the last 5m.

To improve query performance I set up TimescaleDB continuous aggregate views for the 12h case, which drastically reduced query time because everything is precalculated. I tried the same for the much smaller 5m slices, expecting an improvement (albeit a less drastic one) since the data for each query would be much smaller. Surprisingly, the opposite was the case: selecting the raw data is now much faster than selecting the aggregate view, which I don't really understand.

Here's the definition of my view:

CREATE VIEW metric_5m
            WITH ( timescaledb.continuous,
            timescaledb.refresh_interval = '5 minutes' )
AS
SELECT device,
       time_bucket('5 minutes', time)   as "time_bucket",
       max(metric.maximum) as "maximum",
       min(metric.minimum) as "minimum",
       avg(metric.average) as "average",
       avg(metric.sd)      as "sd"
FROM metric
GROUP BY time_bucket, device;

Selecting the raw data (~3.6 million rows in my test setup) takes around 300 ms, while selecting the view takes around 3,500 ms. I suspect I'm somehow using it wrong or with too-small intervals, because it performs well in the 12h case, but I couldn't find out why.
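For readers reproducing this, timings like these are usually gathered with `EXPLAIN ANALYZE`. A sketch of the two queries being compared (the 5-minute `WHERE` window is an assumption; the column names follow the view definition above):

```sql
-- Aggregate the raw hypertable directly for the last 5 minutes
EXPLAIN ANALYZE
SELECT device,
       time_bucket('5 minutes', time) AS time_bucket,
       max(maximum) AS maximum,
       min(minimum) AS minimum,
       avg(average) AS average,
       avg(sd)      AS sd
FROM metric
WHERE time > now() - INTERVAL '5 minutes'
GROUP BY time_bucket, device;

-- Read the same window from the continuous aggregate view
EXPLAIN ANALYZE
SELECT *
FROM metric_5m
WHERE time_bucket > now() - INTERVAL '5 minutes';
```

The plans show where the time goes: the second query has to read stored partials and finalize them, which is the cost discussed in the answer below.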

So, thankful for every help on this!

Extraordinary answered 1/9, 2020 at 12:36 Comment(0)
Your guess is correct: the slow query execution over the continuous aggregate is due to too-small intervals. The materialisation of a continuous aggregate stores partials, which are then used to calculate the final aggregates; this takes both space and time. Continuous aggregates therefore give a significant advantage with bigger intervals, while for small intervals it is more efficient to execute aggregate queries on the hypertable directly.
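You can see these partials for yourself by locating the internal materialization table behind the view. A sketch against the TimescaleDB 1.x catalog (the catalog layout changed in 2.x, so treat the table and column names as version-specific):

```sql
-- Find the internal hypertable that stores the partials for metric_5m
SELECT h.schema_name, h.table_name
FROM _timescaledb_catalog.continuous_agg ca
JOIN _timescaledb_catalog.hypertable h
  ON h.id = ca.mat_hypertable_id
WHERE ca.user_view_name = 'metric_5m';
```

Describing the resulting table (e.g. with `\d` in psql) shows that the aggregate columns are stored as opaque partial-state values rather than finished numbers, which is why selecting from the view still has finalization work to do.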

I am not aware of anyone having investigated how to estimate the grouping interval at which continuous aggregates pay off. I expect it depends on the number of aggregates, the datatypes involved, and the kinds of aggregates, since different aggregates have different partials. For example, avg requires more partial information than sum or count. This blog post gives some details about continuous aggregates and how they are materialised with partials.

You can also try whether compression helps performance, since it reduces the amount of data read from disk, and compressed data can be organised by the grouping columns.
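A sketch of enabling native compression on the raw hypertable, segmented by the grouping column. The policy function name below is the TimescaleDB 1.x API (renamed to `add_compression_policy` in 2.x), and the 7-day threshold is an assumption to tune:

```sql
-- Rows sharing a device are stored together, so per-device
-- aggregate scans read fewer, better-organised blocks
ALTER TABLE metric SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device',
  timescaledb.compress_orderby   = 'time DESC'
);

-- Automatically compress chunks older than 7 days (1.x API)
SELECT add_compress_chunks_policy('metric', INTERVAL '7 days');
```

Note that compressed chunks are read-only in this version, so the threshold should be comfortably past the point where new rows stop arriving for a chunk.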

Mangan answered 1/9, 2020 at 18:28 Comment(3)
Thanks a lot for your answer! Do you know of any resource or best practice on how to decide a good range for a continuous aggregate? – Extraordinary
@Extraordinary No, I don't think there are any resources with insights. It will depend on the number of aggregates in the query and the aggregates themselves, since partials differ between aggregates; e.g., the partial for avg is bigger than for max. – Mangan
Too bad; it would be really helpful to have some docs on this, or at least on how the partials are compiled. But thanks! – Extraordinary
