Downsampling Time Series: Average vs Largest-Triangle-Three-Buckets
I'm building line charts with Flot Charts to display time series.

To reduce the number of points to display, I downsample by averaging all data points that fall within the same hour.

However, I recently discovered the Largest-Triangle-Three-Buckets (LTTB) algorithm: http://flot.base.is/

  1. What are the differences between using such an algorithm and using a simple function like an average (per minute, per hour, per day, ...)?

  2. To speed up queries over long periods, does it make sense to pre-calculate an SQL table on the server side by applying LTTB to each month of data, and then let the client side apply another LTTB pass on the aggregated data?

Capel asked 18/7, 2018 at 8:17 (1 comment)
The Master's thesis linked from the page you've linked to details how the LTTB algorithm works and how it differs from a simple average function. Page 23 specifically uses pseudocode to demonstrate the algorithm. - Marmolada
1: The problem with averages, for my purposes, is that they nuke large differences between samples: my peaks and valleys were more important than what was happening between them. The point of the Three-Buckets algorithm is to try to preserve those inflection points (peaks/valleys) while not worrying about showing you all the times the data was similar or the same.

So, in my case, where the data was generally all the same (or close enough; temperature data) until sample X, at which point a small percentage change was important to show in the graph, the buckets algorithm was perfect.
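To make the "averages nuke the peaks" point concrete, here is a minimal sketch of the kind of per-hour averaging the question describes (names are my own; it assumes millisecond timestamps and points sorted by time):

```typescript
interface Point { x: number; y: number; } // x = Unix timestamp in ms

// Per-hour average downsampling: every point within the same hour
// collapses into one mean value, so a short spike inside an otherwise
// flat hour gets diluted by the surrounding flat samples.
function hourlyAverage(data: Point[]): Point[] {
  const buckets = new Map<number, { sumX: number; sumY: number; n: number }>();
  for (const p of data) {
    const hour = Math.floor(p.x / 3_600_000); // ms per hour
    const b = buckets.get(hour) ?? { sumX: 0, sumY: 0, n: 0 };
    b.sumX += p.x; b.sumY += p.y; b.n++;
    buckets.set(hour, b);
  }
  return [...buckets.values()].map(b => ({ x: b.sumX / b.n, y: b.sumY / b.n }));
}

// Example: 59 flat samples at 20.0 °C plus one spike to 25.0 °C average
// out to ~20.08 °C; the one sample that mattered vanishes from the chart.
```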

Also, since the buckets algorithm is parameterized, you can change the values (how much data to keep), see which values nuke the most data while looking visually nearly identical, and decide how much data you can dispense with before your graph has had too much removed.
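For reference, here is a minimal TypeScript sketch of LTTB as described in the linked thesis; `threshold` is the "how much data to keep" knob mentioned above, and all names are my own. Treat it as a sketch, not the reference implementation:

```typescript
interface Point { x: number; y: number; }

// Largest-Triangle-Three-Buckets: downsample `data` (sorted by x) to
// roughly `threshold` points, keeping the point in each bucket that forms
// the largest triangle with the previously kept point and the average of
// the next bucket.
function lttb(data: Point[], threshold: number): Point[] {
  if (threshold >= data.length || threshold <= 2) return data.slice();

  const sampled: Point[] = [data[0]]; // always keep the first point
  const bucketSize = (data.length - 2) / (threshold - 2);
  let a = 0; // index of the previously selected point

  for (let i = 0; i < threshold - 2; i++) {
    // Average of the *next* bucket: the third vertex of the triangle.
    const nextStart = Math.floor((i + 1) * bucketSize) + 1;
    const nextEnd = Math.min(Math.floor((i + 2) * bucketSize) + 1, data.length);
    let avgX = 0, avgY = 0;
    for (let j = nextStart; j < nextEnd; j++) { avgX += data[j].x; avgY += data[j].y; }
    const n = nextEnd - nextStart;
    avgX /= n; avgY /= n;

    // Scan the current bucket for the point with the largest triangle area.
    const start = Math.floor(i * bucketSize) + 1;
    const end = Math.floor((i + 1) * bucketSize) + 1;
    let maxArea = -1, maxIndex = start;
    for (let j = start; j < end; j++) {
      const area = Math.abs(
        (data[a].x - avgX) * (data[j].y - data[a].y) -
        (data[a].x - data[j].x) * (avgY - data[a].y)
      ) / 2;
      if (area > maxArea) { maxArea = area; maxIndex = j; }
    }
    sampled.push(data[maxIndex]);
    a = maxIndex;
  }

  sampled.push(data[data.length - 1]); // always keep the last point
  return sampled;
}
```

Because the first and last points are always kept and each bucket contributes its most "triangle-stretching" point, peaks and valleys survive even aggressive thresholds.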

The naive approach would be decimation (removing X out of N samples), but what happens if it's the outliers you care about and the algorithm nukes an outlier? So then you change your decimation so that if the difference is too great, it doesn't nuke that sample. LTTB is a more sophisticated version of that concept.
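As an illustration of that naive approach (a sketch only; `maxDelta` is a made-up tuning knob, not anything from the paper):

```typescript
interface Point { x: number; y: number; }

// Naive decimation with an outlier guard: keep every nth sample, except
// when dropping a sample would hide a large jump, in which case keep it
// anyway. LTTB generalises this ad-hoc rule.
function decimate(data: Point[], n: number, maxDelta: number): Point[] {
  const out: Point[] = [];
  for (let i = 0; i < data.length; i++) {
    const isNth = i % n === 0;
    const bigJump =
      out.length > 0 && Math.abs(data[i].y - out[out.length - 1].y) > maxDelta;
    if (isNth || bigJump) out.push(data[i]);
  }
  return out;
}
```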

2: It depends on how quickly you can compute it all, whether the data ever changes, and various other factors. That's up to you. From my perspective, once my data was in the past and a sample was 'chosen' to represent the bucket's value, it would never change, so I could save it and never recalculate it again.

Since your question is a bit old, what'd you end up doing?

Freedafreedman answered 24/1, 2019 at 21:18 (4 comments)
Thank you very much, clear explanation. About Q2: for now, LTTB is still applied on the fly to raw data. However, if the client requests a large period (millions of data points), it can take several seconds to complete. Ideally, every graph would render in a few seconds. I work in batches, processing the entire dataset every day, and I can do the computations quickly. So in my case it sounds like a good idea to precompute LTTB for each time series in my dataset, but it's unclear which LTTB sampling parameter to apply, as each time series has a different range. How do you handle this on your side? - Capel
Calculating ahead of time is definitely the best idea, where possible. In my case, I was generating ~400k of JSON; I was able to get that down to ~6k without any noticeable difference in the data, and even small differences (1-2% deviations) show up just fine. If I were you, I'd grab an example implementation and put your data through it, tweaking the parameters (bucket size, mainly) until it looks like a good compromise between size and visual fidelity to the data. My data needs everything on a sliding 24-hour window, so I just implemented it by hand from the original paper. - Freedafreedman
Thank you, impressive results. As far as I understand, LTTB picks the "best" data point from every bucket. I have the same implementation (server-side) as flot.base.is, so I can configure a threshold (max number of output data points), but I can't tweak the bucket size directly. My problem is that I have time series with multiple period lengths (Jan. 2018 to Dec. 2018, Jan. 1910 to Jan. 2019, etc.), so I basically can't precompute LTTB with a constant threshold. Do you think it is a good idea, for each time series in my dataset, to apply LTTB with a dynamic threshold based on the period length? - Capel
Not sure, and the ultimate issue is that the algorithm can't tell you what's best for your situation: you have to run it, graph it, and determine whether you can get away with eliminating more data. The critical issue for me was not losing inflection points, and it definitely works very well for that situation. I'd play with it. - Freedafreedman
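For what it's worth, the dynamic-threshold idea from the comments could look something like the sketch below: scale the precomputed threshold with the span of each series. The function name, the points-per-day budget, and the clamping bounds are all assumptions for illustration, not anything from the LTTB paper:

```typescript
// Pick an LTTB threshold proportional to a series' time span, clamped so
// short series keep everything and very long series stay bounded.
function dynamicThreshold(firstTs: number, lastTs: number): number {
  const days = (lastTs - firstTs) / 86_400_000; // ms per day
  const pointsPerDay = 4;                       // assumed visual budget
  return Math.min(5000, Math.max(500, Math.round(days * pointsPerDay)));
}
```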
