How Do I Aggregate Data by Day and Still Respect Timezones?

We are currently using a summary table that aggregates information for our users on an hourly basis in UTC. The problem is that this table has become so large that it is slowing our system down immensely. We have applied all of the recommended PostgreSQL tuning techniques and are still experiencing slowness.

Our idea was to start aggregating by day rather than by hour, but the problem is that we allow our customers to change their timezone, which means the data for each day has to be recalculated.

Does anyone know of a way to store a daily summary that still reports the correct numbers and totals when a customer switches timezones?
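
For concreteness, the current setup looks roughly like this (table and column names are simplified, not our real schema). The point is that a daily total is cut from the hourly UTC rows at a different boundary for every timezone:

```sql
-- Simplified version of the current setup; names are illustrative only.
CREATE TABLE hourly_summary (
    user_id  integer   NOT NULL,
    hour_utc timestamp NOT NULL,   -- start of the hour, in UTC
    revenue  numeric   NOT NULL,
    PRIMARY KEY (user_id, hour_utc)
);

-- Daily totals for one user in a viewer-chosen timezone: shift each UTC hour
-- into the target zone, truncate to the local day, then sum.
SELECT date_trunc('day',
           hour_utc AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York') AS local_day,
       sum(revenue) AS total
FROM   hourly_summary
WHERE  user_id = 42
GROUP  BY local_day
ORDER  BY local_day;
```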

Liebowitz asked 6/8, 2010 at 22:30 Comment(5)
Are we talking potentially all timezones on Earth?Leavy
Strictly speaking for data modeling, you're losing timezone level of detail when you go to day granularity. But, you may be able to aggregate by timezone, especially if the answer to @MPelletier's question is "No".Intuitional
@Leavy we aggregate by hour right now, so we only support timezones that are "on the hour"Liebowitz
@Intuitional There is no time zone information for the row of data, it is all in UTC. The change comes when a user wants to see how much they made on a certain day, in a certain timezone.Liebowitz
To further clarify: if they made $1 an hour every hour (UTC) on the 1st, and then $2 an hour every hour (UTC) on the 2nd and 3rd, then if they view the 2nd in EST it should add up to $20, and if they change their timezone to UTC it will add up to $24.Liebowitz

Summarise the data in tables with a timeoffset column, and a "day" field (a date) that is the day for that particular summary line. Index on (timeoffset, day, other relevant fields), clustered if possible (presumably PostgreSQL has clustered indexes?) and all should be well.
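
A minimal sketch of what I mean, assuming the simplified hourly_summary table from the question (all names illustrative):

```sql
-- One row per user, per whole-hour offset, per local day.
CREATE TABLE daily_summary (
    timeoffset smallint NOT NULL,   -- whole hours from UTC, e.g. -5 for EST
    day        date     NOT NULL,   -- the local day at that offset
    user_id    integer  NOT NULL,
    revenue    numeric  NOT NULL
);

CREATE INDEX daily_summary_idx ON daily_summary (timeoffset, day, user_id);

-- Building the rows for one offset from the hourly UTC table:
INSERT INTO daily_summary (timeoffset, day, user_id, revenue)
SELECT -5,
       (hour_utc + interval '-5 hours')::date,
       user_id,
       sum(revenue)
FROM   hourly_summary
GROUP  BY 2, 3;

-- A daily figure is then a single indexed lookup instead of a 24-row rollup:
SELECT revenue
FROM   daily_summary
WHERE  timeoffset = -5 AND day = DATE '2010-08-02' AND user_id = 42;
```

Repeat the INSERT for each supported offset; that gives the per-offset row-sets discussed in the comments, all within the one table.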

Nicholasnichole answered 6/8, 2010 at 22:39 Comment(9)
So, instead of 24 lines per day, one day would produce one line... times 24 time zones. I can't see substantial gain here.Leavy
I thought about this, but then I have to maintain 24 summary tables, which will also increase the possibility of a difference in reporting between timezones.Liebowitz
@Leavy - the difference is that you don't need to aggregate the 24 lines for a day to produce a daily figure - you pull out the one summary line for that particular timeoffset / day - so you're doing 1/24 of the work - with proper indexing of course.Nicholasnichole
@Russ - you don't have 24 summary tables - just the one, but with a timeoffset column which indicates the # of hours from GMT, and the day column indicates the day for that particular time offset. You'd have 24 rows for the 24 time offsets (or more or less if you need more or less time zones).Nicholasnichole
The hourly table has over 10M rows already, which is why I believe the performance is degrading; not necessarily because of the number of rows it has to aggregate, but the total amount it has to filter. The indexes are growing too large, I believe.Liebowitz
Also, my clustered index is on two other important columns, and with Postgres you only get one per table.Liebowitz
@Russ Bradberry: Can you trick Postgres with views? Say you index the day, is there a gain for a timezone-specific view? Or the other way around: a nested query for days (indexed) inside a query for the timezone?Leavy
@Will A, this may be viable with a columnar DB solution. I've been looking into a couple, and while there is a hefty initial overhead, it may prove to be a good solution.Liebowitz
@Russ - I don't think you need to worry about your existing clustered index - if you add timeoffset and day to the start of that index you'll still reap the benefit from it and will only need to scan through the relevant rows - it's the "24 tables" approach in one easier-to-maintain table. :) Will be interested in hearing how this goes.Nicholasnichole

I'm assuming you've gone through all the partitioning considerations, such as partitioning by user.

I can see several solutions to your problem, depending on the usage pattern.

  1. Aggregate data per day, per the user's selected timezone. In the event of a timezone change, programmatically recalculate the aggregates for that user (see the sketch after this list). This is plausible if timezone changes are infrequent and if a certain delay in the data is acceptable when a user changes timezones.

  2. If you have relatively few measures, you may maintain 24 columns for each measure - each describing the daily aggregate for the measure in a different timezone.

  3. If timezone changes are frequent and there are numerous measures, it seems like 24 different aggregate tables would be the way to go.
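
A rough sketch of option 1, assuming a daily_summary table shaped like the one in the earlier answer but holding only each user's currently selected offset, plus the simplified hourly_summary table from the question (all names illustrative):

```sql
-- When a user switches, say, to UTC-5, rebuild only that user's daily rows.
BEGIN;

DELETE FROM daily_summary
WHERE  user_id = 42;

INSERT INTO daily_summary (timeoffset, day, user_id, revenue)
SELECT -5,
       (hour_utc + interval '-5 hours')::date,
       user_id,
       sum(revenue)
FROM   hourly_summary
WHERE  user_id = 42
GROUP  BY 2, 3;

COMMIT;
```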

Greasewood answered 7/8, 2010 at 22:58 Comment(2)
The timezone changes are, in fact, relatively few. I could programmatically recalculate the measures based on the change, but the first change would have a significant delay. We have about 8 measures, so 24 columns per measure would not be a good idea. I'm beginning to think that 24 tables is the way to go. I looked into @Will A's solution and it may be viable with a columnar DB, but not with a DB that degrades with the number of rows.Liebowitz
192 integer columns isn't too bad, actually. And if you'll be using a columnar DB, I don't think you'll need any schema change whatsoever - at least not with the aforementioned problem in mind.Greasewood

I ran into this problem too. The solution I took: data stored as the date type uses the local timezone, while data stored as the datetime type uses UTC, because the statistics are indexed by the local date. Another reason is that, for now, we only have local data.
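
Roughly, the split described above looks like this (illustrative names):

```sql
CREATE TABLE daily_stats (
    stat_day    date        NOT NULL,  -- reporting day, in the local timezone
    recorded_at timestamptz NOT NULL,  -- exact event time, kept in UTC
    amount      numeric     NOT NULL
);

-- The statistics queries filter and group on the local date.
CREATE INDEX daily_stats_day_idx ON daily_stats (stat_day);
```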

Grandmotherly answered 26/4, 2016 at 7:53 Comment(0)

I’m facing the same problem. I’m thinking about aggregating by date and time (hour by hour in UTC). Then you could fetch data accordingly for whatever time zone you want. Unfortunately, this won’t work if you need to support timezones with a 45/30/15-minute offset. Alternatively, you could aggregate data every 15 minutes to support every timezone.
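
A sketch of the 15-minute variant (illustrative names; date_bin requires PostgreSQL 14 or later, older versions can bucket with an epoch floor as noted in the comment):

```sql
-- Quarter-hour buckets cover every real-world offset, since all offsets are
-- multiples of 15 minutes.
CREATE TABLE quarter_hour_summary (
    bucket_utc timestamptz NOT NULL,  -- start of the 15-minute bucket
    user_id    integer     NOT NULL,
    revenue    numeric     NOT NULL,
    PRIMARY KEY (user_id, bucket_utc)
);

-- On PostgreSQL < 14, replace date_bin(...) with
--   to_timestamp(floor(extract(epoch FROM event_time) / 900) * 900)
INSERT INTO quarter_hour_summary (bucket_utc, user_id, revenue)
SELECT date_bin('15 minutes', event_time, TIMESTAMPTZ '2000-01-01 00:00+00'),
       user_id,
       sum(amount)
FROM   raw_events              -- hypothetical source of raw rows
GROUP  BY 1, 2;

-- Daily totals for a viewer in a 45-minute-offset zone:
SELECT (bucket_utc AT TIME ZONE 'Asia/Kathmandu')::date AS local_day,
       sum(revenue) AS total
FROM   quarter_hour_summary
WHERE  user_id = 42
GROUP  BY local_day;
```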

Swot answered 18/4, 2021 at 9:56 Comment(0)
