Why not to use timestamp with Interleaved Sortkey?
Asked Answered
F

4

11

I'm trying to figure out the different types of sortkeys in Amazon Redshift and I encountered a strange warning here, which is not explained:

Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.

And yet, in their own example, Amazon uses interleaved key on a date column with good performance.

So, my question is - what's the explanation to this warning and should I take it seriously? More precisely - is there a problem with using interleaved key over a timestamp column?

Fernyak answered 11/2, 2019 at 13:2 Comment(2)
why not try it with your data and your queries - and seeArrestment
@JonScott I don't have permissions here to do that and I'm learning the issue to see if we can do better. So I wanted a theoretical answer first...Fernyak
A
5

I think it might have been explained later on when they describe issues around vacuuming/reindexing:

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected.

So if that is the only reason, then it just means you will have increased maintenance on index.

Apothecary answered 11/2, 2019 at 13:40 Comment(3)
Thanks! I wonder if there is a calculation somewhere to see what is the cost of reindexing and how to know what is the right balance.Fernyak
There's another part of the docs (which I'm unable to find) which says something that boils down to there being almost no cases where it makes sense to use interleaved sort keys & as a general rule to use compound sort keysAdonai
Two follow up questions . . . does this mean that if the table is rebuilt daily, this is a non-issue? And.... if you really only want to index on one column only, isn't interleaved your only choice for sort key type? Thx!Salmonberry
S
6

From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html

As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns.

The key point in the original quote is not that that data is a date or timestamp, it's that it increases "monotonically", which in this context presumably means increasing sequentially such as an event timestamp or an Id number.

Sacha answered 13/2, 2019 at 9:40 Comment(0)
A
5

I think it might have been explained later on when they describe issues around vacuuming/reindexing:

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected.

So if that is the only reason, then it just means you will have increased maintenance on index.

Apothecary answered 11/2, 2019 at 13:40 Comment(3)
Thanks! I wonder if there is a calculation somewhere to see what is the cost of reindexing and how to know what is the right balance.Fernyak
There's another part of the docs (which I'm unable to find) which says something that boils down to there being almost no cases where it makes sense to use interleaved sort keys & as a general rule to use compound sort keysAdonai
Two follow up questions . . . does this mean that if the table is rebuilt daily, this is a non-issue? And.... if you really only want to index on one column only, isn't interleaved your only choice for sort key type? Thx!Salmonberry
F
0

The date(not timestamp) column as a interleaved sort key makes sense when you know in an average X number of rows are processed everyday and you are going to filter based on it, if you are not going to use it then leave it out. Also a note on vacuum - when the VACUUM process is in progress, it needs temporary space to be able to complete the task by sorting and then merging the data in chunks. Cancelling the VACUUM process mid flight will cause extra spaces to be not reclaimed so if for some reason any Vacuum has ever been cancelled in your cluster this can be accounted to the space increase. See the link https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html#r_VACUUM_usage_notes and point 3 the last point is of particular interest. In my case the tables ended up growing very rapidly compared to the amount of rows inserted and had to build an auto table creation using deep copy

Farseeing answered 13/2, 2019 at 5:41 Comment(0)
M
0

Timestamp column may go to hours, minutes, seconds and milliseconds which is costly to sort the data. A data with milliseconds granularity is like having too much degree of zone maps to keep a record where the data starts and ends within dataset. Same is not true with date column in sort key. Date column has less degress of zone maps to keep a track of data residing in the table.

Macmullin answered 26/7, 2022 at 8:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.