Complementary to Joe Harris's answer, I'll provide a deeper explanation of his sort key compression recommendation:
While Redshift stores data in a columnar format on disk, the sort and distribution keys operate on full rows. To sort your data, the columns in your sort key are converted to 8-byte integers; their bits are then concatenated (compound sort style) or interleaved (interleaved sort style) to form a sortable value. This is all done on uncompressed values, so it is unaffected by your choice of compression. In many cases the sort key gives the intuitive "ascending" sort order, but due to the intermediate 8-byte integer conversion it can give unexpected results (see full details here); the 8-byte limit can also make the sort key less effective/useful if you are storing large values.
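To make the two sort styles concrete, here's a toy sketch of how a composite value could be built from per-column 8-byte integers. This is my own illustration of the idea, not Redshift's actual internals:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # 8 bytes

def compound_key(cols):
    """Compound sort style: concatenate each column's 64 bits in order,
    so the first column dominates the ordering."""
    key = 0
    for c in cols:
        key = (key << 64) | (c & MASK64)
    return key

def interleaved_key(cols):
    """Interleaved sort style: take one bit from each column in turn,
    most-significant bits first, so every column gets equal weight."""
    key = 0
    for bit in range(63, -1, -1):          # MSB to LSB
        for c in cols:
            key = (key << 1) | ((c >> bit) & 1)
    return key

# Compound: any change in the first column outweighs the second entirely.
assert compound_key([1, 0]) > compound_key([0, MASK64])
```

Note how with the compound key the second column only breaks ties, while with the interleaved key the high bits of every column land near the top of the composite value.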
Redshift stores min/max 8-byte integer values for every block, which are viewable in STV_BLOCKLIST. However, these values are different from the full concatenated/interleaved sort key that Redshift uses to filter your data. The full values are instead stored in an internal zonemap which, as far as I know, is inaccessible to end users. The zonemap structure is something like this:
| Sortkey min | Sortkey max | Col0 min block | Col0 max block | Col1 min block | Col1 max block | ... up to ColN |
|---|---|---|---|---|---|---|
| 0x001B... | 0x001C... | 5 | 10 | 0 | 2 | ... |
| 0x00A8 | 0x00FF... | 12 | 15 | 1 | 4 | ... |
The zonemap is generated from the sort key's first column. For every block of that first sort key column, we get its min/max row. For each column, we then determine the min/max blocks that intersect those rows. From these min/max blocks, plus the cached min/max 8-byte integers from STV_BLOCKLIST, we can assemble the full, composite sort key min/max values.
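The construction above can be sketched roughly like this. The data layout and names are my own assumptions for illustration, not Redshift internals:

```python
def build_zonemap(first_col_blocks, col_row_ranges):
    """first_col_blocks: one (min_row, max_row, key_min, key_max) tuple per
    block of the first sort key column, where key_min/key_max are the
    composite sort key values for that block's row range.
    col_row_ranges: {col: [(min_row, max_row), ...]}, the row range covered
    by each block of each column."""
    zonemap = []
    for min_row, max_row, key_min, key_max in first_col_blocks:
        entry = {"key_min": key_min, "key_max": key_max, "col_blocks": {}}
        for col, ranges in col_row_ranges.items():
            # blocks of this column whose row range intersects [min_row, max_row]
            hits = [i for i, (lo, hi) in enumerate(ranges)
                    if lo <= max_row and hi >= min_row]
            entry["col_blocks"][col] = (min(hits), max(hits))
        zonemap.append(entry)
    return zonemap
```

Each entry corresponds to one block of the first sort key column, which is the crux of the compression argument below.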
When a query filters by the sort key, Redshift first converts the filter conditions to an equivalent composite sort key (i.e. concatenated/interleaved 8-byte integers). It then loops through the entries of the zonemap, checking whether the value falls between the min/max sort key. For entries that match, it marks the blocks that need to be read for each column.
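The filtering step can be sketched as a toy model (again, my own names and structures, not Redshift's actual code):

```python
from dataclasses import dataclass, field

@dataclass
class ZonemapEntry:
    key_min: int   # composite sort key min covered by this entry
    key_max: int   # composite sort key max
    col_blocks: dict = field(default_factory=dict)  # col -> (min_block, max_block)

def blocks_to_read(zonemap, filter_key):
    """Return, per column, the blocks that may contain filter_key."""
    needed = {}
    for entry in zonemap:
        if entry.key_min <= filter_key <= entry.key_max:
            for col, (lo, hi) in entry.col_blocks.items():
                needed.setdefault(col, set()).update(range(lo, hi + 1))
    return needed

# Entries mirroring the example table above:
zonemap = [
    ZonemapEntry(0x001B, 0x001C, {"col0": (5, 10), "col1": (0, 2)}),
    ZonemapEntry(0x00A8, 0x00FF, {"col0": (12, 15), "col1": (1, 4)}),
]
```

A filter value that matches only the first entry marks only that entry's block ranges, and a value falling between the two entries' key ranges lets Redshift skip every block.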
Since the zonemap is generated from the first sort key column, its effectiveness is tied to how many blocks that column occupies. The more blocks it has, the more entries the zonemap will have, and the more fine-grained the filtering can be. For example, if the first sort key column is highly compressed down to 2 blocks, there will be only 2 entries in the zonemap. For your sort key to be maximally effective, the first sort key column must therefore occupy as many blocks as the maximum occupied by any column in the table. Setting the first sort key column to raw encoding will usually fulfill this condition.
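The impact is easy to put numbers on. With assumed (not measured) figures, and uniformly distributed sorted data, a point filter matches roughly one zonemap entry, so every column must read roughly 1/N of its blocks when there are N entries:

```python
def fraction_scanned(n_zonemap_entries):
    """Rough fraction of each column's blocks read for a point filter,
    assuming uniformly distributed sorted data: one matching entry out of N."""
    return 1 / n_zonemap_entries

# First sort key column compressed down to 2 blocks -> 2 zonemap entries:
compressed = fraction_scanned(2)      # half of every column scanned
# Same column raw-encoded across 1000 blocks -> 1000 entries:
raw = fraction_scanned(1000)          # 0.1% of every column scanned
```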
Why does Redshift behave like this?
You might think Redshift could be more intelligent and index each column separately in its own zonemap. That would eliminate the dependency on the first sort key column's block count, but it would require searching N zonemaps instead of one; the Redshift creators have evidently determined that for the average Redshift user/workload, the extra filtering effectiveness is not worth that overhead.
You may also wonder why Redshift generates the zonemap based on the first sort key column's blocks, rather than, say, one entry per 500 rows. Everything in Redshift is block based, e.g. caching min/max values for every block in STV_BLOCKLIST. Redshift would instead need to store an STV_ROWCHUNKLIST of sorts, holding min/max values for chunks of 500 rows. This could improve filtering, but in some cases would require additional zonemap scanning when blocks contain more than 500 rows. So once again I assume the Redshift creators have determined that for the average Redshift user/workload, the extra zonemap filtering overhead would not be worth it.