Why does Redshift not need materialized views or indexes?
In the Redshift FAQ under

Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?

It says the following:

Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.

Why is this the case?

Neuberger answered 31/5, 2016 at 13:41 Comment(2)
As of 2019, Redshift does have Materialized views. aws.amazon.com/about-aws/whats-new/2019/11/…Runofthemine
As of 2023, incremental materialized views are one of the big Redshift advantages. How the times have changed!Delete

It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.

Materialised Views

I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.

I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?

Indexes

RedShift does not have indexes.

It does have a SORTKEY, which is exceptionally similar to a clustered index: it is simply a list of fields by which the data is ordered (like a composite clustered index).

It has even recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to support multiple independent sort orders. Instead of ordering by a THEN b THEN c, it effectively orders by each of them at the same time.
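For illustration, the two styles look like this in DDL (the table and column names here are made up):

```sql
-- Compound sort key: rows are ordered by a, then b, then c.
-- Great for filters on a, or on a and b; weak for filters on c alone.
CREATE TABLE events_compound (
    a BIGINT,
    b BIGINT,
    c BIGINT
)
COMPOUND SORTKEY (a, b, c);

-- Interleaved sort key: each column gets equal weight in the sort order,
-- so filters on any one of a, b or c can skip blocks.
CREATE TABLE events_interleaved (
    a BIGINT,
    b BIGINT,
    c BIGINT
)
INTERLEAVED SORTKEY (a, b, c);
```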

That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics

As well as being the storage pattern, this effectively becomes a set of pseudo-indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This in turn allows RedShift to identify which blocks to read from the other columns
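You can inspect these per-block statistics yourself via Redshift's STV_BLOCKLIST system view, which exposes the min/max values the engine uses to decide which blocks to skip (the table name in this sketch is hypothetical):

```sql
-- Per-block summary statistics (zone maps) for one table.
-- minvalue/maxvalue are what let Redshift skip blocks for a filter like z = 1234.
SELECT col, blocknum, minvalue, maxvalue, num_values
FROM stv_blocklist
WHERE tbl = (SELECT table_id FROM svv_table_info WHERE "table" = 'my_table')
ORDER BY col, blocknum;
```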

Overtrump answered 31/5, 2016 at 14:1 Comment(4)
If I were to make materialized views in redshift manually, should I just be creating and dropping tables on an interval?Neuberger
@AR7 - That's up to you. We deal with multi-terabyte data sets. Rebuilding that whole table would be punitive to say the least. RedShift's UPDATE behaviour is to soft-delete a record (until a VACUUM) and INSERT the new data in an unsorted portion of the table. For that reason we simply DELETE anything that's changed or gone, then INSERT anything that's changed or is new. Then deal with VACUUM and ANALYSE in house-keeping phases. Re-build would avoid the unsorted blocks, and is itself faster than VACUUM. It's a trade off.Overtrump
Do you have any resources you'd recommend for working with redshift? I'm pretty new to using it, and currently there's not that much data in there, but it's bound to grow and I'd rather not be unprepared. I don't know much about vacuuming, or best practices with redshift, and it'd be nice to learn more about it other than what amazon has in their docs.Neuberger
@AR7 - Amazon's own documents are the best resource to be honest. And there is quite a lot in there. Next best is just to search for Blogs about RedShift and scour through the mixture of mis-information and search for genuine pearls. Much of what I've learned has been mixing both of the above with actual real world experiences. Not the ideal way to learn quickly, but a good way to learn the truth (if you can be regimented enough).Overtrump

As of Dec 2019, Redshift has a preview of materialized views: Announcement

From the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.

Swordbill answered 9/12, 2019 at 16:59 Comment(0)

The simple answer is: because it can read the needed data really, really fast and in parallel.

One of the primary uses of indexes is "needle-in-the-haystack" queries. These are queries where only a relatively small number of rows are needed and these match a WHERE clause. Columnar data stores handle these differently. The entire column is read into memory -- but only the column, not the rest of the row's data. This is sort of similar to having an index on each column, except the values need to be scanned for the match (that is where the parallelism comes in handy).

Other uses of indexes are for matching key pairs for joining or for aggregations. These can be handled by alternative hash-based algorithms.

As for materialized views, RedShift's strength is not updating data. Many such queries are quite fast enough without materialization. And, materialization incurs a lot of overhead for maintaining the data in a high transaction environment. If you don't have a high transaction environment, then you can increment temporary tables after batch loads.
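A minimal sketch of that "increment after batch loads" pattern (all table, column and date values here are hypothetical): instead of maintaining a true materialized view, re-aggregate only the window affected by the latest load.

```sql
-- After each batch load, refresh only the affected date range of a summary table.
BEGIN;

DELETE FROM daily_sales_summary
WHERE sale_date >= '2016-05-01';          -- drop the window the load touched

INSERT INTO daily_sales_summary (sale_date, total_amount)
SELECT sale_date, SUM(amount)
FROM sales
WHERE sale_date >= '2016-05-01'           -- re-aggregate just that window
GROUP BY sale_date;

COMMIT;
```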

Shackelford answered 31/5, 2016 at 13:58 Comment(3)
Ah ok that makes sense. Could you clarify what If you don't have a high transaction environment, then you can increment temporary tables after batch loads. means? I'm not quite sure I follow.Neuberger
As far as I am aware, and from my experience, RedShift does not use an "entire column is read into memory" paradigm. Instead it is even more fine-grained than that. Columns are fragmented into 1MB blocks with summary statistics that allow certain blocks to not be read at all. In fact, if the field is unique, the summary statistics allow the engine to identify a single 1MB block to be read and the rest ignored.Overtrump
@Overtrump . . . My understanding is that the "page" header contains the minimum and maximum values for the column on the page. This is a tremendous benefit for "sorted" columns (say, an auto incremented id or insertion time). It can be beneficial for other columns, but that is not necessarily the case. Of course, these things may change over time, so my understanding could be out-of-date.Shackelford

Indexes are basically used in OLTP systems to retrieve a specific value or a small group of values. OLAP systems, on the contrary, retrieve large sets of values and perform aggregations over them, so indexes are not the right fit. Instead, Redshift uses a secondary structure called zone maps together with sort keys.

Traditional indexes are built on B-trees. The "Life without a B-tree" section in the blog below explains with examples how a B-tree-based index affects OLAP workloads.

https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1

The combination of columnar storage, compression encodings, data distribution, query compilation, optimization, etc. is what gives Redshift its speed.

Implementing the above factors reduces I/O operations on Redshift, eventually providing better performance. An efficient solution requires a great deal of knowledge of the above areas as well as of the queries you would run on Amazon Redshift.

For example, Redshift supports sort keys, compound sort keys and interleaved sort keys. Suppose your table structure is lineitem(orderid, linenumber, supplier, quantity, price, discount, tax, returnflag, shipdate). If you select orderid as your sort key but your queries are based on shipdate, Redshift will not be operating efficiently. If you have a compound sort key on (orderid, shipdate) and you query only on shipdate, Redshift will not be operating efficiently. If you have an interleaved sort key on (orderid, shipdate), queries on shipdate alone can still be served efficiently.

Redshift does not support materialized views, but it easily allows you to create (temporary/permanent) tables by running SELECT queries on existing tables. This eventually duplicates data, but in the format required for your queries (similar to a materialized view). The blog below gives some information on this approach.

https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html
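A hedged sketch of that approach (table, column and view names here are made up): materialize a query's result as a plain table and rebuild it on a schedule.

```sql
-- Periodically rebuild a "materialized view" as an ordinary table.
DROP TABLE IF EXISTS mv_daily_arpu;

CREATE TABLE mv_daily_arpu AS
SELECT activity_date,
       SUM(revenue) / COUNT(DISTINCT user_id) AS arpu
FROM user_activity
GROUP BY activity_date;
```

Dashboards then query mv_daily_arpu directly, trading staleness between rebuilds for much faster reads.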

Redshift also fared well against other systems like Hive, Impala, Spark, BigQuery, etc. in one of our recent benchmarks.

Farcy answered 31/5, 2016 at 15:38 Comment(0)

They recently added support for Materialized Views in Redshift: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-redshift-introduces-support-for-materialized-views-preview/

Syntax for materialized view creation:

CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[ table_attributes ]   
AS query 

Syntax for refreshing a materialized view:

REFRESH MATERIALIZED VIEW mv_name
Duchy answered 16/1, 2020 at 9:4 Comment(1)
Is there any way to schedule MV refresh in RS? One easy way is to trigger scheduled Lambdas that execute refresh commands on the db. Would like to know if there are any native RS capabilities to do this.Decennial

© 2022 - 2024 — McMap. All rights reserved.