AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.

What doesn't make any sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that are not referenced in a particular query.

More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than if TableName had, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.

SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'

My three questions are:

  1. What's the deal? Why does Redshift have this limitation if it claims to be a column store?
  2. Any suggestions for working around this limitation? Joins of multiple smaller tables seem to eventually approximate the performance of a single table (see the rough sketch after this list). I haven't tried pivoting the data.
  3. Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
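
To illustrate what I mean in question 2, here is a rough sketch of the split-and-join approach (table and column names are placeholders, not my real schema):

-- split one wide table into narrower tables that share a row key
CREATE TABLE attrs_a (row_id BIGINT, attr_0001 VARCHAR(64), attr_0002 VARCHAR(64) /* ... up to ~800 cols */);
CREATE TABLE attrs_b (row_id BIGINT, attr_0801 VARCHAR(64), attr_0802 VARCHAR(64) /* ... */);

-- a count restricted to columns in one piece only touches that piece
SELECT COUNT(1) FROM attrs_b
WHERE attr_0801 LIKE '%foo%';

-- restricting on columns from both pieces needs a join on the shared key
SELECT COUNT(1)
FROM attrs_a a
JOIN attrs_b b ON a.row_id = b.row_id
WHERE a.attr_0001 LIKE '%foo%' AND b.attr_0801 LIKE '%bar%';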
Tother answered 3/9, 2015 at 15:24 Comment(14)
If you need more than 1600 columns there is a strong likelihood that your data is poorly structured. You should look for opportunities to normalise your data (as you say, into multiple tables). The limitation on the number of columns is simply a factor of the optimisation engine and the number of references it stores, probably a legacy of the version of PostgreSQL that it came from. The column limit and whether it is columnar are completely unrelated. As for the drop in performance, I've not seen that before. Is your query Exactly as above?Brethren
Oh, and if you're only dealing with 10M x 2.5k then you shouldn't need RedShift. I'd use PostgreSQL for something that small. I'm using RedShift for trillions of rows distributed over dozens/hundreds of nodes.Brethren
@MatBailie, performance has to be subsecond, which is why we decided to go with Redshift. I'm pretty certain that one of the major advantages of a column-store database is being able to pull any arbitrary column without a hit associated with other columns. You can go straight to the columns of data that you need, load those, and that's it. You're completely isolated from the other columns. Finally, nope, my data is structured fine. I literally have that many completely unrelated attributes that I'd like to query against. Think of a segmentation use case. Thanks.Tother
@MatBailie, also, interesting that it's based off of PostgreSQL.Tother
Doing some more reading online, it seems as though many column-store databases actually have a column limit. Interesting.Tother
Do you require SQL for complex queries, or are you just retrieving specific values? If the latter, then DynamoDB would definitely be a faster solution.Roselba
@JohnRotenstein, no, DynamoDB doesn't seem to work for my use case because it appears there's a maximum of 20 attributes you can restrict on per table.Tother
You can put a JSON blob into a single DynamoDB attribute, so the number of "fields" can actually be significantly more. There is no limit on the number of attributes within the JSON record, but there is a 400K limit per item (row). See: aws.amazon.com/blogs/aws/dynamodb-update-json-and-moreRoselba
Do note that RedShift's per-query overhead is huge compared to most transactional (OLTP) databases. Also, its query plan generation time is relatively huge, so your first query always takes longer than the next run of the same query. This also means that parameterisation of queries (even tiny simple queries) is vastly important. Even then I've seen query cache misses more often than I'm used to. Overall this means RedShift excels at long running analytical queries, but underperforms at short running queries (in my experience). I wouldn't use it for 1s response times.Brethren
I'm also aware that ALTER TABLE DDL and lack of statistics or vacuuming can have varying degrees of impact on performance. So much so that to alter a schema we export the data, drop the table, recreate the table and reimport the data. Horrible, but managed by our ORM platform. Maybe if you give pseudocode for how you have tested we may spot possible steps to avoid. (Don't be clever, be clean, even if it's slow and laborious.)Brethren
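
A sketch of the export/drop/recreate/reload cycle described above (the S3 path, IAM role, and replacement DDL are placeholders):

-- unload the data, rebuild the table, then reload it
UNLOAD ('SELECT * FROM tablename')
TO 's3://my-bucket/tablename_backup/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';

DROP TABLE tablename;
CREATE TABLE tablename (row_id BIGINT /* ... new schema ... */);

COPY tablename
FROM 's3://my-bucket/tablename_backup/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';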
My results agree with yours, it appears the ideal number of columns is somewhere near 400. I created tables with different numbers of INT columns (1600, 1200, 800, 400, 200), populated with the same set of 1 million randomly generated rows. Then I vacuumed and analysed all the tables and ran the following query on each: SELECT COUNT(DISTINCT field_0001) FROM <table> WHERE field_0002 >= 0; The total response times, including network protocol, etc., were: 7.7s, 2.9s, 1.1s, 0.5s, 0.3s. I'm gob-smacked, but it's repeatable.Brethren
More confusingly, RedShift's own logs of the Execution time are vastly different: 214ms, 230ms, 90ms, 206ms, 89ms. This indicates to me that there is a one-time query overhead (such as locking resources) that is dependent on the number of columns. This is further confirmed by running this query against each table: SELECT MAX(1) FROM <table> WHERE 1=0 : 3.6s, 1.6s, 0.55s, 0.25s, 0.18s. This does match my other experiences; RedShift makes long running analytics scalable and faster, but makes short running analytics slower (this overhead that you have found becomes dominant).Brethren
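
A minimal sketch of the benchmark described in the two comments above (the DDL and data generation are assumptions; only the timed queries appear in the original):

-- e.g. the 400-column case: INT columns, ~1 million rows, then vacuum/analyse
CREATE TABLE t_0400 (field_0001 INT, field_0002 INT /* ... up to field_0400 */);
-- ... populate with ~1 million rows of random INTs ...
VACUUM t_0400;
ANALYZE t_0400;

-- timed query from the benchmark
SELECT COUNT(DISTINCT field_0001) FROM t_0400 WHERE field_0002 >= 0;

-- "no work" query used to expose the per-query overhead
SELECT MAX(1) FROM t_0400 WHERE 1=0;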
Hi, for the query that you are executing, you may try EXPLAIN to see how the query is processed. If the query scans multiple non-unique columns, there is a sequential scan involved. Do you distribute the data across multiple compute nodes in the cluster? During distribution, data is copied across the cluster to execute the query.Montero
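
For reference, the EXPLAIN suggestion applied to the original query would look like this:

EXPLAIN
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%';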
@Tother Can you point me to where you found information indicating that "queries slow down as the number of columns increases in a table"?Crump

I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.

I think part of the issue is that Redshift stores a minimum of 1MB per column per node. Having a lot of columns creates a lot of disk seek activity and I/O overhead.

  • 1MB blocks are problematic because most of that space will be empty, but it will still be read off the disk
  • Having lots of blocks means that column data will not be located as close together, so Redshift has to do a lot more work to find it.
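
One way to see the block counts involved is to count the 1MB blocks allocated per column from the STV_BLOCKLIST system table (a sketch; 'tablename' is a placeholder, and STV_BLOCKLIST may require elevated privileges):

-- count the 1MB blocks Redshift has allocated per column for a table
SELECT bl.col, COUNT(*) AS one_mb_blocks
FROM stv_blocklist bl
JOIN svv_table_info ti ON ti.table_id = bl.tbl
WHERE ti."table" = 'tablename'
GROUP BY bl.col
ORDER BY bl.col;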

Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See also: Why is an implicit table lock being released prior to end of transaction in RedShift?

FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using the bit-wise functions. One example table went from 1400 cols (~200GB) to ~60 cols (~25GB), and the query times improved more than 10x (30-40 secs down to 1-2 secs).
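
A rough sketch of that bit-packing approach (the table, column names, and bit positions are illustrative, not the real schema):

-- 64 former BOOLEAN columns packed into one BIGINT bitmask
CREATE TABLE events_packed (
    row_id   BIGINT,
    flags_01 BIGINT   -- bits 0..63
);

-- "is bit 5 set?" using integer arithmetic only (32 = 2^5);
-- if your cluster supports the PostgreSQL-style & operator,
-- WHERE (flags_01 & 32) <> 0 is equivalent
SELECT COUNT(1)
FROM events_packed
WHERE (flags_01 / 32) % 2 = 1;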

Clueless answered 4/9, 2015 at 15:46 Comment(4)
Hmm. So any idea what db would better suit my use case? A horizontally scalable, high-availability db with sub-second count queries using simple where clauses against a large number of attributes (3k) and approx 10M rows?Tother
We evaluated MemSQL whilst trying to resolve this issue. It is insanely fast but only on the second run of a given query. The first run is very slow because they deeply compile it using GCC. For us, because the queries are very ad-hoc, it was better to stay with Redshift and use the bit-wise functions. You could also try Google's BigQuery (I hear good things).Clueless
I realize that this conversation is over 3 years old, but since it came up on my Google search it may appear for others. It may be worth using Spectrum for this, which can handle more columns and can be queried via Redshift, without the overhead of MVCC controls (since the data stays in S3).Harwill
@Tother since so much time has gone by, I'm curious if you have an opinion as to the best features/products to handle such a wide table in the AWS ecosystem.Pawnshop
