I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a limit of 1600 columns per table. Worse, queries slow down as the number of columns in a table increases.
What doesn't make sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that aren't referenced in a query's select list or WHERE clause.
More specifically, when TableName has 1600 columns, the query below is substantially slower than when TableName has, say, 1000 columns and the same number of rows. Performance improves steadily as the number of columns decreases.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
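For reference, here is a rough sketch of how per-column block usage can be inspected via Redshift's STV_BLOCKLIST and STV_TBL_PERM system tables, to check whether the extra columns actually translate into more blocks touched; 'tablename' below is a placeholder for the real table name:

-- Rough sketch: count the 1 MB blocks stored for each column of 'tablename'.
-- Redshift's hidden system columns show up as col values beyond the user-defined ones.
SELECT b.col, COUNT(*) AS blocks
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id
 AND b.slice = p.slice
WHERE p.name = 'tablename'
GROUP BY b.col
ORDER BY b.col;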
My three questions are:
- What's the deal? Why does Redshift have this limitation if it claims to be a column store?
- Any suggestions for working around this limitation? Joining multiple smaller tables seems to eventually approximate the performance of a single wide table. I haven't tried pivoting the data into a tall layout (see the sketch after this list).
- Does anyone have a suggestion for a fast, horizontally scalable column-store database with real-time query performance that doesn't have the above limitations? All we're doing is count queries with simple WHERE restrictions against roughly 10M rows x 2500 columns of data.
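For reference, a rough sketch of what the pivot workaround could look like: instead of ~2500 physical columns, the data is stored one row per (row_id, field_name, value), and the count query filters on the field name. All names here are hypothetical, and the DISTKEY/SORTKEY choices would need tuning:

-- Hypothetical tall layout: one row per (row_id, field_name, field_val)
-- instead of one physical column per field.
CREATE TABLE tall_table (
    row_id     BIGINT,
    field_name VARCHAR(32),
    field_val  VARCHAR(256)
)
DISTKEY (row_id)
SORTKEY (field_name);

-- The wide-table query
--     SELECT COUNT(1) FROM TableName WHERE ColumnName LIKE '%foo%';
-- then becomes a filter on the field name plus the same predicate on the value:
SELECT COUNT(1)
FROM tall_table
WHERE field_name = 'ColumnName'
  AND field_val LIKE '%foo%';

Sorting on field_name should mean a count for a single field only scans that field's portion of the table, which is closer to what a column store is supposed to give you, though I haven't verified this at the 10M x 2500 scale.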
Follow-up timings from the comments:

SELECT COUNT(DISTINCT field_0001) FROM <table> WHERE field_0002 >= 0;
The total response times, including network protocol overhead, were: 7.7s, 2.9s, 1.1s, 0.5s, 0.3s. I'm gob-smacked, but it's repeatable. – Brethren

SELECT MAX(1) FROM <table> WHERE 1=0;
The same measurement gave: 3.6s, 1.6s, 0.55s, 0.25s, 0.18s. This matches my other experience: Redshift makes long-running analytics scalable and faster, but makes short-running analytics slower (the overhead you have found becomes dominant). – Brethren