Columnar storage: Cassandra vs Redshift
Asked Answered
G

2

10

How is columnar storage in the context of a NoSQL database like Cassandra different from that in Redshift. If Cassandra is also a columnar storage then why isn't it used for OLAP applications like Redshift?

Gnome answered 10/10, 2018 at 11:27 Comment(5)
Why are you asking? do you have a use case in mind? this question is far too high level at the moment, with not enough information.Jolie
Cassandra has very limited version of SQL while OLAP applications ideally support complex SQL statements with window functions etc. Seems like the primary goal of columnar storage in Cassandra is only to distribute big data for high load OLTP services, not for distribution + fast aggregation like in OLAP engines.Parthenope
@JonScott I just want to understand how the column oriented storage in Redshift is different from that in Cassandra. If the data for each column is stored separately in Cassandra, what is it that makes it unsuitable for OLAP?Gnome
You CAN run OLAP on cassandra(+spark if you are serious about it). There is a lot more to the differences in architecture as well as columnar storage. I suggest you read up on the architectures if you are interested. Using spark on top of Cassandra could be interesting for OLAP, so why don't you try out both with some of your real world data. Redshift will take you no time to set up and use. Cassandra+Spark will be more challenging. I suggest the following resource. academy.datastax.com/units/…Jolie
For starters, Cassandra is not a colmnar data store.Transparency
Z
16

The storage engines of Cassandra and Redshift are very different, and are created for different cases. Cassandra's storage not really "columnar" in wide known meaning of this type of databases, like Redshift, Vertica etc, it is much more closer to key-value family in NoSQL world. The SQL syntax used in Cassandra is not any ANSI SQL, and it has very limited set of queries that can be ran there. Cassandra's engine built for fast writing and reading of records, based on key, while Redshift's engine is built for fast aggregations (MPP), and has wide support for analytical queries, and stores,encodes and compresses data on column level.

It can be easily understood with following example:

Suppose we have a table with user id and many metrics (for example weight, height, blood pressure etc...). I we will run aggregate the query in Redshift, like average weight, it will do the following (in best scenario):

  1. Master will send query to nodes.

  2. Only the data for this specific column will be fetched from storage.

  3. The query will be executed in parallel on all nodes.

  4. Final result will be fetched to master.

Running same query in Cassandra, will result in scan of all "rows", and each "row" can have several versions, and only the latest should be used in aggregation. If you familiar with any key-value store (Redis, Riak, DynamoDB etc..) it is less effective than scanning all keys there.

Cassandra many times used for analytical workflows with Spark, acting as a storage layer, while Spark acting as actual query engine, and basically shouldn't be used for analytical queries by its own. With each version released more and more aggregation capabilities are added, but it is very far from being real analytical database.

Zymo answered 11/10, 2018 at 11:5 Comment(0)
U
1

I encountered the same question today, and found that this resource on AWS: https://aws.amazon.com/nosql/columnar/

Upwards answered 11/3, 2021 at 5:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.