Advantages of databases like Greenplum or Vertica compared to MongoDB or Cassandra [closed]

I am currently working on a few projects with MongoDB and Apache Cassandra respectively. I also use Solr a lot, and I am handling "lots" of data with them (approx. 1-2 TB). I heard of Greenplum and Vertica for the first time last week, and I am not really sure where to place them in my mental map. They seem to me like Data Warehouse (DWH) solutions, and I haven't really worked with a DWH. They also seem to cost a lot of money (e.g. $60k for 1 TB of storage in Greenplum). I am not currently handling petabytes of data and I don't think I will be, but products like Cassandra also seem to be able to handle this:

Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data.

via http://www.datastax.com/why-cassandra

So my question: Why should people use Greenplum & Co? Is there a huge advantage in comparison to these other products?

Thanks.

Starlin answered 24/1, 2012 at 13:36 Comment(0)

Cassandra, Greenplum and Vertica all handle huge amounts of data but in very different ways.

Some made-up use cases where each database has its strengths:

Use cassandra for:

INSERT INTO tweets (user, data) VALUES (:user, :blob);
SELECT data FROM tweets WHERE user = :user;

Use greenplum for:

begin;
update account set balance = balance - 10 where account_id = 1;
update account set balance = balance + 10 where account_id = 2;
commit;

Use Vertica for:

select sum(balance)
over (partition by region order by account rows unbounded preceding)
from transactions;
Lochner answered 17/2, 2012 at 23:40 Comment(0)

I work in the telecom industry. We deal with large data sets and complex EDW (enterprise data warehouse) models. We started with Teradata and it was good for a few years. Then the data grew exponentially, and as you know, expanding Teradata is expensive. So we evaluated EMC Greenplum, Oracle Exadata, HP Vertica and IBM Netezza.

In speed, generating 20 reports ranked like this: 1. Vertica, 2. Netezza, 3. Greenplum, 4. Oracle.

In compression ratio: Vertica had a natural advantage; among the others, IBM is good too. The worst in our benchmarks were EMC and Oracle — as you'd expect, since both want to sell tons of storage and hardware.

Scalability: All do scale well.

Loading time: EMC is the best here; the others (Teradata, Vertica, Oracle, IBM) are good too.

Concurrent user queries: Vertica, EMC Greenplum, then IBM. Oracle Exadata is comparatively slow for any type of query, but much better than its old-school 10g.

Price: Teradata > Oracle > IBM > HP > EMC

Note: you need to compare apples to apples: same number of cores, RAM, data volume, and reports.

We chose Vertica for its hardware-independent pricing model, lower price and good performance. Now all 40+ users are happy to generate reports without waiting, and it all fits on low-cost HP DL380 servers. It is great for the OLAP/EDW use case.

All this analysis applies only to the EDW/analytics/OLAP case. I am still an Oracle fanboy for all OLTP, rich PL/SQL, connectivity etc. on any hardware or system. Exadata handles a mixed workload decently, but its price/performance ratio is unreasonable, and you still need to migrate 10g code to Exadata best practices (sort of MPP-like, bulk processing etc.), which is more time-consuming than they claim.

Rhesus answered 7/11, 2012 at 2:30 Comment(0)

We've been working with Hadoop for 4 years, and Vertica for 2. We had massive loading and indexing problems with our tables in MySQL. We were running on fumes with our home-grown sharding solution. We could have invested heavily in developing a more sophisticated sharding solution, which would have been quite painful, imo. We could have thought harder about what data we absolutely needed to keep in a SQL database.

But at the end of the day, switching from MySQL to Vertica was what we chose. Vertica performance patterns are quite different from MySQL's, which comes with its own headaches. But it can load a lot of data very quickly, and it is good at heavy duty queries that would make MySQL's head spin.
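For context on the loading speed mentioned above: Vertica bulk-loads through its COPY statement rather than row-by-row INSERTs. A minimal sketch — the table, columns and file path here are made up, not from any real deployment:

```sql
-- Hypothetical table for illustration.
CREATE TABLE events (
    user_id  INT,
    ts       TIMESTAMP,
    payload  VARCHAR(1000)
);

-- COPY is Vertica's bulk-load path; DIRECT writes straight to on-disk
-- storage (ROS), bypassing the in-memory write buffer -- the usual
-- choice for large batch loads.
COPY events FROM '/data/events.csv' DELIMITER ',' DIRECT;
```

Row-wise INSERTs into a columnar store are comparatively expensive, which is why batch COPY is the idiomatic way to feed Vertica.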

The way I see it, Vertica is a solution when you are already invested in SQL and need a heavier-duty SQL database. I'm not an expert, so I couldn't tell you what a transition to Oracle or DB2 would have been like compared to Vertica, either in terms of integration effort or monetary cost.

Vertica offers a lot of features we've barely looked into. Those might be very attractive to others with use cases different to ours.

Crucifixion answered 25/1, 2012 at 2:22 Comment(0)

I'm a Vertica DBA and prior to that was a developer with Vertica. Michael Stonebraker (the guy behind Ingres, Vertica, and other databases) has some critiques of NoSQL that are worth listening to.

Basically, here are the advantages of Vertica as I see them:

  • it's rather fast on large amounts of data
  • its performance is similar (so I gather) to other data warehousing solutions, but its advantage is clustering and commodity hardware, so you can scale by adding more commodity hardware. It looks cheap in terms of overall cost per TB. (Going from memory, not an exact quote.)
  • Again, it's for data warehousing.
  • You get to use traditional SQL and tables. It's under the hood that's different.
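To sketch the "traditional SQL on top, different under the hood" point: you define ordinary tables, and Vertica physically stores them as projections — sorted, compressed, column-oriented copies segmented across the cluster. The table and projection below are illustrative, not from any real schema:

```sql
-- An ordinary-looking SQL table.
CREATE TABLE sales (
    region   VARCHAR(20),
    account  INT,
    balance  DECIMAL(12,2)
);

-- Under the hood, Vertica stores the data in projections: this one is
-- sorted by (region, account) for range scans and hash-segmented on
-- account so the rows are spread across all nodes in the cluster.
CREATE PROJECTION sales_by_region AS
SELECT region, account, balance
FROM sales
ORDER BY region, account
SEGMENTED BY HASH(account) ALL NODES;
```

Queries are still written against the table; the optimizer picks a suitable projection, which is what makes the column-store machinery invisible to the SQL you write.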

I can't speak to the other products, but I'm sure a lot of them are fine too.

Edit: Here's a talk from Stonebraker: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb

Touzle answered 25/1, 2012 at 15:36 Comment(0)

Pivotal, formerly Greenplum, is a well-funded spinoff from EMC, VMware and GE. Pivotal's market is enterprises (and Homeland Cybersecurity agencies) with multi-petabyte databases needing complex analytics and high-speed ETL. Greenplum's origin is a PostgreSQL DB redesigned as a shared-nothing MPP engine with MapReduce support, with later additions for columnar support and HDFS. It marries the best of SQL + NoSQL, making NewSQL.

Features:

  • In 2015H1 most of their code, including Greenplum DB & HAWQ, will go Open Source. Some advanced management & performance features at the top of the stack will remain proprietary.
  • MPP (Massively Parallel Processing) shared-nothing RDBMS designed for multi-terabyte to multi-petabyte environments.
  • Full SQL compliance - supporting all versions of SQL: '92, '99, 2003 OLAP, etc. 100% compatible with PostgreSQL 8.2.
  • The only SQL-over-Hadoop engine capable of handling all 99 queries of the TPC-DS benchmark standard without rewriting. The competition cannot run many of them and is significantly slower. See the SIGMOD whitepaper.
  • ACID compliance.
  • Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files.
  • Solr/Lucene integration for multi-lingual full-text search embedded in the SQL.
  • Incorporates Open Source Software: Spring, Cloud Foundry, Redis.io, RabbitMQ, Grails, Groovy, Open Chorus, Pig, ZooKeeper, Mahout, MADlib, MapR. Some of these are used at EBSCO.
  • Native connectivity to HBase, which is a popular column-store-like technology for Hadoop.
  • VMware's participation in $150m investment in MongoDB will likely lead to integration of petabyte-scale XML files.
  • Table-by-table specification of distribution keys lets you design your table schemas to take advantage of node-local joins and GROUP BYs, but it will perform well even without this.
  • Row and/or Column-oriented data storage. It is the only database where a table can be polymorphic with both columnar and row-based partitions as defined by the DBA.
  • A column-store table can have a different compression algorithm per column because different datatypes have different compression characteristics to optimize their storage.
  • Advanced Map-Reduce-like CBO Query Optimizer – queries can be run on hundreds of thousands of nodes.
  • It is the only database with a dynamic distributed pipeline execution model for query processing. While older databases rely on materialized execution, Greenplum doesn't have to write data to disk at every intermediate query step. It streams data to the next stage of the query plan in memory and never has to materialize it to disk, so it's much faster than anything demonstrated on Hadoop.
  • Complex queries on large data sets are solved in seconds or even sub-seconds.
  • Data management – provides table statistics, table security.
  • Deep analytics – including data mining or machine learning algorithms using MADlib. Deep Semantic Textual Analytics using GPText.
  • Graphical Analysis - billion edge distributed in-memory graph database and algorithms using GraphLab.
  • Integration of SQL, Solr indexes, GPText, MADlib and GraphLab in a single query for massive syntactical parsing and graph/matrix affinity analysis for deep search analytics.
  • Fully ODBC/JDBC compliant.
  • Distributed ETL rate of 16 TB/hr!! Integration with Talend available.
  • Cloud support: Pivotal plans to package its Cloud Foundry software so that it can be used to host Pivotal atop other clouds as well, including Amazon Web Services' EC2. Pivotal data management will be available for use in a variety of cloud settings and will not be dependent on a proprietary VMware system. Will target OpenStack, vSphere, vCloud Director, or private brands. IBM announced it has standardized on Cloud Foundry for its PaaS. Confluence page.
  • Two hardware "appliance" offerings: Isilon NAS & Greenplum DCA.
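The distribution-key and polymorphic row/column storage bullets above can be sketched in Greenplum DDL. Everything here — table names, columns, and settings — is a hypothetical illustration, not a recommended schema:

```sql
-- Rows are hashed on account_id across segments, so joins and GROUP BYs
-- keyed on account_id can run segment-local without redistribution.
CREATE TABLE account (
    account_id  BIGINT,
    region      TEXT,
    balance     NUMERIC(12,2)
) DISTRIBUTED BY (account_id);

-- Storage is chosen per table (or per partition): here an append-only,
-- column-oriented table with zlib compression, range-partitioned by date.
CREATE TABLE transactions (
    txn_id      BIGINT,
    account_id  BIGINT,
    txn_date    DATE,
    amount      NUMERIC(12,2)
)
WITH (appendonly=true, orientation=column, compresstype=zlib)
DISTRIBUTED BY (account_id)
PARTITION BY RANGE (txn_date)
( START (date '2012-01-01') INCLUSIVE
  END (date '2013-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```

Since both tables share the same distribution key, a join between them on account_id stays local to each segment, which is the point of the node-local join bullet.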
Bloke answered 16/4, 2015 at 21:49 Comment(0)

There is a lot of confusion about when to use a row database like MySQL or Oracle or a columnar DB like Infobright or Vertica or a NoSQL variant or Hadoop. We wrote a white paper to try to help sort out which technologies are best suited for which use cases - you can download Emerging Database Landscape (scroll half way down) or watch an on-demand webinar on the same topic.

Hope either is useful for you

Maidenhead answered 25/1, 2012 at 19:28 Comment(3)
It should be mentioned that Vertica can ingest data from Hadoop. They aren't mutually exclusive. – Touzle
None of the links provided work. The Emerging Data Landscape shows 404, and the on-demand webinar does not show any video listed. Care to update the links? – Antitoxin
Here is the updated link: tdwi.org/whitepapers/2011/10/… – Fleshly
