Hadoop, Hive, Pig, HBase, Cassandra - when to use what? [closed]

First of all, I am relatively new to Big Data and the Hadoop world, and I have just started to experiment a little with the Hortonworks Sandbox (Pig and Hive so far). In which cases should I use each of the tools mentioned above: Hadoop, Hive, Pig, HBase and Cassandra?

In my sandbox environment, with a file of just 9 MB, Hive and Pig had response times of seconds to minutes. This is obviously not usable in some situations, for example web applications (unless the cause is something else, such as my virtual machine setup).

My guesses about the correct usages are:

Hadoop: Just the technological base for the rest; only very few use cases where it would be used directly
Hive or Pig: For analytical processes that run once per hour or day
HBase or Cassandra: For real-time applications (e.g. web applications) where response times of 100 ms or less are required

Additionally, when to use HBase as opposed to when to use Cassandra?

Thanks!

Your guesses are somewhat accurate.

By Hadoop, I guess you are referring to MapReduce? Hadoop as such is an ecosystem which consists of many components (including MapReduce, HDFS, Pig and Hive).

MapReduce is a good fit when you need to write the processing logic yourself at the level of the Map() and Reduce() methods. In my work, I find MapReduce very useful when I'm dealing with data that is unstructured and needs to be cleansed.
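To make the Map() and Reduce() split a bit more concrete, here is a minimal local sketch in plain Python. It is not Hadoop code and does not use the Hadoop API: the names map_record, reduce_counts and run_job are made up for illustration, and run_job is only a stand-in for the grouping (shuffle) step that the framework would normally perform across a cluster. The cleansing rule (lowercase, keep alphabetic tokens only) is likewise just an assumed example.

```python
import re
from collections import defaultdict


def map_record(line):
    """Map step (hypothetical): cleanse one raw line and emit (key, 1) pairs."""
    # Cleansing rule assumed for this sketch: lowercase, alphabetic tokens only.
    for token in re.findall(r"[a-z]+", line.lower()):
        yield token, 1


def reduce_counts(key, values):
    """Reduce step (hypothetical): combine all values emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Local stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            grouped[key].append(value)
    # Keys are processed in sorted order, so the result is deterministic.
    return dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    raw = ["Error!! disk FULL", "error: Disk   full??", "ok"]
    print(run_job(raw))  # {'disk': 2, 'error': 2, 'full': 2, 'ok': 1}
```

The point of the sketch is only that all custom logic lives in the two small functions, while everything between them (grouping by key) is the framework's job. In a real job the same division applies, just distributed over many machines.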

Hive, Pig: They are good for batch processes that run periodically (for example, every few hours or days).

HBase & Cassandra: They support low-latency calls, so they can be used for real-time applications where response time is key. Have a look at this discussion to get a better idea about HBase vs Cassandra.
