Difference between Pig and Hive? Why have both? [closed]

19

259

My background: four weeks old in the Hadoop world. I have dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and have read Google's papers on MapReduce and GFS (PDF link).

I understand that:

  • Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), whereas Hive's query language closely resembles SQL.

  • Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong, but Hive is closely coupled to Hadoop.

  • Both Pig Latin and Hive commands compile to Map and Reduce jobs.

My question: what is the goal of having both when one (say Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?

Equity answered 28/7, 2010 at 18:42 Comment(2)
Hive is for structured data. Pig is for unstructured data.Summertime
Note for current readers: Pig has not seen much innovation and is considered deprecated by many. Most answers below do not reflect this, as they were written some time ago.Modestine
B
152

Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when you would use a SQL-like language such as Hive rather than Pig. He makes a very convincing case for the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.

Boil answered 29/7, 2010 at 6:56 Comment(8)
Alan also has an article discussing Hive specifically, as shared by j03m below. Good stuff from him!Ailbert
Hive is for structured data. Pig is for unstructured data.Summertime
I'm confused. Did you mean to say "[...] usefulness of a procedural language like Pig"? Because the article repeatedly claims that "Pig Latin is Procedural".Balkin
I'm not sure if it's temporary, but the article seems to be gone. Can you update the link (I couldn't find it with a quick search)?Halvah
It's back now, for me at least.Boil
Alan Gates' post is here; please go through it: developer.yahoo.com/blogs/hadoop/…Diminuendo
I fixed the procedural/declarative typo in the answer to reflect what's stated in the linked article.Johny
The content is not available at the link.Profundity
T
57

Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used at Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned AFAIK, so the difference between the two projects is very clear.

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like MicroStrategy. Hive has an ODBC/JDBC driver (that's a work in progress) that should allow this to happen in the near future. It's also beginning to add support for indexes, which should allow support for the drill-down queries common in such environments.

Finally (this is not directly pertinent to the question), Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (which is a key-value store like those found in the guts of most RDBMSes), and the HadoopDB project has used Hive to query a federated RDBMS tier.

Talley answered 5/8, 2010 at 7:23 Comment(0)
M
37

I found this the most helpful (though it's a year old): http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

It specifically talks about Pig vs Hive and when and where they are employed at Yahoo. I found this very insightful. Some interesting notes:

On incremental changes/updates to data sets:

Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach. This will take only a few minutes. Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case.
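The incremental idea in this quote can be sketched roughly as follows. This is plain Python, not Pig Latin, and the data and the helper name `inner_join` are hypothetical, chosen only for illustration. It assumes the right-hand side (here `users`) is unchanged and that the new records are disjoint from the old ones.

```python
from collections import defaultdict

def inner_join(left, right):
    """Hash join of two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index[key]]

# Hypothetical data: clicks keyed by user id, joined against user names.
users = [(1, "alice"), (2, "bob")]
old_clicks = [(1, "home"), (2, "search")]
new_clicks = [(1, "cart")]                            # today's increment

previous_full_join = inner_join(old_clicks, users)    # computed earlier
incremental = inner_join(new_clicks, users)           # small and cheap
combined = previous_full_join + incremental

# For comparison only: a full re-join over all the data.
full_rejoin = inner_join(old_clicks + new_clicks, users)
```

Under those assumptions, `combined` holds the same rows as `full_rejoin`, but only the small increment had to be joined again.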

On using other tools via streaming:

Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.

On using Hive for data warehousing:

In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.

The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as ODBC.

Mestizo answered 22/11, 2011 at 20:4 Comment(6)
+1 great to see a comparison from Yahoo, who is, from what I understand, the original creator of Pig, or at least a very big proponent. Edit: from Jakob above, I see that the author (Alan Gates) is the Pig Architect at Yahoo, so great share :)Ailbert
The link is dead. I think the correct URL at this moment is: https://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html.Inquiry
Updated link per aboveMestizo
another new link: yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahooSorci
The two links shared above can no longer be found.Gazo
great article, very informative cheers :)Interpreter
D
28

Hive is better than PIG in: Partitions, Server, Web interface & JDBC/ODBC support.

Some differences:

  1. Hive is best for structured Data & PIG is best for semi structured data

  2. Hive is used as a declarative SQL & PIG as a procedural language

  3. Hive supports partitions & PIG does not

  4. Hive defines tables with a schema and stores the schema information in a database & PIG doesn't have a dedicated metadata database

  5. Pig also supports an additional COGROUP feature for performing outer joins, which Hive does not. But both Hive & PIG can join, order & sort dynamically.
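To make point 5 a bit more concrete, here is a rough sketch of what a COGROUP-style operation produces, as I understand it: one group per key, with a separate bag of values from each input. This is plain Python, not Pig Latin, and the function name `cogroup` and the sample data are made up for illustration.

```python
from collections import defaultdict

def cogroup(left, right):
    """Group two lists of (key, value) pairs by key, one bag per input."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)

# Hypothetical inputs keyed by person name.
owners = [("alice", "cat"), ("bob", "dog")]
friends = [("alice", "carol"), ("dave", "erin")]

grouped = cogroup(owners, friends)
# A key present in only one input keeps an empty bag for the other input,
# which is what makes outer-join style results possible.
```

Here `grouped["bob"]` has an empty second bag and `grouped["dave"]` an empty first bag, so no key is dropped.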

Detritus answered 26/10, 2015 at 18:45 Comment(0)
S
17

I believe that the real answer to your question is that they are/were independent projects and there was no centrally coordinated goal. They were in different spaces early on and have grown to overlap with time as both projects expand.

Paraphrased from the Hadoop O'Reilly book:

Pig: a dataflow language and environment for exploring very large datasets.

Hive: a distributed data warehouse

Stabilizer answered 28/7, 2010 at 19:8 Comment(1)
Hive is nothing like an RDBMS. It processes flat files just like Pig. They both basically do the same thing. Look at the optimizers that they use when compiling the job, as that is the largest real difference.Gingili
I
12

You can achieve similar results with Pig/Hive queries. The main difference lies in the approach to understanding/writing/creating queries.

Pig tends to create a flow of data: small steps, in each of which you do some processing.
Hive gives you a SQL-like language to operate on your data, so the transition from an RDBMS is much easier (Pig can be easier for someone who has no earlier experience with SQL).
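As a rough illustration of the two styles (not actual Pig Latin or HiveQL), the sketch below computes the same result twice in Python: once as a series of small named steps, and once as a single SQL statement run through the standard library's `sqlite3` module. The table name `people` and the data are made up.

```python
import sqlite3
from collections import Counter

rows = [("alice", 30), ("bob", 17), ("carol", 45), ("alice", 22)]

# Dataflow style: each step names an intermediate result.
adults = [(name, age) for name, age in rows if age >= 18]
counts = Counter(name for name, _ in adults)
dataflow_result = sorted(counts.items())

# Declarative style: one statement says what is wanted, not how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT name, COUNT(*) FROM people WHERE age >= 18 "
    "GROUP BY name ORDER BY name"
).fetchall()
conn.close()
```

Both `dataflow_result` and `sql_result` hold the same name/count pairs; the difference is only in how the query is expressed.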

It is also worth noting that for Hive you get a nice interface to work with this data (Beeswax for HUE, or the Hive web interface), and it also gives you a metastore for information about your data (schema, etc.), which is useful as a central source of information about your data.

I use both Hive and Pig for different queries (I use whichever lets me write the query faster/easier, and I mostly work this way for ad-hoc queries); they can use the same data as input. But currently I'm doing much of my work through Beeswax.

Inenarrable answered 28/7, 2010 at 20:27 Comment(0)
P
12

Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.

Hive, which is RDBMS based, needs the data to be first imported (or loaded) and after that it can be worked upon. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and use hive on each filled bucket, while using other buckets to keep storing the newly arriving data.

Pig also uses lazy evaluation. It allows greater ease of programming, and one can use it to analyze data in different ways with more freedom than in a SQL-like language such as Hive. So if you really wanted to analyze matrices or patterns in some unstructured data you had, and wanted to do interesting calculations on them, with Pig you can go some fair distance, while with Hive you need something else to play with the results.
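A toy sketch of what lazy evaluation means here, assuming the usual description that the steps of a script only build up a plan and nothing runs until output is requested. This is plain Python; the `Relation` class and its `filter`, `foreach` and `dump` methods are hypothetical stand-ins, not Pig's API.

```python
class Relation:
    """Records transformation steps; nothing runs until dump() is called."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        return Relation(self.source, self.steps + (("filter", predicate),))

    def foreach(self, func):
        return Relation(self.source, self.steps + (("foreach", func),))

    def dump(self):
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "filter":
                data = [row for row in data if fn(row)]
            else:
                data = [fn(row) for row in data]
        return data

calls = []

def double(x):
    calls.append(x)   # side effect so we can see when work actually happens
    return x * 2

plan = Relation(range(5)).filter(lambda x: x % 2 == 1).foreach(double)
work_before_dump = len(calls)   # no calls yet: only a plan exists
result = plan.dump()            # execution happens here
```

Because the whole plan is known before anything runs, an engine built this way has the chance to reorder or combine steps before execution.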

Pig is faster in the data import but slower in actual execution than an RDBMS friendly language like Hive.

Pig is well suited to parallelization and so it possibly has an edge for systems where the datasets are huge, i.e. in systems where you are concerned more about the throughput of your results than the latency (the time to get any particular datum of result).

Phlox answered 22/3, 2014 at 14:4 Comment(0)
H
11

Hive vs Pig:

Hive is a SQL interface that serves SQL-savvy users and other tools such as Tableau/MicroStrategy, or any other tool or language that has a SQL interface.

PIG is more like an ETL pipeline, with step-by-step commands such as declaring variables, looping, iterating, conditional statements, etc.

I prefer writing Pig scripts over HiveQL when I want to write complex step-by-step logic. When I am comfortable writing a single SQL statement to pull the data I want, I use Hive. For Hive you will need to define a table before querying (as you do in an RDBMS).

The purposes of the two are different, but under the hood both do the same thing: convert to MapReduce programs. Also, the Apache open source community is adding more and more features to both of their projects.

Hawthorn answered 24/12, 2015 at 17:55 Comment(0)
S
8

Read about the difference between PIG and HIVE at this link.

http://www.aptibook.com/Articles/Pig-and-hive-advantages-disadvantages-features

All the aspects are covered. If you are confused about which to choose, you should read that web page.

Stale answered 5/9, 2013 at 16:39 Comment(1)
Good article, but you should summarize it in the answer: meta.stackexchange.com/questions/8231/…Not
C
7
  1. Pig Latin is a data flow style that is more suitable for software engineers, while SQL is more suitable for analytics people who are used to SQL. For a complex task in Hive you have to manually create temporary tables to store intermediate data, but this is not necessary in Pig.

  2. Pig Latin is suitable for complicated data structures (like a small graph). There's a data structure in Pig called DataBag, which is a collection of Tuples. Sometimes you need to calculate metrics that involve multiple tuples (there's a hidden link between the tuples; in this case I would call it a graph). In this case, it is very easy to write a UDF to calculate the metrics that involve multiple tuples. Of course it could be done in Hive, but it is not as convenient as it is in Pig.

  3. Writing a UDF in Pig is much easier than in Hive, in my opinion.

  4. Pig has no metadata support (or rather it is optional; in future it may integrate HCatalog). Hive has its tables' metadata stored in a database.

  5. You can debug a Pig script in a local environment, but that would be hard to do for Hive. The reason is point 4: you need to set up the Hive metadata in your local environment, which is very time consuming.
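For point 2, the kind of UDF meant is a function that receives a whole bag of tuples for one group and computes a metric across them. A rough Python sketch of that shape follows (not Pig's Java UDF API; the name `max_gap` and the data are hypothetical):

```python
def max_gap(bag):
    """Hypothetical UDF: largest gap between consecutive timestamps in a bag."""
    times = sorted(ts for ts, _ in bag)
    return max((b - a for a, b in zip(times, times[1:])), default=0)

# Bags of (timestamp, event) tuples, already grouped by user.
grouped = {
    "alice": [(10, "login"), (25, "click"), (31, "logout")],
    "bob": [(5, "login")],
}

# The metric depends on several tuples at once, not on any single row.
metrics = {user: max_gap(bag) for user, bag in grouped.items()}
```

The point is only that the function sees all tuples of a group together, which is what makes cross-tuple metrics easy to write.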

Carmellacarmelle answered 15/7, 2013 at 23:37 Comment(0)
T
5

I found the link below useful for exploring how and when to use HIVE and PIG.

http://www.hadoopwizard.com/when-to-use-pig-latin-versus-hive-sql/

Triboluminescence answered 20/9, 2013 at 7:11 Comment(0)
L
4

Here are some additional links on when to use Pig or Hive.

http://aws.amazon.com/elasticmapreduce/faqs/#hive-8

http://www.larsgeorge.com/2009/10/hive-vs-pig.html

Lieb answered 3/8, 2011 at 9:10 Comment(0)
C
4

From the link: http://www.aptibook.com/discuss-technical?uid=tech-hive4&question=What-kind-of-datawarehouse-application-is-suitable-for-Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) When the data is not changing rapidly.

Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

Cristinacristine answered 29/9, 2013 at 6:0 Comment(0)
A
4

In simpler words, Pig is a high-level platform for creating MapReduce programs used with Hadoop. Using Pig scripts, we process a large amount of data into the desired format.

Once the processed data is obtained, it is kept in HDFS for later processing to obtain the desired results.

On top of the stored processed data we apply Hive SQL commands to get the desired results; internally these Hive SQL commands run MapReduce programs.

Angleaangler answered 7/1, 2014 at 1:56 Comment(1)
This isn't really a meaningful addition to the knowledge base. Try adding more info.Uterus
G
4

When we use Hadoop, it means we are trying to do huge-scale data processing. The end goal of the data processing would be to generate content/reports out of it.

So it internally consists of 2 prime activities:

1) Loading / data processing

2) Generating content and using it for reporting, etc.

Loading / data processing -> Pig would be helpful here.

It serves as an ETL tool (we can perform ETL operations using Pig scripts).

Once the result is processed, we can use Hive to generate the reports based on the processed result.

Hive: it's built on top of HDFS for warehouse processing.

We can easily generate ad hoc reports using Hive from the processed content generated by Pig.

Ghazi answered 29/5, 2014 at 3:45 Comment(0)
N
4

To give a very high level overview of both, in short:

1) Pig is a relational algebra over hadoop

2) Hive is a SQL over hadoop (one level above Pig)

Nickelson answered 4/10, 2014 at 7:56 Comment(1)
Algebra comparison is interestingDetritus
W
3

What HIVE can do which is not possible in PIG?

Partitioning can be done in HIVE but not in PIG; it is a way of bypassing data that a query does not need to read.
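A rough sketch of why partitioning helps, in plain Python rather than HiveQL: the data is stored in one chunk per value of the partition column (Hive keeps one directory per partition value), so a query that filters on that column only has to read the matching chunk. The layout, the dates and the `scan` helper are made up for illustration.

```python
# Hypothetical layout: one list of rows per partition value, standing in
# for one directory per value of the partition column.
partitions = {
    "2015-03-27": [("alice", 3), ("bob", 1)],
    "2015-03-28": [("alice", 5)],
    "2015-03-29": [("carol", 2)],
}

def scan(partitions, day=None):
    """Return (rows, partitions_read); a day filter skips other partitions."""
    wanted = [day] if day is not None else list(partitions)
    rows = [row for d in wanted for row in partitions.get(d, [])]
    return rows, len(wanted)

rows, read = scan(partitions, day="2015-03-28")
all_rows, read_all = scan(partitions)
```

With the filter only one of the three partitions is touched; without it all three are read.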

What PIG can do which is not possible in HIVE?

Positional referencing: even when you don't have field names, you can reference fields by position, like $0 for the first field, $1 for the second and so on.
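The same idea in a small Python sketch (not Pig Latin; the sample data is made up): the records have no header and no declared schema, so fields are picked out purely by position.

```python
import csv
import io

raw = "alice,30,london\nbob,17,paris\n"

# No header and no declared schema: fields are addressed by position only.
rows = [tuple(rec) for rec in csv.reader(io.StringIO(raw))]
names = [row[0] for row in rows]     # plays the role of $0
cities = [row[2] for row in rows]    # plays the role of $2
```

Note that without a schema every field is just a string here; giving fields names and types is an extra, optional step.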

Another fundamental difference is that PIG doesn't need a schema to write the values, but HIVE does need a schema.

You can connect to HIVE from any external application using JDBC and others, but not to PIG.

Note: Both run on top of HDFS (Hadoop Distributed File System) and their statements are converted to MapReduce programs.

Watteau answered 29/3, 2015 at 4:32 Comment(0)
K
1

Pig eats anything! Meaning it can consume unstructured data.

Hive requires a schema.

Kwangju answered 20/2, 2015 at 17:55 Comment(0)
L
1

Pig is generally useful for ETL kinds of workloads, for example a set of transformations you need to apply to your data every day.

Hive shines when you need to run ad hoc queries or just want to explore data. It can sometimes act as the interface to your visualisation layer (Tableau/QlikView).

Both are essential and serve different purposes.

Lamonica answered 13/11, 2015 at 20:6 Comment(0)
