I'm new to Hadoop. I know that HCatalog is a table and storage management layer for Hadoop, but how exactly does it work and how do I use it? Please give a simple example.
HCatalog supports reading and writing files in any format for which a Hive SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands.
It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.
HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).
Edit: Most of the text is from https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat.
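For illustration, here is a minimal sketch of a single DDL statement that touches both points above (a built-in storage format plus partition keys). The table and column names are made up, and it assumes the hcat CLI is on your PATH:

hcat -e "CREATE TABLE page_views (user STRING, url STRING) PARTITIONED BY (dt STRING) STORED AS RCFILE;"

Once the table is created this way, every tool that reads it through HCatalog sees the same schema and the same partitions.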
In short, HCatalog opens up the Hive metadata to other MapReduce tools. Every MapReduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). With the table-based abstraction, HCatalog-supported MapReduce tools do not need to care about where the data is stored, in which format, or in which storage layer (HBase or HDFS).
We also get WebHCat to submit jobs in a RESTful way if you configure WebHCat along with HCatalog; see the sketch below.
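For example, a Hive query can be submitted over WebHCat with a plain HTTP call. A hedged sketch, assuming WebHCat is running on its default port 50111 on localhost and that a user named hive is allowed to run queries:

curl -s -d user.name=hive \
     -d execute="select * from student;" \
     -d statusdir="/tmp/student.out" \
     'http://localhost:50111/templeton/v1/hive'

The response contains a job id you can poll, and the query output lands in the HDFS directory given by statusdir.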
Here is a very basic example of how to use HCatalog.
I have a table in Hive named STUDENT, stored in an HDFS location:
neethu 90
malini 90
sunitha 98
mrinal 56
ravi 90
joshua 8
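(For reference, here is a minimal sketch of how such a table might have been created in Hive; the column names name and marks and the tab delimiter are assumptions, not taken from the original table definition:)

CREATE TABLE student (name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;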
Now suppose I want to load this table into Pig for further transformation of the data. In this scenario I can use HCatalog:
When using table information from the Hive metastore with Pig, add the -useHCatalog option when invoking pig:
pig -useHCatalog
(you may need to export HCAT_HOME first, e.g. export HCAT_HOME=/usr/lib/hive-hcatalog/)
Now load this table into Pig:
A = LOAD 'student' USING org.apache.hcatalog.pig.HCatLoader();
Now you have loaded the table into Pig. To check the schema, just do a DESCRIBE on the relation:
DESCRIBE A;
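Writing back to Hive works the same way with HCatStorer. A hedged sketch, assuming the columns are called name and marks and that a target table student_pass with the same schema already exists in Hive (HCatStorer stores into existing tables; it does not create them):

B = FILTER A BY marks >= 90;
STORE B INTO 'student_pass' USING org.apache.hcatalog.pig.HCatStorer();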
Thanks
Adding to the other great answers, I would like to add an image for a clear understanding of how HCatalog works and at which layer it sits in the cluster.
Q: How exactly does it work?
As you mentioned, "HCatalog is a table and storage management layer for Hadoop", which gives a high-level abstraction to other frameworks such as MapReduce, Spark, and Pig by performing I/O operations against the distributed storage layer on behalf of Hive tables.
HCatalog comprises 3 key elements:
- SerDe: serialization/deserialization library for processing various data formats.
- Metastore DB: used to store the schemas of Hive tables.
- WebHCat/HCatalog REST: UI/REST layer on top of the metastore DB for web clients.
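To see the REST layer in action you can query WebHCat directly; a hedged sketch, assuming the default port 50111 and a user named hive:

curl -s 'http://localhost:50111/templeton/v1/status'
curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table?user.name=hive'

The first call checks that WebHCat is up; the second lists the tables the metastore knows about in the default database.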
Q: How do you use it?
Once HCatalog is installed and running successfully, you can do the following on the CLI:
usage: hcat { -e "<query>" | -f "<filepath>" }
[ -g "<group>" ] [ -p "<perms>" ]
[ -D"<name> = <value>" ]
-D <property = value> use hadoop value for given property
-e <exec> hcat command given from command line
-f <file> hcat commands in file
-g <group> group for the db/table specified in CREATE statement
-h,--help Print help information
-p <perms> permissions for the db/table specified in CREATE statement
Example:
./hcat -e "DESCRIBE employee;"
(Note: the hcat CLI supports Hive DDL commands that do not require a MapReduce job; SELECT queries should be run through Hive or WebHCat instead.)
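You can also keep the DDL in a file and run it with -f; a sketch using a hypothetical file name:

./hcat -f /tmp/create_employee.hql

where /tmp/create_employee.hql contains, for example:

CREATE TABLE employee (id INT, name STRING) STORED AS RCFILE;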
HCatalog is the metadata management layer for the Hadoop file system. HCatalog can be accessed through WebHCat, which uses a REST API. Tables created in HCatalog can be accessed through Hive and Pig.
HCatalog is a table and storage management tool for Hadoop that exposes the tabular data of the Hive metastore to other Hadoop applications. It enables users with different data processing tools to easily read and write tabular data, and it ensures that users don't have to worry about the storage format.