Is there a common place to store data schemas in Hadoop?

Asked 30/5, 2013 at 17:46 Answered 1/6, 2013 at 12:34

I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to be attached to the files at all. The data files are just flat files (unless using something like a SequenceFile). Each application that wants to work with those files has its own way of representing their schema.

For example, I load a file into the HDFS and want to transform it with Pig. In order to work effectively with it I need to specify the schema of the file when I load the data:

EMP = LOAD 'myfile' USING PigStorage() AS (first_name:chararray, last_name:chararray, deptno:int);

Now, I know that when storing a file using PigStorage, the schema can optionally be written out alongside it, but in order to get a file into Pig in the first place it seems like you need to specify a schema.

If I want to work with the same file in Hive, I need to create a table and specify the schema with that too:

CREATE EXTERNAL TABLE EMP ( first_name string
                          , last_name string
                          , empno int)
LOCATION 'myfile';

It seems to me like this is extremely fragile. If the file format changes even slightly then the schema must be manually updated in each application. I'm sure I'm being naive but wouldn't it make sense to store the schema with the data file? That way the data is portable between applications and the barrier to using another tool would be lower since you wouldn't need to re-code the schema for each application.

So the question is: Is there a way to specify the schema of a data file in Hadoop/HDFS or do I need to specify the schema for the data file in each application?

Martinmas answered 30/5, 2013 at 17:46 Comment(0)

It looks like you are looking for Apache Avro. With Avro your schema is embedded in your data, so you can read it without having to worry about schema issues and it makes schema evolution really easy.

The great thing about Avro is that it is completely integrated in Hadoop and you can use it with a lot of Hadoop sub-projects like Pig and Hive.

For example with Pig you could do:

EMP = LOAD 'myfile.avro' using AvroStorage();

I would advise looking at the documentation for AvroStorage for more details.

You can also work with Avro in Hive as described here. I have not used that personally, but it should work the same way.

Cispadane answered 30/5, 2013 at 17:55 Comment(4)

From the looks of the documentation it appears that the schema has to be external in order to work with an Avro file in Hive. I've cat'ed the avro files and I can see the schema in the header but for whatever reason Hive won't pick it up. Any suggestions? – Martinmas 30/5, 2013 at 23:32

It doesn't necessarily need to be external. You could, for example, set avro.schema.literal in the TBLPROPERTIES clause when you create your table, or you could store your schema as JSON in HDFS and have avro.schema.url point to that location. – Cispadane 30/5, 2013 at 23:49
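As a rough sketch of the avro.schema.url option described in the comment above, a Hive table over a directory of Avro files might be declared as follows. The table name emp and both paths are hypothetical placeholders, and the SerDe and input/output format class names are the ones shipped with Hive's built-in AvroSerDe, so they should be checked against the Hive version in use.

```sql
-- Sketch only: table name and paths are assumed for illustration.
-- No column list is given; Hive derives the columns from the Avro schema.
CREATE EXTERNAL TABLE emp
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/emp_avro'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/emp.avsc');
```

To use avro.schema.literal instead, the same TBLPROPERTIES clause would carry the schema's JSON text inline rather than a URL.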

I'm using Sqoop to pull the data from a database into Avro data files. I don't see any option in Sqoop to have it store the schema externally. Is there a way that I can extract the schema from the Avro files? – Martinmas 31/5, 2013 at 14:56

Yes, you could do it with avro cat --print-schema /path/to/avro/file – Cispadane 31/5, 2013 at 15:28

What you need is HCatalog, which its documentation describes as follows:

"Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

This includes:

Providing a shared schema and data type mechanism.

Providing a table abstraction so that users need not be concerned with where or how their data is stored.

Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive."

You can take a look at the "data flow example" in the docs, which covers exactly the scenario you are talking about.

Travelled answered 1/6, 2013 at 12:34 Comment(0)

Apache Zebra seems to be a tool that could provide a common schema definition across MapReduce, Pig, and Hive. It has its own schema store. A MapReduce job can use its built-in TableStore to write to HDFS.

Subrogate answered 1/6, 2013 at 4:33 Comment(0)
