How can I add a header row to files created from Pig (Hadoop)?

I'm writing a Pig Latin script similar to the following:

A = load 'data' using PigStorage('\t');
store A into 'my_data' using PigStorage();

This outputs

(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

I'd like to add a header row as the first line of each file stored in HDFS:

(Name, Age, GPA)
(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

Any ideas?

Ellisellison answered 7/1, 2013 at 21:24 Comment(0)
M
11

This doesn't really make sense for Pig. Each line is a separate record of data, so unless there really is a person named Name, with an age of Age and a GPA of GPA, having such a line is wrong. Also, Pig makes no guarantees about the order in which records will be output (unless you use ORDER BY), so your header row might show up anywhere.

What you are asking for is a way to keep your schema around after Pig is done with its work, so that you don't have to remember what it is or look it up somewhere. Starting with Pig 0.10, this has been possible with PigStorage: when it is given the '-schema' option, it stores the schema of the relation as a JSON file, .pig_schema, in the same directory as the output. See this page for more detailed information about what that is and how to use it.

Muncey answered 7/1, 2013 at 22:56 Comment(2)
You are just justifying the lack of a feature by explaining why the developers didn't add it. If the end consumer really wants it, would you still argue that it doesn't make sense? - Contuse
It is possible by using a different storage function. See my answer for a link: #14204775 - Datha
D
15

You can use CSVExcelStorage (from Piggybank) as the storage function, which allows you to do precisely what you want:

STORE output INTO '/outputfolder/' USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');

Using the "WRITE_OUTPUT_HEADER" option will write the header to every output file, which satisfies your use case.
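
Since the header is written to every part file, concatenating the parts naively would repeat it in the middle of the merged data. Below is a minimal shell sketch of merging while keeping only the first header. It assumes the part files have already been copied to a local directory; the directory name, file names and sample rows are hypothetical and only simulate Pig output.

```shell
# Simulate two local part files, each starting with the same header line.
# In real use these would come from something like: hadoop fs -get /outputfolder/ .
parts_dir=$(mktemp -d)
printf 'Name\tAge\tGPA\nBob\t10\t4.0\n' > "$parts_dir/part-m-00000"
printf 'Name\tAge\tGPA\nJim\t11\t3.25\nPaul\t9\t2.75\n' > "$parts_dir/part-m-00001"

# Keep line 1 of the first file only; skip line 1 of every later file.
# FNR is the line number within the current file, NR the overall line number.
awk 'FNR == 1 && NR != 1 { next } { print }' "$parts_dir"/part-* > "$parts_dir/merged.tsv"

cat "$parts_dir/merged.tsv"
```

The sketch assumes every part file is non-empty and begins with the header; an empty first part file would need extra handling.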

Datha answered 1/7, 2015 at 7:55 Comment(0)
C
1

The answer is no: you cannot do what you really want to do.

As @Winni suggested, there are workarounds such as keeping a schema file around, but that is a hell of a hack.

Wearing the consumer hat (I am also a developer), I have to say Pig lacks this feature. We don't care how much sense it makes for Pig: when it writes what is essentially a CSV file through PigStorage, it should also offer a way to include a header row, so that forgetful users can make sense of the data.

When I have a row with around ten different datetimes, it is nearly impossible for me to understand the data until I manually add the header row.

Contuse answered 21/9, 2013 at 2:14 Comment(0)
C
0

I think your best bet is to DESCRIBE the relation you're going to output, using a test set at the Grunt shell, then copy and paste the field names into e.g. a bash command that prepends a header line to your file after you -get the output from HDFS and cat it into a flat file. So something like:

sed -i '1s/^/(Name, Age, GPA)\n/' filename.tsv

(Note that as written this edits the file in place, so direct the output to a new file instead if you're new to shell commands. The \n in the replacement relies on GNU sed.)
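
Here is a minimal sketch of the same idea that writes to a new file instead of editing in place. It assumes the part files have already been merged into one local file. The file names and sample rows are placeholders, and the header is tab-separated to match PigStorage's default delimiter rather than the parenthesized form above.

```shell
# Stand-in for the merged Pig output; in real use this file might come from
# something like: hadoop fs -getmerge my_data merged.tsv
work_dir=$(mktemp -d)
printf 'Bob\t10\t4.0\nJim\t11\t3.25\nPaul\t9\t2.75\n' > "$work_dir/merged.tsv"

# Write the header first, then the data, into a new file.
{
  printf 'Name\tAge\tGPA\n'
  cat "$work_dir/merged.tsv"
} > "$work_dir/with_header.tsv"

cat "$work_dir/with_header.tsv"
```

Writing to a separate file leaves the original data untouched, so a wrong header can be fixed by rerunning the command.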

Caa answered 9/10, 2013 at 18:31 Comment(1)
I hope it's obvious that this solution acts on a concatenation of the part* files fetched from HDFS, not on the files stored in HDFS. As previous posters noted, it makes more sense to do this with output taken out of HDFS. - Caa
