Apache Pig: Load a file that shows fine using hadoop fs -text
I have files named part-r-000[0-9][0-9] that contain tab-separated fields. I can view them using hadoop fs -text part-r-00000, but I can't get them loaded using Pig.

What I've tried:

x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;

but that only gives me garbage. How can I view the file using pig?

What might be relevant is that my HDFS is still on CDH-2 at the moment. Furthermore, if I download the file locally and run file part-r-00000, it says part-r-00000: data; I don't know how to unzip it locally.

Goodden answered 5/9, 2012 at 17:34 Comment(4)
I believe your first load uses PigStorage, but maybe you can double-check by being explicit: x = LOAD 'part-r-00000' USING PigStorage('\t');. When you download the file locally and view it (e.g. with tail), is it garbage/binary? Can you give an example of the code that generated this data?Coalesce
Using PigStorage explicitly gives the same result. Downloaded locally (using -get or -copyToLocal), the file is not readable, i.e. binary/garbage (in less or tail). I'll try to find the code that created these files and report back.Goodden
It looks like the file has been stored as a sequence file. I've been able to extract lines from it using a user-defined load function. Is there a simpler way than using a UDF?Goodden
I updated my answer with sample code related to sequence files. Hope that helps :)Coalesce

According to the HDFS documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of those formats.

If the file was compressed, Hadoop would normally add the extension when writing to HDFS, but if it is missing, you could test by unzipping/gunzipping/bunzip2ing/etc. locally. Pig should do this decompression automatically, but it may require the file extension to be present (e.g. part-r-00000.zip) -- more info.
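
A quick way to narrow down the container format is to inspect the file's first bytes (a shell sketch; the magic numbers are standard, the file name is just your part file):

# Inspect the first bytes of the file to identify its format:
#   S E Q            -> Hadoop SequenceFile
#   P K              -> zip archive
#   037 213 (octal)  -> gzip (0x1f 0x8b)
hadoop fs -cat part-r-00000 | head -c 16 | od -c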

I'm not too sure about TextRecordInputStream; it sounds like it would just be Pig's default method, but I could be wrong. A quick Google search didn't turn up any mention of LOADing this kind of data via Pig.

Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();


-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-000{00..99}' -- not sure if Pig likes the {00..99} syntax, but worth a shot
    USING SequenceFileLoader AS (key:long, val:long, etc.);
Coalesce answered 6/9, 2012 at 2:20 Comment(1)
{00..99} didn't work, so I'm simply using * instead. Afterwards the line can be read using B = FOREACH A GENERATE flatten(STRSPLIT(val, '\t')) AS (etc.);, since SequenceFileLoader returns only two columns.Goodden

If you want to manipulate (read/write) sequence files with Pig, you can also give Twitter's Elephant-Bird a try.

You can find examples of how to read and write them here.
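
For instance, reading a sequence file whose key and value are both Text looks roughly like this (a sketch following Elephant-Bird's documented usage; the jar path is a placeholder, and the converters must match your actual key/value types):

REGISTER '/path_to_jar/elephant-bird.jar'; -- placeholder path

-- The '-c' arguments name the WritableConverter for the key and value slots.
A = LOAD 'mydir/part-r-00000'
    USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
        '-c com.twitter.elephantbird.pig.util.TextConverter',
        '-c com.twitter.elephantbird.pig.util.TextConverter')
    AS (key: chararray, val: chararray);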

If you use custom Writables in your sequence file, you can implement a custom converter by extending AbstractWritableConverter.

Note that Elephant-Bird requires Thrift to be installed on your machine. Before building it, make sure it is using the Thrift version you have, and provide the correct path to the Thrift executable in its pom.xml:

<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <version>0.1.10</version>
  <configuration>
    <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
  </configuration>
</plugin>
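
Since Elephant-Bird ships a pom.xml, a standard Maven build should then produce the jar (assuming Maven is installed; the exact goals may differ by version):

mvn package -DskipTests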
Pollerd answered 6/9, 2012 at 14:6 Comment(0)
