How to Get Pig to Work with lzo Files?

So, I've seen a couple of tutorials for this online, but each one says to do something different. None of them specifies whether you're setting things up on a remote cluster itself, or setting up a local machine to interact with a remote cluster, and so on.

That said, my goal is just to get Pig on my local computer (a Mac) to work with lzo-compressed files that exist on a Hadoop cluster that's already been set up to work with lzo files. I already have Hadoop installed locally and can get files from the cluster with hadoop fs -[command].

I also already have pig installed locally and communicating with the hadoop cluster when I run scripts or when I just run stuff through grunt. I can load and play around with non-lzo files just fine. My problem is only in terms of figuring out a way to load lzo files. Maybe I can just process them through the cluster's instance of ElephantBird? I have no idea, and have only found minimal information online.

So, any sort of short tutorial or answer for this would be awesome, and would hopefully help more people than just me.

I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!

NOTE: This is written with a Mac in mind. The steps will be almost identical on other operating systems, and this should give you what you need to know to configure things on Windows or Linux, but you will need to extrapolate a bit (for example, swap the Mac-specific folders for the equivalents on your OS).

Hooking PIG up to be able to work with LZOs

This was by far the most annoying and time-consuming part for me, not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:

Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.
Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64-bit machine.
Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.
Copy the Java jar to $HADOOP_HOME/lib and $PIG_HOME/lib.
Then configure Hadoop and Pig so that the Java library path points to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml by setting JAVA_LIBRARY_PATH through the mapred.child.env property:
```
<property>
    <name>mapred.child.env</name>
    <value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
</property>
```
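For reference, the two copy steps above can be sketched as a small shell helper. This is only a sketch: install_hadoop_lzo is a made-up name, and both arguments (where the jar and the native libraries ended up after compiling) are assumptions you need to fill in from your own build. It expects HADOOP_HOME and PIG_HOME to be set already.

```shell
# Hypothetical helper (not part of hadoop-lzo): copy the build outputs into
# the locations named in the steps above.
#   $1 = directory containing hadoop-lzo*.jar
#   $2 = directory containing the compiled native libraries
# Both arguments are assumptions; check where your build actually put them.
install_hadoop_lzo() {
    jar_dir=$1
    native_dir=$2

    if [ -z "${HADOOP_HOME:-}" ] || [ -z "${PIG_HOME:-}" ]; then
        echo "HADOOP_HOME and PIG_HOME must both be set" >&2
        return 1
    fi

    native_dest="$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64"
    mkdir -p "$native_dest" "$HADOOP_HOME/lib" "$PIG_HOME/lib" || return 1

    # Native libraries go into the platform-specific directory.
    cp -R "$native_dir"/. "$native_dest"/ || return 1

    # The jar goes into both the Hadoop and the Pig lib directories.
    cp "$jar_dir"/hadoop-lzo*.jar "$HADOOP_HOME/lib/" || return 1
    cp "$jar_dir"/hadoop-lzo*.jar "$PIG_HOME/lib/" || return 1
}
```

You would call it with the jar directory first and the native library directory second. On another OS, change the Mac_OS_X-x86_64-64 folder name to match your platform.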
Now try out grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed up something in mapred-site.xml and you should double check it.
Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).
Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.
Run the command ant in the elephant-bird folder in order to create a jar.
For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.
Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.
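If it helps, here is a rough sketch of that last smoke test as a shell helper that writes a tiny Pig script: register the jars, load an lzo file, limit it, and dump it. The helper name write_lzo_smoke_test and its arguments are made up for illustration. The loader class com.twitter.elephantbird.pig.load.LzoTextLoader is an assumption too, so check which loaders exist in the elephant-bird version you actually built.

```shell
# Hypothetical helper: generate a Pig script that registers the jars,
# loads one lzo file, and dumps the first 10 records.
#   $1 = directory holding hadoop-lzo-*.jar and elephant-bird-*.jar
#   $2 = path of an lzo file on the cluster
#   $3 = where to write the generated Pig script
write_lzo_smoke_test() {
    jar_dir=$1
    input=$2
    out=$3

    : > "$out"   # start with an empty script

    # One REGISTER line per jar that is actually present.
    for jar in "$jar_dir"/hadoop-lzo-*.jar "$jar_dir"/elephant-bird-*.jar; do
        [ -f "$jar" ] || continue
        printf 'REGISTER %s;\n' "$jar" >> "$out"
    done

    # Loader class name is an assumption; adjust it to your elephant-bird build.
    cat >> "$out" <<EOF
raw = LOAD '$input' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
few = LIMIT raw 10;
DUMP few;
EOF
}
```

After generating the script you can run it with pig followed by the script path, or paste the same lines into grunt one at a time.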
