How to load a tarball into Pig
I have log files in a tarball (access.logs.tar.gz) loaded into my Hadoop cluster. Is there a way to load it into Pig directly, without untarring it first?

Virtually answered 17/4, 2012 at 4:21

PigStorage will recognize that the file is compressed (by the .gz extension; this is actually implemented in TextInputFormat, which PigTextInputFormat extends), but after that you'll be dealing with a tar file. If you're able to handle the header lines between the files in the tar, then you can use PigStorage as is; otherwise you'll need to write your own extension of PigTextInputFormat to strip out the tar header lines between each file.
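To see what PigStorage is up against once only the gzip layer is removed, here's a small Python sketch. The file name and log lines are made up for illustration; the point is that after decompression, a 512-byte tar header block still precedes each member file's contents, so the first "line" a line-oriented reader sees is header bytes, not log data.

```python
import gzip
import io
import tarfile

# Build a tiny access.logs.tar.gz in memory (hypothetical contents).
log_data = b"127.0.0.1 - - [17/Apr/2012] GET /index.html 200\n" * 4

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="access.log.1")
    info.size = len(log_data)
    tar.addfile(info, io.BytesIO(log_data))

# Strip ONLY the gzip layer, as TextInputFormat does for a .gz file:
raw = gzip.decompress(buf.getvalue())

# The tar framing is still there: the archive starts with a 512-byte
# header block whose first field is the member file name.
print(raw[:12])         # the member name, not a log line
print(log_data in raw)  # the log lines themselves are intact inside
```

The log records are all present in the decompressed stream, but interleaved with tar header and padding blocks, which is exactly what a custom PigTextInputFormat would have to skip over.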

Northerner answered 17/4, 2012 at 10:37

@ChrisWhite's answer is technically correct and you should accept his answer instead of mine (IMO at least).

You need to get away from tar.gz files with Hadoop. Gzip files are not splittable, so if your gzip files are large you're going to see hotspotting in your mappers. For example, if you have a .tar.gz file that is 100 GB, you aren't going to be able to split the computation: the entire file has to go to a single mapper.

Suppose, on the other hand, that they are tiny. In that case, Pig will do a nice job of collecting them together and the splitting problem goes away. The downside is that you're now dealing with tons of tiny files on the NameNode. Also, since the files are tiny, it should be relatively cheap computationally to reform them into a more reasonable format.

So what format should you reformulate the files into? Good question!

  • Just concatenating them all into one large block-level compressed sequence file might be the most challenging, but the most rewarding in terms of performance.
  • Another is to ignore compression entirely and just explode those files out, or at least concatenate them (though you do see performance hits without compression).
  • Finally, you could blob the files into ~100 MB chunks and then gzip them.
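The last option can be sketched in a few lines of Python. This is a minimal illustration, assuming plain uncompressed log files as input; the function name and the ~100 MB chunk size are choices made here for the example, not anything Pig prescribes.

```python
import gzip

CHUNK_BYTES = 100 * 1024 * 1024  # ~100 MB per chunk, a common HDFS-friendly size

def rechunk(input_paths, out_prefix, chunk_bytes=CHUNK_BYTES):
    """Concatenate input files into gzip chunks of roughly chunk_bytes each,
    splitting only on line boundaries so no record straddles two chunks."""
    chunk_idx, written = 0, 0
    out = gzip.open(f"{out_prefix}-{chunk_idx:05d}.gz", "wb")
    for path in input_paths:
        with open(path, "rb") as src:
            for line in src:
                if written >= chunk_bytes:
                    out.close()
                    chunk_idx, written = chunk_idx + 1, 0
                    out = gzip.open(f"{out_prefix}-{chunk_idx:05d}.gz", "wb")
                out.write(line)
                written += len(line)
    out.close()
    return chunk_idx + 1  # number of chunks produced
```

Each resulting chunk decompresses independently, so Pig can pick them all up with a glob path in a single LOAD and assign roughly one mapper per chunk.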

I think it would be completely reasonable to write some sort of tarball loader into piggybank, but I personally would just rather lay the data out differently.

Posthaste answered 19/4, 2012 at 2:6
