hadoop converting \r\n to \n and breaking ARC format

I am trying to parse data from commoncrawl.org using Hadoop Streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper which uses a streaming ARC file reader. When I invoke my code myself, like

cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb

It works as expected.

It seems that Hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to the mapper; however, while doing so it converts the \r\n line breaks in the stream to \n. Since ARC relies on a record length in the header line, this change breaks the parser (because the data length has changed).
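
To make the failure mode concrete, here is a minimal sketch of a length-driven ARC reader (illustrative only; it assumes the ARC v1 header layout and is not the actual reader library the mapper uses). The last field of the header line is the body length in bytes, and the reader must consume exactly that many bytes:

# Assumed ARC v1 header layout: URL IP-address archive-date content-type length
def each_arc_record(io)
  while (header = io.gets)
    header = header.chomp
    next if header.empty?               # blank separator lines between records
    length = header.split(' ').last.to_i
    body = io.read(length)              # read *exactly* `length` bytes of payload
    yield header, body
  end
end

If Hadoop has already collapsed \r\n to \n inside the payload, the body is now shorter than the declared length, so the read overruns into the following header and every subsequent record is misparsed.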

To double check, I changed my mapper to expect uncompressed data, and did:

cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb

And it works.

I don't mind Hadoop automatically decompressing (although I can quite happily deal with streaming .gz files), but if it does, I need it to decompress in 'binary' mode without doing any line-break conversion or similar. I believe that the default behaviour is to feed each decompressed file to one mapper, which is perfect.
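
If Hadoop did pass the raw .gz bytes through untouched, handling the stream in the mapper itself would be simple enough. A rough sketch under that assumption:

require 'zlib'

# Rough sketch, assuming the compressed bytes reach the mapper on stdin untouched.
STDIN.binmode                       # no newline translation; harmless on Unix
gz = Zlib::GzipReader.new(STDIN)    # streaming, byte-for-byte decompression
while (chunk = gz.read(64 * 1024))
  # feed `chunk` to the ARC parser here; \r\n sequences survive intact
end
gz.close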

How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.

All of this will eventually run on AWS Elastic MapReduce.

Chaste answered 25/1, 2012 at 8:32 Comment(1)
Any updates on this front? Is this related to github.com/hayesgm/common_crawl_types? – Ido

Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):

Around line 106, the input from TextInputFormat is passed to this mapper (by which stage the \r\n has already been stripped), and PipeMapper writes it out to stdout with just a \n.

A suggestion would be to check whether this 'feature' still exists in your version of PipeMapper.java and amend the source as required (perhaps allowing the behaviour to be set via a configuration property).

Pelisse answered 28/3, 2012 at 2:19 Comment(3)
Awesome, man! You'll get many an upvote for being so helpful here. I suppose maybe I'm going about this wrong by using TextInputFormat in the first place. If I'm concerned about preserving the data down to the byte level, I should use a SequenceFileInput. Does that make sense? Again, this is a huge help, and it would be nice if Hadoop preserved the line endings of the original file. – Ido
@ghayes SequenceFileInputFormat isn't going to help unless the input files are already in that format. You'd probably need to write your own input format. – Pelisse
Yeah... at the end of the day, I found that feeding in a pseudo-binary file line by line was inappropriate. I would suggest using a different InputFormat (e.g. an ArcFileFormat), but that would require diving knee-deep into Java for a Streaming Ruby MapReduce job. Instead I took the standard coward's way out and made the input a list of file paths (since gzipped files aren't splittable anyway) and processed them myself, roughly as sketched below. The default MapReduce types aren't, per se, versatile. :-\ – Ido
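
For reference, a rough sketch of that 'list of files' approach, under a couple of assumptions: the job input is a plain-text manifest with one ARC path per line (so Hadoop's line handling can't damage anything), and each file is fetched with hadoop fs -cat. Both the fetch command and the example path are illustrative, not part of the original setup.

#!/usr/bin/env ruby
# Sketch of the "list of files" workaround: stdin is a manifest of paths, one per line.
require 'zlib'
require 'open3'

STDIN.each_line do |line|
  path = line.strip    # e.g. "s3://example-bucket/1262876244253_18.arc.gz" (illustrative)
  next if path.empty?

  # Stream the compressed file and decompress it in-process, byte for byte,
  # so the \r\n sequences the ARC records rely on are preserved.
  Open3.popen2('hadoop', 'fs', '-cat', path) do |_stdin, stdout, _wait_thr|
    gz = Zlib::GzipReader.new(stdout)
    # hand `gz` to the ARC record reader here and emit key\tvalue lines
    gz.close
  end
end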
