Map Reduce output to CSV or do I need Key Values?

My map function produces a

Key\tValue

pair, where Value = List(value1, value2, value3).

My reduce function then produces:

Key\tCSV-Line

Ex.


2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s,

2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s


Ex. RawData: 232342|@3423@|34343|sfasdfasdF|433443|Sfasfdas|324343 x 1000

Anyway, I want to eliminate the keys at the beginning of those lines so my client can do a straight import into MySQL. I have about 50 data files. My question is: after the map phase runs and the reducer starts, does the reducer need to print the key along with the value, or can I just print the value?


More information:

Here, this code might shed some better light on the situation:

http://pastebin.ca/2410217

This is roughly what I plan to do.
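
In outline, it's something like this (a simplified sketch, not the exact pastebin code; the field handling is made up):

#!/usr/bin/env python
# reduce.py -- simplified sketch. Hadoop streaming feeds the reducer
# "key\tvalue" lines on stdin, already sorted by key. Collect the
# values for each key and print only the CSV line, with no key in front.
import sys

current_key = None
fields = []

for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('\t')
    if current_key is not None and key != current_key:
        # New key: emit the finished CSV line for the previous key.
        print(','.join(fields))
        fields = []
    current_key = key
    fields.append(value)

if current_key is not None:
    print(','.join(fields))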

Strive asked 26/6, 2013 at 23:38 Comment(2)
Could you please rephrase your question? Do you want to emit only the values and not the keys? I'm sorry, I didn't quite get it.Schellens
Yes, that's exactly what I want, haha; sorry for being so unclear. I just want to make sure that when I use multiple servers on multiple data files, emitting only the values and not the keys in reduce.py won't break the whole operation.Strive

Your reducer can emit a line without a \t, or, in your case, just what you're calling the value. Unfortunately, Hadoop streaming will interpret this as a key with a null value and automatically append a delimiter (\t by default) to the end of each line. You can change what this delimiter is, but when I played around with this I could not get it to stop appending one. I don't remember the exact details, but based on this (Hadoop: key and value are tab separated in the output file. how to do it semicolon-separated?) I think the property is mapred.textoutputformat.separator. My solution was to strip the \t at the end of each line as I pulled the file back:

hadoop fs -cat hadoopfile | perl -pe 's/\t$//' > destfile
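
For reference, passing that property to a streaming job would look something like this (the jar path and script names are illustrative, the -D option has to come before the streaming options, and as I said, this changes the delimiter rather than removing it):

hadoop jar /path/to/hadoop-streaming.jar \
    -D mapred.textoutputformat.separator=';' \
    -input /data/raw \
    -output /data/csv \
    -mapper map.py -reducer reduce.py \
    -file map.py -file reduce.py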
Mckinnie answered 27/6, 2013 at 21:55 Comment(3)
My output looks correct as a CSV file; however, you are saying that if my reducer outputs a value without a key, it will just make a null key and won't work anyway? I pasted my code; maybe you can take a look and see if it will work: pastebin.ca/2410217Strive
See, I don't output any key, just the values (comma-separated), but you are saying that this will just cause EMR to say it's a null key and not work?Strive
EMR is Elastic MapReduce from Amazon? I haven't used that. We're running a "vanilla" Hadoop cluster that we submit jobs to and pull data down from. In that environment, if your reducer outputs a row that doesn't contain a delimiter, Hadoop adds a delimiter to the end of the row. I think it is interpreting the row as a key with a null value. As far as running on EMR goes, I can't guess. Can you test it with a small dataset? On a side note, it looks like you're adding an extra ',' to the end of each row.Mckinnie

If you do not want to emit the key, set it to NullWritable in your code. For example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenCounterReducer extends
        Reducer<Text, IntWritable, NullWritable, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Write NullWritable as the key so only the value is emitted.
        context.write(NullWritable.get(), new IntWritable(sum));
        // context.write(key, new IntWritable(sum));
    }
}

Let me know if this is not what you need; I'll update the answer accordingly.

Schellens answered 27/6, 2013 at 1:43 Comment(2)
Thanks for the response. I believe that is using C# or Java by the looks of it; I am currently using Python. I will update my question with some code to make it more obvious :DStrive
Added some code to maybe help me get this resolved :) pastebin.ca/2410217 Maybe this explains a little better what I am doing, and I want to know if it will work, hahaStrive