Hadoop: key and value are tab-separated in the output file. How can I make them semicolon-separated?

3

13

I think the title already explains my question. I would like to change

key (tab space) value

into

key;value

in all output files that the reducers generate from the mappers' output.

I could not find good documentation on this using Google. Can anyone please give a code snippet showing how to achieve this?

Stieglitz answered 14/6, 2012 at 11:6 Comment(2)
What version (0.20.2, 0.20.20x, 1.0.x, 2.0.0?) and distro (Apache, Cloudera?) of Hadoop are you using? - Zea
What are you using as your output format class: o.a.h.mapred.TextOutputFormat or o.a.h.mapreduce.lib.output.TextOutputFormat? - Zea
19

Set the configuration property mapred.textoutputformat.separator to ";"
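
A minimal sketch of what this setting does, written in plain Java so that it runs without Hadoop on the classpath. The `Map` is a hypothetical stand-in for Hadoop's `Configuration`, and `setSeparator` and `formatLine` are illustrative names, not Hadoop APIs. `formatLine` only models the key, separator, value layout of a text output line, with tab as the default separator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

final class SeparatorSketch {
    // Property name from the answer above (Hadoop 1.x and earlier).
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    /** Hands the separator to any two-argument setter, e.g. map::put or conf::set. */
    static void setSeparator(BiConsumer<String, String> setter, String separator) {
        setter.accept(SEPARATOR_KEY, separator);
    }

    /** Models one output line: key, then the configured separator, then value. */
    static String formatLine(Map<String, String> conf, String key, String value) {
        String separator = conf.getOrDefault(SEPARATOR_KEY, "\t"); // tab when unset
        return key + separator + value;
    }

    public static void main(String[] args) {
        // Stand-in for org.apache.hadoop.conf.Configuration (assumption for this sketch).
        Map<String, String> conf = new LinkedHashMap<>();
        System.out.println(formatLine(conf, "key", "value")); // tab-separated by default
        setSeparator(conf::put, ";");
        System.out.println(formatLine(conf, "key", "value")); // prints key;value
    }
}
```

In a real driver the same value would be set on the job's configuration before the job is submitted, for example `job.getConfiguration().set("mapred.textoutputformat.separator", ";")`. As far as I know, Hadoop 2.x treats this name as a deprecated alias of `mapreduce.output.textoutputformat.separator`, so it is worth checking the `TextOutputFormat` source of your version if the setting seems to be ignored.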

Zea answered 14/6, 2012 at 12:5 Comment(13)
It should be mapreduce.output.textoutputformat.separator if you are using the new API. - Aldine
1.0.0 still shows mapred.textoutputformat.separator in its source for o.a.h.mapreduce.lib.output.TextOutputFormat - svn.apache.org/viewvc/hadoop/common/tags/release-1.0.0/src/…, line 115 - Zea
In MR2 (YARN), this has changed to mapreduce.textoutputformat.separator - Zea
I am using Hadoop version 1.0.2. I can't find the specified path. How can I import it into my Java code? - Stieglitz
Specified path for what? Import what? - Zea
I am expecting something like org.apache.hadoop.mapreduce.textoutputformat.separator; or is this set through an XML configuration file? - Stieglitz
It's a configuration property; set it on your JobConf or Job object (depending on whether you're using the old mapred or new mapreduce API). - Zea
@Chris: please look at this; the new API does not seem to support the property. lucene.472066.n3.nabble.com/… - Stieglitz
Let's be clear about what we mean by new vs. old API. If you're using anything prior to Hadoop 2 (YARN), then it's mapred.textoutputformat.separator irrespective of whether you're using o.a.h.mapred.TextOutputFormat or o.a.h.mapreduce.lib.output.TextOutputFormat; otherwise, with 2+ (YARN), it's mapreduce.textoutputformat.separator irrespective of which version of TextOutputFormat you're using. For Cloudera-based releases, prior to v4 it's mapred.textoutputformat.separator; for v4+ it's mapreduce.textoutputformat.separator. - Zea
Configuration conf = new Configuration(); conf.set("mapred.textoutputformat.separator",";"); Is this the correct way of specifying it? Configuration conf = new Configuration(); conf.set("mapred.textoutputformat.separator",";"); did not work. - Stieglitz
Give me some more context around this line of code. Are you using JobConf or Job? Change your code to either jobConf.set(...) or job.getConfiguration().set(...) depending on which API you're using. - Zea
Configuration conf = new Configuration(); conf.set("mapred.textoutputformat.separator",";"); worked! Thx, without the *s around the parameter. Thx! - Stieglitz
In Cloudera CDH4.3.0 (mr1), it is mapred.textoutputformat.separator (I just decompiled the JAR to check, since no sources JAR is provided). - Bomarc
16

For lack of better documentation, here's what I've collected:

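    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Place inside your driver class. Sets every known variant of the property;
    // names that a given Hadoop version does not read are stored but have no effect.
    public static void setTextOutputFormatSeparator(final Job job, final String separator) {
        final Configuration conf = job.getConfiguration(); // the job's own config, not a stale copy

        conf.set("mapred.textoutputformat.separator", separator); // Prior to Hadoop 2 (YARN)
        conf.set("mapreduce.textoutputformat.separator", separator); // Hadoop v2+ (YARN)
        conf.set("mapreduce.output.textoutputformat.separator", separator);
        conf.set("mapreduce.output.key.field.separator", separator);
        conf.set("mapred.textoutputformat.separatorText", separator); // ?
    }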
Ferricyanide answered 8/9, 2013 at 17:38 Comment(0)
1

You can use the KEY_VALUE_SEPERATOR property of KeyValueLineRecordReader to specify a separator of your choice.

Aldine answered 14/6, 2012 at 12:10 Comment(1)
This property can be set when reading the data back in, but isn't used for output. - Zea
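
As the comment above points out, this setting matters on the read side, when semicolon-separated output is fed into a later job. The sketch below models in plain Java how a key/value line reader might split each line at the first separator. `splitKeyValue` is a hypothetical helper, not part of Hadoop, and the fallback for lines without a separator (whole line as key, empty value) reflects my understanding of KeyValueLineRecordReader rather than anything stated in this thread.

```java
final class KeyValueSplitSketch {
    /**
     * Splits a line at the first occurrence of the separator.
     * Returns a two-element array {key, value}. If the separator is absent,
     * the whole line becomes the key and the value is the empty string.
     */
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("key;value", ';');
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

Because only the first separator splits the line, values that contain a semicolon survive a round trip, while keys that contain one would be cut short. That is worth keeping in mind when picking a separator for the output.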
