Hadoop: key and value are tab-separated in the output file. How do I make them semicolon-separated?

Asked 14/6, 2012 at 11:6 Answered 8/9, 2013 at 17:38

The title already explains my question. I would like to change

key (tab space) value

into

key;value

in all output files that the reducers generate from the mappers' output.

I could not find good documentation on this using Google. Can anyone please share a code snippet showing how to achieve this?

Stieglitz answered 14/6, 2012 at 11:6 Comment(2)

What version (0.20.2, 0.20.20x, 1.0.x, 2.0.0?) and distro (Apache, Cloudera?) of Hadoop are you using? – Zea 15/6, 2012 at 13:38

What are you using as your Output format class: o.a.h.mapred.TextOutputFormat, or o.a.h.mapreduce.lib.output.TextOutputFormat? – Zea 15/6, 2012 at 13:39

Set the configuration property mapred.textoutputformat.separator to ";"
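
To make the mechanism concrete, here is a minimal, plain-Java sketch of what setting this property changes. It is not Hadoop code: java.util.Properties stands in for Hadoop's Configuration, and the class SeparatorSketch and method formatLine are hypothetical names made up for this illustration. The only thing taken from the answer is the property name, and the assumption is that the text output format falls back to a tab when the property is unset.

```java
import java.util.Properties;

// Simplified model only: Properties stands in for Hadoop's Configuration.
final class SeparatorSketch {
    // Property name from the answer above (pre-YARN releases).
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    // Builds one output line: look up the separator, fall back to a tab.
    static String formatLine(Properties conf, String key, String value) {
        String separator = conf.getProperty(SEPARATOR_KEY, "\t");
        return key + separator + value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property not set: key and value are joined by a tab.
        System.out.println(formatLine(conf, "apple", "3"));

        conf.setProperty(SEPARATOR_KEY, ";");
        // Property set to ";": prints "apple;3".
        System.out.println(formatLine(conf, "apple", "3"));
    }
}
```

In a real job the same key/value pair would be set on the JobConf or on job.getConfiguration(), as discussed in the comments below.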

Zea answered 14/6, 2012 at 12:5 Comment(13)

It should be mapreduce.output.textoutputformat.separator if you are using the new API – Aldine 14/6, 2012 at 12:23

1.0.0 still shows mapred.textoutputformat.separator in its source for o.a.h.mapreduce.lib.output.TextOutputFormat - svn.apache.org/viewvc/hadoop/common/tags/release-1.0.0/src/…, line 115 – Zea 14/6, 2012 at 12:39

In MR2 (YARN), this has changed to mapreduce.textoutputformat.separator – Zea 14/6, 2012 at 13:14

I am using Hadoop version 1.0.2 and can't find the specified path. How can I import it into my Java code? – Stieglitz 14/6, 2012 at 14:40

Specified path for what, import what? – Zea 14/6, 2012 at 15:54

I am expecting something like org.apache.hadoop.mapreduce.textoutputformat.separator; or is this set through an XML configuration file? – Stieglitz 15/6, 2012 at 13:15

It's a configuration property, set it on your JobConf or Job object (depending on whether you're using the old mapred or new mapreduce API) – Zea 15/6, 2012 at 13:31

@Chris: please look at this; the new API does not seem to support the property. lucene.472066.n3.nabble.com/… – Stieglitz 15/6, 2012 at 13:31

Let's be clear about what we mean by new vs old API. If you're using anything prior to Hadoop 2 (YARN), then it's mapred.textoutputformat.separator, irrespective of whether you're using o.a.h.mapred.TextOutputFormat or o.a.h.mapreduce.lib.output.TextOutputFormat; with 2+ (YARN) it's mapreduce.textoutputformat.separator, irrespective of which version of TextOutputFormat you're using. For Cloudera-based releases, prior to v4 it's mapred.textoutputformat.separator; for v4+ it's mapreduce.textoutputformat.separator – Zea 15/6, 2012 at 13:37

Is this the correct way of specifying it? Configuration conf = new Configuration(); conf.set("mapred.textoutputformat.separator",";"); It did not work. – Stieglitz 15/6, 2012 at 13:39

Give me some more context around this line of code. Are you using JobConf or Job? Change your code to either jobConf.set(...) or job.getConfiguration().set(...) depending on which API you're using – Zea 15/6, 2012 at 13:42

Configuration conf = new Configuration(); conf.set("mapred.textoutputformat.separator",";"); worked once I removed the *s around the parameter. Thanks! – Stieglitz 15/6, 2012 at 13:45

In Cloudera CDH4.3.0 (mr1), it is mapred.textoutputformat.separator (I just decompiled the JAR to check, since no sources JAR is provided). – Bomarc 10/2, 2014 at 11:33

For lack of better documentation, here is what I have collected:

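    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public static void setTextOutputFormatSeparator(final Job job, final String separator) {
        // Use the job's own Configuration: Job works on a copy of the one it was created with
        final Configuration conf = job.getConfiguration();

        // Set every candidate property name; names that a given release does not read are ignored
        conf.set("mapred.textoutputformat.separator", separator);     // Prior to Hadoop 2 (YARN)
        conf.set("mapreduce.textoutputformat.separator", separator);  // Hadoop v2+ (YARN)
        conf.set("mapreduce.output.textoutputformat.separator", separator);
        conf.set("mapreduce.output.key.field.separator", separator);
        conf.set("mapred.textoutputformat.separatorText", separator); // ?
    }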

Ferricyanide answered 8/9, 2013 at 17:38 Comment(0)

You can use the "KEY_VALUE_SEPERATOR" property of "KeyValueLineRecordReader" to specify a separator of your choice.

Aldine answered 14/6, 2012 at 12:10 Comment(1)

This property can be set when reading the data back in, but isn't used for output – Zea 14/6, 2012 at 12:20
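
To illustrate the reading side that this comment refers to, here is a small plain-Java sketch of how a line is split into key and value at the first occurrence of a separator character, which is the documented behaviour of KeyValueLineRecordReader. The class KeyValueSplitSketch and the method splitKeyValue are hypothetical names for this illustration and are not part of Hadoop.

```java
// Simplified model only: mimics splitting an input line at the first separator.
final class KeyValueSplitSketch {
    // Returns {key, value}. If the separator is absent, the whole line
    // becomes the key and the value is empty.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("apple;3", ';');
        // prints "key=apple value=3"
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

This only affects how a later job parses semicolon-separated files as input; it has no effect on how the reducers write their output.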
