hadoop MultipleInputs fails with ClassCastException
My Hadoop version is 1.0.3. When I use MultipleInputs, I get this error:

java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit
at org.myorg.textimage$ImageMapper.setup(textimage.java:80)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:55)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

I tested with a single input path and had no problem. The error only occurs when I use:

MultipleInputs.addInputPath(job, TextInputpath, TextInputFormat.class,
        TextMapper.class);
MultipleInputs.addInputPath(job, ImageInputpath,
        WholeFileInputFormat.class, ImageMapper.class);

I googled and found https://issues.apache.org/jira/browse/MAPREDUCE-1178, which says version 0.21 had this bug. But I am using 1.0.3, so has this bug come back? Does anyone have the same problem, or can anyone tell me how to fix it? Thanks.

Here is the setup code of the image mapper; the 4th line is where the error occurs:

protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Path path = ((FileSplit) split).getPath(); // ClassCastException thrown here
    try {
        pa = new Text(path.toString()); // pa is a Text field on the mapper
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Careen answered 21/6, 2012 at 0:44 Comment(2)
Can you post the code for your ImageMapper class - it looks like you are trying to cast the input split to a FileSplit in your setup method.Dishearten
I have a similar issue... Does any solution exist?Weimaraner

Following up on my comment, the Javadocs for TaggedInputSplit confirm that you are probably wrongly casting the input split to a FileSplit:

/**
 * An {@link InputSplit} that tags another InputSplit with extra data for use
 * by {@link DelegatingInputFormat}s and {@link DelegatingMapper}s.
 */
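
In other words, when you use MultipleInputs, each real split is wrapped in this tagging class so that DelegatingMapper knows which mapper and input format to route it to. Abbreviating from the Hadoop 1.x source (a sketch, not the complete class):

class TaggedInputSplit extends InputSplit implements Configurable, Writable {

    private InputSplit inputSplit; // the real split, e.g. a FileSplit
    private Class<? extends InputFormat> inputFormatClass;
    private Class<? extends Mapper> mapperClass;

    // returns the wrapped split; this is what the reflection below has to call
    public InputSplit getInputSplit() {
        return inputSplit;
    }

    // ... Writable/Configurable plumbing omitted ...
}

Since context.getInputSplit() hands your mapper this wrapper rather than the FileSplit inside it, the direct cast in your setup method fails.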

My guess is your setup method looks something like this:

@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
}

Unfortunately TaggedInputSplit is not publicly visible, so you can't easily do an instanceof-style check followed by a cast and a call to TaggedInputSplit.getInputSplit() to get the actual underlying FileSplit. So you'll either need to update the source yourself and re-compile and deploy, post a JIRA ticket asking for this to be fixed in a future version (if it hasn't already been actioned in 2.x), or perform some nasty reflection hackery to get to the underlying InputSplit.

This is completely untested:

@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Class<? extends InputSplit> splitClass = split.getClass();

    FileSplit fileSplit = null;
    if (splitClass.equals(FileSplit.class)) {
        fileSplit = (FileSplit) split;
    } else if (splitClass.getName().equals(
            "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
        // begin reflection hackery...

        try {
            Method getInputSplitMethod = splitClass
                    .getDeclaredMethod("getInputSplit");
            getInputSplitMethod.setAccessible(true);
            fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
        } catch (Exception e) {
            // wrap and re-throw error
            throw new IOException(e);
        }

        // end reflection hackery
    }
}
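
To tie this back to the question's mapper: once fileSplit is populated, the path can be recovered exactly as before (same untested caveat; pa is the Text field from the question):

// continuing at the end of setup(), after the if/else above
if (fileSplit == null) {
    throw new IOException("Unexpected split type: " + splitClass.getName());
}
pa = new Text(fileSplit.getPath().toString());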

Reflection Hackery Explained:

With TaggedInputSplit declared with default (package-private) scope, it's not visible to classes outside the org.apache.hadoop.mapreduce.lib.input package, and therefore you cannot reference that class in your setup method. To get around this, we perform a number of reflection-based operations:

  1. Inspecting the class name, we can test for the type TaggedInputSplit using its fully qualified name:

    splitClass.getName().equals("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")

  2. We know we want to call the TaggedInputSplit.getInputSplit() method to recover the wrapped input split, so we use the Class.getDeclaredMethod(..) reflection method to acquire a reference to the method:

    Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");

  3. The class still isn't publicly visible, so we call setAccessible(..) to override the access check, which would otherwise make the invocation fail with an IllegalAccessException:

    getInputSplitMethod.setAccessible(true);

  4. Finally we invoke the method on the input split and cast the result to a FileSplit (optimistically hoping it's an instance of that type!):

    fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
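
If more than one mapper needs this, the four steps above fold naturally into a small static helper. A sketch along the same lines (the class and method names here are mine, not Hadoop's, and it uses instanceof rather than an exact class check):

import java.io.IOException;
import java.lang.reflect.Method;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitUtil {

    // Unwraps a (possibly tagged) split to the underlying FileSplit.
    public static FileSplit getFileSplit(InputSplit split) throws IOException {
        if (split instanceof FileSplit) {
            return (FileSplit) split;
        }
        if (split.getClass().getName().equals(
                "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
            try {
                Method m = split.getClass().getDeclaredMethod("getInputSplit");
                m.setAccessible(true);
                return (FileSplit) m.invoke(split);
            } catch (Exception e) {
                throw new IOException(e); // wrap any reflection failure
            }
        }
        throw new IOException("Unexpected split type: "
                + split.getClass().getName());
    }
}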

Dishearten answered 21/6, 2012 at 1:30 Comment(6)
It is exactly the same as what you guessed. I tried your code and it is working now. You are so pro!Careen
I read the code again but I am still not sure how it works. Can you give a brief explanation of the reflection hackery? Thanks.Careen
See the added section at the endDishearten
This workaround is great but does anybody know of a less hacky way to pass information from Job to the Mapper with MultipleInputs? FWIW, there is issues.apache.org/jira/browse/MAPREDUCE-2226Catharine
What sort of information are you looking to pass - what's wrong with the passing via the Configuration?Dishearten
Thanks Chris... learned something new today. Although this issue was supposedly fixed in Hadoop 0.20+, I am still facing it in 2.3. I tried your code snippet and it worked. I can't use it in prod though... have to find a way around it.Geld

I had this same problem, but the actual cause was that I was still setting the InputFormat after setting up the MultipleInputs:

job.setInputFormatClass(SequenceFileInputFormat.class);

Once I removed this line everything worked fine.
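
Under the hood, MultipleInputs.addInputPath(..) installs DelegatingInputFormat as the job's input format, so calling setInputFormatClass(..) afterwards silently replaces the delegation and the per-path formats and mappers stop being used. A sketch of a driver that lets MultipleInputs own the input format (textPath, seqPath, TextMapper and SeqMapper are placeholders, not from this answer):

Job job = new Job(conf, "multiple inputs example");
// MultipleInputs wires each path to its own format and mapper...
MultipleInputs.addInputPath(job, textPath, TextInputFormat.class, TextMapper.class);
MultipleInputs.addInputPath(job, seqPath, SequenceFileInputFormat.class, SeqMapper.class);
// ...so do NOT also call this; it overrides DelegatingInputFormat:
// job.setInputFormatClass(SequenceFileInputFormat.class);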

Halogen answered 6/12, 2013 at 16:2 Comment(0)
