Elastic MapReduce: Map output lost
I'm running a series of MapReduce jobs on a large (more than 100 nodes) Amazon Elastic MapReduce cluster.

In the reduce phase, already-completed map tasks keep failing with

Map output lost, rescheduling: getMapOutput(attempt_201204182047_0053_m_001053_0,299) failed :
java.io.IOException: Error Reading IndexFile
    at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:113)
    at org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:66)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3810)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
    at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:54)
    at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:109)
    ... 23 more

This happens to few enough mappers that I wouldn't mind, except that when it does, all the reducers pause and wait for the one map task to rerun, so the entire job stalls for 1-5 minutes each time.

I think this is related to this bug: https://issues.apache.org/jira/browse/MAPREDUCE-2980

Does anyone know how to run an EMR job without this happening?

EDIT: Here's some more information in case it helps. The input format is SequenceFileInputFormat, and the output format is a slightly modified version of SequenceFileOutputFormat. The key-value pairs are user-defined (the value is large and implements Configurable). There's no Combiner, just a Mapper and a Reducer. I'm using block compression for the input and output, and record compression for the intermediate key-value pairs (that's the EMR default). I believe the codec is the default, SnappyCodec. Finally, it's actually a series of jobs run in sequence, each one using the previous job's output as its input. The first couple of jobs are small and run fine; it's only when the jobs grow really big that this happens.
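
For reference, here's a minimal sketch of roughly how a job like the one described above might be wired up with the old mapred API (the stack trace above comes from that API). MyMapper, MyReducer, and the key/value classes are hypothetical placeholders, and the compression settings simply mirror what the edit describes (block-compressed SequenceFile output, Snappy-compressed intermediate records):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class JobSetupSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JobSetupSketch.class);
            conf.setJobName("big-sequencefile-job");

            // SequenceFile in, SequenceFile out, as described in the question
            conf.setInputFormat(SequenceFileInputFormat.class);
            conf.setOutputFormat(SequenceFileOutputFormat.class);

            // Hypothetical user-defined classes would be registered here:
            // conf.setMapperClass(MyMapper.class);
            // conf.setReducerClass(MyReducer.class);
            // conf.setOutputKeyClass(MyKey.class);
            // conf.setOutputValueClass(MyLargeConfigurableValue.class);

            // Block compression for the job output
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(conf,
                    SequenceFile.CompressionType.BLOCK);

            // Record compression for intermediate map output (the EMR default)
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(SnappyCodec.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }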

Pennoncel answered 19/4, 2012 at 6:39 Comment(13)
I tried using AMI 1.0 and Hadoop 0.20.2, but I ran into a completely different bug, something to do with memory, starting processes, and forking.Pennoncel
Seems like a configuration problem to me. Did you define a tmp folder for the MapReduce job? Check the hadoop.tmp.dir property and whether you have access to that folder. Can you please post your code/config?Germayne
I'm using all the EMR defaults. It's a whole project; there's no specific part of my code I could post that would be particularly helpful here.Pennoncel
I should clarify that this doesn't happen immediately or consistently. Something like 1 in 30 tasks fails this way, and only once the job grows large enough; anything less than, say, 500 map/reduce tasks doesn't have this problem.Pennoncel
Are you using a custom JAR? Where did you get the Hadoop classes to compile against? docs.amazonwebservices.com/ElasticMapReduce/latest/…Gabrielegabriell
Yes, I'm using a custom JAR. The project is a Maven project; I set Hadoop as a "provided" dependency and specified version 0.20.205.0, which is the same version EMR uses. The JAR was built with the Maven Shade plugin. None of this matters, though: I know everything compiled correctly, and I'm not asking whether there's a bug in my program. What I want to know is how I can work around a known Jetty bug.Pennoncel
Sorry, I posted the wrong link. Here's the JIRA for the bug I actually meant: issues.apache.org/jira/browse/MAPREDUCE-2389. Scroll down to the second comment and you'll see the exact same exception occurring in the exact same circumstances.Pennoncel
You should be able to use an older version of Jetty by using an older version of Hadoop. If you're using the old API, you might be able to fall back to 0.18. If you're using the new API, you might be out of luck, since 20.2 seems to be the oldest EMR supports.Jampack
Otherwise you can set up a bootstrap action that replaces the Jetty jar before your job runs (see the sketch after this comment thread).Jampack
The bootstrap action sounds promising. The JIRA I linked mentions a special Jetty build that apparently fixes this exact bug, so I can compile that jar. What I don't know is how to set up a bootstrap action that uses it to replace the old one. Could you tell me more about how to do that?Pennoncel
@dspyz, was the issue resolved? To me it seems that the reduce task has trouble accessing the intermediate data during its shuffle phase and re-runs the map task to recreate the data.Aret
Sorry, I don't remember anything about this. It was four years ago.Pennoncel
Given the comments, I decided to vote to close because the issue cannot be reproduced.Blackett
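
To flesh out Jampack's bootstrap-action suggestion: a bootstrap action is a script stored in S3 that EMR runs on every node before the Hadoop daemons start, so a script that overwrites the Jetty jar under Hadoop's lib directory can be attached when the job flow is launched. Below is a minimal sketch using the AWS SDK for Java; the bucket name, script path, and instance settings are hypothetical, and the script itself is assumed to copy the patched Jetty jar over the stock one:

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

    public class LaunchWithPatchedJetty {
        public static void main(String[] args) {
            // Hypothetical script in S3; it would overwrite the stock Jetty jar
            // (e.g. under $HADOOP_HOME/lib) with a patched build on each node
            // before the TaskTracker starts.
            BootstrapActionConfig replaceJetty = new BootstrapActionConfig()
                    .withName("replace-jetty-jar")
                    .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                            .withPath("s3://my-bucket/bootstrap/replace-jetty-jar.sh"));

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("job-with-patched-jetty")
                    .withBootstrapActions(replaceJetty)
                    .withInstances(new JobFlowInstancesConfig()
                            .withHadoopVersion("0.20.205")
                            .withInstanceCount(10)
                            .withMasterInstanceType("m1.large")
                            .withSlaveInstanceType("m1.large"));

            AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                    new DefaultAWSCredentialsProviderChain());
            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println("Started job flow " + result.getJobFlowId());
        }
    }

The same action can be attached from the EMR CLI or console; the important part is that bootstrap actions run before any tasks start, so every TaskTracker serves map output through the replaced Jetty.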

There has been some development since this question was asked. The problem was discussed in more detail in JIRA MAPREDUCE-2389.

A key point in that JIRA, and in every other JIRA linked here or there, is that the issue was tied to the Jetty version, and at the time it was not possible to move to a newer Jetty release.

By now the Jetty version in question is long deprecated, so the situation should be fully resolved for everyone.

Blackett answered 1/8, 2020 at 20:53 Comment(0)
