I'm running a large (more than 100 nodes) series of mapreduce jobs on Amazon Elastic MapReduce.
In the reduce phase, already-completed map tasks keep failing with
Map output lost, rescheduling: getMapOutput(attempt_201204182047_0053_m_001053_0,299) failed :
java.io.IOException: Error Reading IndexFile
at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:113)
at org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:66)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3810)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:54)
at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:109)
... 23 more
The proportion of mappers for which this happens is few enough that I wouldn't mind except that when it does, the reducers all pause and wait for the 1 map-task to rerun so the entire job keeps pausing for 1-5 minutes each time.
I think this is related to this bug -> https://issues.apache.org/jira/browse/MAPREDUCE-2980 Does anyone know how to run an EMR job without this happening?
EDIT:
Here's some more information if it's any help. The input format is SequenceFileInputFormat
.
The output format is a slightly modified version of SequenceFileOutputFormat
.
The key-value pairs are user-defined (the value is large and implements Configurable
).
There's no Combiner
, just Mapper
and Reducer
.
I'm using block compression for the input and output (and also there's record compression going on for intermediate kv pairs. This is default for EMR). The codec is the default which is SnappyCodec
I believe.
Finally, it's actually a series of jobs which are run in sequence, each one using the output of the previous job as the input to the next.
The first couple jobs are small and run fine. It's only when the jobs start to grow really big that this happens.