Spark on YARN mode ends with "Exit status: -100. Diagnostics: Container released on a *lost* node"

I am trying to load a database with 1 TB of data into Spark on AWS, using the latest EMR release. The job runs so long that it hasn't finished even after 6 hours, and after about 6 hours 30 minutes it fails with errors saying the container was released on a lost node. The logs look like this:

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

I am pretty sure my network settings work, because I have already run this script in the same environment on a much smaller table.

Also, I am aware that somebody asked about the same issue 6 months ago (spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released), but I still have to ask, because that question never got an answer.

Andantino answered 2/7, 2016 at 0:39 Comment(7)
@clay Just my guess: a spot instance gets taken back when the market price rises above your bid, and then the node is lost. So if you are running a long-term job, don't use spot instances. I found a way to split my data set into many small tasks, each of which runs for only about 5 minutes and saves a reduced result to S3; after all of them finish, I read the results back from S3 and do another reduce, so I can avoid one long-running job.Andantino
I'm hitting this issue as well :/Styptic
Similar issue here (with a big self-join however). Been hitting it for a bit now. Logs on Resource Manager just say the container was lost. There is no indication as to why. Memory might be an issue.Paleozoology
Can you share the logs of the node?Imperception
@Imperception No... the instance was deleted a long time ago. And the logs are so large that you wouldn't want to read them.Andantino
Ran into the same issue. We also used spot instances, but I'm not sure that's the root cause, because we used a pretty high bid price and never lost any instances during the job execution.Mizzen
Where did you save your 1 TB of data? Is it on S3 or HDFS?Wonderful

It looks like other people are hitting the same issue as well, so I am posting an answer instead of writing a comment. I am not sure this will solve the issue, but it should give you an idea.

If you use spot instances, you should know that a spot instance is shut down whenever the market price rises above your bid, and then you will hit this issue, even if the spot instance is only used as a slave node. So my solution is to not use any spot instances for long-running jobs.

Another idea is to slice the job into many independent steps and save the result of each step as a file on S3. If any error happens, you can restart from that step using the cached files.
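
To make that concrete, here is a minimal PySpark sketch of the idea, assuming hypothetical S3 paths (s3://my-bucket/...) and a hypothetical filter/reduce as the per-step work; the real paths and transformations would come from your own job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpointed-job").getOrCreate()

# Hypothetical S3 location -- replace with your own bucket/prefix.
STEP1_OUTPUT = "s3://my-bucket/checkpoints/step1/"

# Step 1: do one slice of the work and persist the result to S3.
raw = spark.read.parquet("s3://my-bucket/input/")
step1 = raw.filter("event_date >= '2016-01-01'")   # hypothetical per-step work
step1.write.mode("overwrite").parquet(STEP1_OUTPUT)

# Step 2: a separate, restartable step that reads the cached result back
# from S3 instead of recomputing it if an earlier run was interrupted.
cached = spark.read.parquet(STEP1_OUTPUT)
result = cached.groupBy("user_id").count()          # hypothetical final reduce
result.write.mode("overwrite").parquet("s3://my-bucket/output/")

If a node is lost during the final reduce, only that step has to be re-run; the step-1 output is already safely on S3.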

Andantino answered 28/12, 2016 at 3:50 Comment(1)
So based on your solutions, the first option is to get dedicated CORE nodes instead of SPOT task nodes, and the second option is to break the job up into multiple jobs and run them manually in a progressive manner?Obeded

Is it dynamic allocation of memory? I had a similar issue and fixed it by switching to static allocation: I calculated the executor memory, executor cores and number of executors myself. Try static allocation for huge workloads in Spark.
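
As an illustration only, here is a minimal sketch of what static allocation could look like when building the session; the executor numbers are placeholders and would need to be calculated from your instance types, leaving headroom for YARN and OS overhead:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("static-allocation-job")
    # Disable dynamic allocation so Spark keeps a fixed set of executors.
    .config("spark.dynamicAllocation.enabled", "false")
    # Placeholder sizing -- derive these from your node vCPUs and memory.
    .config("spark.executor.instances", "40")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "18g")
    # On older Spark versions this setting is spark.yarn.executor.memoryOverhead.
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)

The same values can also be passed on the command line with spark-submit via --num-executors, --executor-cores and --executor-memory.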

Excide answered 21/8, 2018 at 22:58 Comment(4)
Does unpersisting the unused DataFrames help in this case?Bipropellant
You can try. Are you on the EMR or Cloudera stack? Also check which YARN scheduler is used for resource management (fair or capacity), and then try static memory allocation instead of dynamic by passing the number of executors, etc.Excide
I am using EMR and I don't see any difference in dynamic memory behaviour after using unpersist either.Bipropellant
What I am asking you to do is use static memory allocation: turn off dynamic allocation and pass in the number of executors, the executor memory and the executor cores, calculating them yourself instead of leaving it to Spark's dynamic memory allocation.Excide

This means your YARN container went down. To debug what happened, you must read the YARN logs: use the official CLI (yarn logs -applicationId <application ID>), or feel free to use and contribute to my project https://github.com/ebuildy/yoga, a YARN viewer web app.

You should see a lot of worker errors.

Fellows answered 25/7, 2019 at 9:3 Comment(0)

I was hitting the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr

This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the needed memory per partition.


So I increased the number of DataFrame partitions, which solved my issue.
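
A minimal sketch of that change, assuming a hypothetical input path; 2048 is just the value from the article and should be tuned to your data size:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

# Give shuffle stages smaller partitions as well.
spark.conf.set("spark.sql.shuffle.partitions", "2048")

df = spark.read.parquet("s3://my-bucket/input/")   # hypothetical input path

# More partitions means less data (and memory) per task; repartition()
# performs a full shuffle, so do it once, early in the job.
df = df.repartition(2048)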

Verdi answered 20/12, 2019 at 22:42 Comment(0)

Amazon has provided a solution: it is handled through resource allocation on their side, and there is nothing users can do from their end.

Downstage answered 23/12, 2021 at 2:53 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Doehne

In my case, we were using a GCP Dataproc cluster with 2 preemptible (the default) secondary workers.

This isn't a problem for short-running jobs, since both primary and secondary workers finish quite quickly.

However, for long-running jobs, we observed that the primary workers finished their assigned tasks quite quickly relative to the secondary workers.

Because of their preemptible nature, the containers for tasks assigned to secondary workers get lost after running for 3 hours, resulting in the container-lost errors.

I would recommend not using secondary workers for any long-running jobs.

Kindergartner answered 20/6, 2022 at 3:10 Comment(0)

Check the CloudWatch metrics and instance-state logs of the node that hosted the container: the node was either marked as unhealthy because of high disk utilisation or there was a hardware issue.

In the former case you should see a non-zero value for the "MR unhealthy nodes" metric in the AWS EMR UI, and in the latter case for the "MR lost nodes" metric. Note that the disk utilisation threshold is configured in YARN via the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage setting and defaults to 90%. Similarly to container logs, AWS EMR exports snapshots of the instance state to S3, with lots of useful information such as disk utilisation, CPU utilisation, memory utilisation and stack traces, so have a look at them. To find the EC2 instance ID of a node, match the IP address from the container's logs to the instance ID in the AWS EMR UI.

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/                                                 
                           PRE containers/
                           PRE node/
                           PRE steps/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/ 
                           PRE applications/                                                                                                                                                          
                           PRE daemons/                                                                                                                                                               
                           PRE provision-node/                                                                                                                                                        
                           PRE setup-devices/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/daemons/instance-state/
2023-09-24 13:13:33        748 console.log-2023-09-24-12-08.gz
2023-09-24 13:18:34      55742 instance-state.log-2023-09-24-12-15.gz
...
2023-09-24 17:33:58      60087 instance-state.log-2023-09-24-16-30.gz
2023-09-24 17:54:00      66614 instance-state.log-2023-09-24-16-45.gz
2023-09-24 18:09:01      60932 instance-state.log-2023-09-24-17-00.gz

zcat /tmp/instance-state.log-2023-09-24-16-30.gz
...
# amount of disk free
df -h
Filesystem        Size  Used Avail Use% Mounted on
...
/dev/nvme0n1p1     10G  5.7G  4.4G  57% /
/dev/nvme0n1p128   10M  3.8M  6.2M  38% /boot/efi
/dev/nvme1n1p1    5.0G   83M  5.0G   2% /emr
/dev/nvme1n1p2    1.8T  1.7T  121G  94% /mnt
/dev/nvme2n1      1.8T  1.7T  120G  94% /mnt1
...
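
If high disk utilisation turns out to be the cause, one option is to raise that YARN threshold when the cluster is created. Below is a sketch of the EMR configuration expressed as a Python structure that could be passed to boto3's run_job_flow (or saved as the equivalent JSON for the console/CLI); the 99% value is only an illustration, and more disk space or less shuffle spill is the real fix:

# Raise the NodeManager disk-health threshold from its 90% default.
yarn_site_config = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "99.0"
        },
    }
]

# Example use with boto3 (all other required run_job_flow arguments omitted):
# import boto3
# emr = boto3.client("emr")
# emr.run_job_flow(Name="my-cluster", Configurations=yarn_site_config, ...)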


Hyohyoid answered 29/9, 2023 at 10:38 Comment(0)
