Spark Error: executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I am working with the following Spark config:

maxCores = 5
driverMemory = 2g
executorMemory = 17g
executorInstances = 100

Issue: out of 100 executors, my job ends up with only 10 active executors, even though enough memory is available. Even after setting the executor count to 250, only 10 remain active. All I am doing is loading a multi-partition Hive table and running df.count over it.

Please help me understand what is causing the executors to be killed:
17/12/20 11:08:21 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
17/12/20 11:08:21 INFO storage.DiskBlockManager: Shutdown hook called
17/12/20 11:08:21 INFO util.ShutdownHookManager: Shutdown hook called

Not sure why YARN is killing my executors.
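
For reference, a minimal sketch of how this config would be passed to spark-submit; the class name and jar path are placeholders, and mapping maxCores to --executor-cores is an assumption:

# Sketch only: class and jar are placeholders, the values are the ones listed above.
spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.example.CountJob \
--driver-memory 2g \
--executor-memory 17g \
--executor-cores 5 \
--num-executors 100 \
/path/to/count-job.jar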

Sindhi answered 20/12, 2017 at 13:51 Comment(3)
You should really look at the YARN logs using yarn logs -applicationId, if they are available. – All
The issue is likely executor memory overhead; turn up the value of spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead, or both. Let me know the result. Also check whether some other memory-consuming job is running in the background. – Impolite
Any solution? @Freeman – Xenon
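
A minimal sketch of the overhead tuning suggested in the comment above; the 2048 MB values are illustrative and the class/jar are placeholders (on Spark 2.3+ the non-deprecated names are spark.driver.memoryOverhead and spark.executor.memoryOverhead):

# Raise the YARN memory overhead for driver and executors (values in MB, illustrative only).
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf "spark.yarn.driver.memoryOverhead=2048" \
--conf "spark.yarn.executor.memoryOverhead=2048" \
--class com.example.CountJob \
/path/to/count-job.jar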

I faced a similar issue, and investigating the NodeManager logs led me to the root cause. You can access them via the web interface:

nodeManagerAddress:PORT/logs

The PORT is specified in yarn-site.xml under yarn.nodemanager.webapp.address (default: 8042).

My investigation workflow (sketched below):

  1. Collect the logs (yarn logs ... command)
  2. Identify the node and container (in those logs) emitting the error
  3. Search the NodeManager logs by the timestamp of the error for the root cause
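
A minimal sketch of that workflow on the command line; the application ID, the timestamp format, and the NodeManager log path are placeholders that vary by cluster and distribution:

# 1. Collect the aggregated application logs (application ID is a placeholder).
yarn logs -applicationId application_1513766000000_0001 > app.log

# 2. Identify the container and node emitting the error: print the last
#    "Container: ... on <node>" header seen before the TERM message.
awk '/^Container:/ {c = $0} /RECEIVED SIGNAL TERM/ {print c}' app.log

# 3. On that node, search the NodeManager log around the error's timestamp
#    (log path and timestamp format are assumptions here).
grep "2017-12-20 11:08" /var/log/hadoop-yarn/*nodemanager*.log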

By the way, you can access the aggregated collection (XML) of all configurations affecting a node on the same port at:

nodeManagerAddress:PORT/conf
Shortridge answered 8/3, 2018 at 20:3 Comment(1)
I checked; nothing useful there. – Tamera

I believe this issue has more to do with memory and the dynamic allocation idle timeouts at the executor/container level. Make sure you can change the config params at the executor/container level.

One of the ways you can resolve this is by changing this config value, either in your spark-shell or in your Spark job:

spark.dynamicAllocation.executorIdleTimeout

This thread has more detailed information on how to resolve the issue; it worked for me: https://jira.apache.org/jira/browse/SPARK-21733
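
For example, the timeout can be raised when launching a spark-shell; a minimal sketch, assuming dynamic allocation and the external shuffle service are enabled on the cluster (the 300s value is illustrative, the default is 60s):

# Keep idle executors alive longer before dynamic allocation releases them.
spark-shell \
--master yarn \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.shuffle.service.enabled=true" \
--conf "spark.dynamicAllocation.executorIdleTimeout=300s"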

Lieutenant answered 8/3, 2019 at 2:41 Comment(0)

I had the same issue: my Spark job was using only one task node and the other provisioned nodes were being killed. The same thing happened after switching to EMR Serverless, where my job ran on only one "thread". The spark-submit below fixed it for me:

spark-submit \
--name KSSH-0.3 \
--class com.jiuye.KSSH     \
--master yarn     \
--deploy-mode cluster     \
--driver-memory 2g     \
--executor-memory 2g     \
--executor-cores 1 \
--num-executors 8 \
--jars $(echo /opt/software/spark2.1.1/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.yarn.am.memory=1024m" \
--conf "spark.yarn.am.memoryOverhead=1024m" \
--conf "spark.yarn.driver.memoryOverhead=1024m" \
--conf "spark.yarn.executor.memoryOverhead=1024m" \
--conf "spark.yarn.am.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=20 -XX:+DisableExplicitGC -Dcdh.version=5.12.0" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1250" \
--conf "spark.locality.wait=1s" \
--conf "spark.shuffle.consolidateFiles=true" \

--conf "spark.executor.heartbeatInterval=360000" \
--conf "spark.network.timeout=420000" \

--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
/opt/software/spark2.1.1/spark_on_yarn/KSSH-0.3.jar
Stellastellar answered 13/12, 2022 at 15:33 Comment(1)
This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review – Equivocal

I faced this issue because of a TLS misconfiguration on the Spark worker. The issue occurs when the spark-worker, spark-master, and spark-client (the initiator of the Spark job) cannot connect to each other.

Please check that the connectivity between each of the components is healthy.
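
A quick way to sanity-check reachability between the components; the host name and port are placeholders (7077 is only the default standalone master RPC port):

# Check that the port is reachable at all.
nc -zv spark-master.example.com 7077

# If TLS is enabled, also check that the certificate handshake succeeds.
openssl s_client -connect spark-master.example.com:7077 </dev/null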

Partridge answered 24/4 at 8:4 Comment(0)
