Spark Job submitted - Waiting (TaskSchedulerImpl: Initial job not accepted)
An API call was made to submit the job. The response states that it is running.
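For context, such a submission typically goes to the standalone master's hidden REST endpoint on port 6066 (the same "REST URL" shown on the master UI). Below is a minimal sketch of such a call in Python; the endpoint path and payload fields follow that REST submission API's usual shape, so treat the exact values as assumptions rather than the actual request that was sent:

import requests

# Hypothetical reconstruction of the Postman-style submission; payload fields
# follow Spark's standalone REST submission API on port 6066.
payload = {
    "action": "CreateSubmissionRequest",
    "appResource": "file:/root/wordcount.py",
    "appArgs": ["/root/wordcount.py"],
    "clientSparkVersion": "1.6.1",
    "mainClass": "org.apache.spark.deploy.SparkSubmit",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "MyApp",
        "spark.submit.deployMode": "cluster",
        "spark.master": "spark://ec2-54-209-108-127.compute-1.amazonaws.com:6066",
    },
}

response = requests.post(
    "http://ec2-54-209-108-127.compute-1.amazonaws.com:6066/v1/submissions/create",
    json=payload)
print(response.json())  # on acceptance, includes a submissionId and a success flag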

On Cluster UI -

Worker (slave) - worker-20160712083825-172.31.17.189-59433 is Alive

Cores: 1 of 2 used

Memory: 1 GB of 6 GB used

Running Application

app-20160713130056-0020 - Waiting for 5 hours

Cores - unlimited

Job Description of the Application

Active Stage

reduceByKey at /root/wordcount.py:23

Pending Stage

takeOrdered at /root/wordcount.py:26

Running Driver -

stderr log page for driver-20160713130051-0025 

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

According to "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", the slaves haven't been started, and hence the cluster has no resources.

However, in my case, Slave 1 is working.

According to "Unable to Execute More than a Spark Job - Initial job has not accepted any resources", I am using deploy-mode = cluster (not client), since I have 1 master and 1 slave, and the submit API is called via Postman / anywhere.

Also, the cluster has available cores, RAM, and memory, yet the job still throws the error, as conveyed by the UI.

According to "TaskSchedulerImpl: Initial job has not accepted any resources", I assigned the following Spark environment variables in

~/spark-1.5.0/conf/spark-env.sh

# one worker process per node
SPARK_WORKER_INSTANCES=1
# total memory the worker may hand out to executors
SPARK_WORKER_MEMORY=1000m
# total cores the worker may hand out to executors
SPARK_WORKER_CORES=2

I replicated these across the slaves:

sudo /root/spark-ec2/copy-dir /root/spark/conf/spark-env.sh
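To double-check that the restarted workers actually registered with the master, the master's JSON status endpoint can be queried outside the web UI. A quick sketch, assuming the /json view that Spark 1.x standalone masters expose on port 8080 (field names taken from that output):

import requests

# Query the standalone master's JSON view of cluster state (Spark 1.x
# exposes this at http://<master>:8080/json).
state = requests.get(
    "http://ec2-54-209-108-127.compute-1.amazonaws.com:8080/json").json()

print("workers registered:", len(state["workers"]))
print("cores: %s total, %s used" % (state["cores"], state["coresused"]))
print("memory (MB): %s total, %s used" % (state["memory"], state["memoryused"]))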

All the cases in the answer to the above question were applicable, yet no solution was found. Hence, because I was working with APIs and Apache Spark, maybe some other assistance is required.

Edited July 18, 2016

wordcount.py - my PySpark application code:

from pyspark import SparkContext, SparkConf

logFile = "/user/root/In/a.txt"

# NOTE: "num-executors" is not a valid SparkConf key (the corresponding
# property, spark.executor.instances, only applies on YARN), so this
# setting has no effect on a standalone cluster.
conf = SparkConf().set("num-executors", "1")

sc = SparkContext(master="spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077",
                  appName="MyApp", conf=conf)
print("in here")
lines = sc.textFile(logFile)  # lazy; nothing runs on the cluster yet
print("text read")
c = lines.count()             # first action: this is where the job hangs
print("lines counted")

Error

Starting job: count at /root/wordcount.py:11
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Got job 0 (count at /root/wordcount.py:11) with 2 output partitions
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at /root/wordcount.py:11)
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at count at /root/wordcount.py:11), which has no missing parents
16/07/18 07:46:39 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.6 KB, free 56.2 KB)
16/07/18 07:46:39 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.4 KB, free 59.7 KB)
16/07/18 07:46:39 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.17.189:43684 (size: 3.4 KB, free: 511.5 MB)
16/07/18 07:46:39 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (PythonRDD[2] at count at /root/wordcount.py:11)
16/07/18 07:46:39 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/07/18 07:46:54 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

According to "Spark UI showing 0 cores even when setting cores in App":

The Spark web UI shows zero cores used and an indefinite wait with no tasks running. The application also uses no memory or cores whatsoever at runtime and hits a WAITING status immediately upon starting.
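Since "num-executors" has no effect here (see the note in the code above), nothing explicitly bounds the application's demands in standalone mode, which matches the UI's "Cores - unlimited". A sketch of the standard way to cap them so they fit the 2-core, 6 GB worker; the property names are standard Spark settings, but the values here are assumptions:

from pyspark import SparkContext, SparkConf

# Cap the application's resource demands for standalone mode; without
# spark.cores.max the app requests unlimited cores (as the UI shows).
conf = (SparkConf()
        .setAppName("MyApp")
        .set("spark.cores.max", "2")            # total cores for the whole app
        .set("spark.executor.memory", "512m"))  # per-executor memory

sc = SparkContext(
    master="spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077",
    conf=conf)

Whether this resolves the wait depends on what the worker can actually offer, but it makes the request explicit instead of unbounded.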

Spark version 1.6.1, Ubuntu, Amazon EC2.

Rountree answered 13/7, 2016 at 19:06. Comments (10):
Tried running another code - a simple Python application - still the error persists: from pyspark import SparkContext, SparkConf; logFile = "/user/root/In/a.txt"; conf = (SparkConf().set("num-executors", "1")); sc = SparkContext(master = "spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077", appName = "MyApp", conf = conf); textFile = sc.textFile(logFile); wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b); wordCounts.saveAsTextFile("/user/root/In/output.txt") - Rountree
Are you able to run it using spark-submit? - Schram
Try reducing the memory-per-node settings in spark-submit or the API. - Schram
Can't seem to find the setting as far as the API call is concerned. - Rountree
The environment variables of the master are set in /root/spark/conf/spark-env.sh: export SPARK_WORKER_INSTANCES=1; export SPARK_WORKER_CORES=2; export SPARK_WORKER_MEMORY=1000; export HADOOP_HOME="/root/ephemeral-hdfs"; export SPARK_MASTER_IP=ec2-w-x-y-z.compute-1.amazonaws.com; export MASTER=`cat /root/spark-ec2/cluster-url` - Rountree
I was able to run the spark-submit command on the EC2 master instance in client mode, since it says "cluster deploy mode not supported for standalone clusters". However, it still does not create the output file in HDFS. So basically: 1. cluster deploy mode issue; 2. output not created in HDFS; 3. API call for spark-submit not working. - Rountree
@ChaitanyaBapat The SPARK_WORKER_INSTANCES property is deprecated and will have no effect on your configuration. It looks like you are assigning 1 GB of memory and 2 cores per executor, correct? How much total memory do you have on your worker? - Reid
@Schram I SSHed into the driver as well as the worker; on both, spark-submit wordcount.py runs perfectly. I compared the log with the error log of the spark-submit API - 16/07/18 11:39:50 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks was the last point of similarity. After that, the running code logged 16/07/18 11:39:52 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-17-189.ec2.internal:53938) with ID 0, whereas the non-running code gave the error TaskSchedulerImpl: Initial job has not accepted any resources; check ... and have sufficient resources - Rountree
@Hawknight Yes, 6.3 GB total available for the worker, as shown on the cluster UI. I removed SPARK_WORKER_INSTANCES from spark-env.sh; still the error persists. - Rountree
This could possibly be because the worker binds to a private address that is non-routable over the Internet, such as 10.0.0.4. I'm on Azure and I've configured a public IP in front of the 10.0.0.4 of the actual cluster connections, so I'm facing the same issue: it's as if it cannot find the worker because it's hidden behind the public IP (with a NAT). It's 99% because of that, I believe. Check this too: docs.datastax.com/en/developer/java-driver/2.1/manual/… - Houseless
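For the NAT scenario in that last comment, the usual mitigation is to advertise routable addresses: SPARK_LOCAL_IP in spark-env.sh controls what a worker binds to, and on the driver side a sketch might look like the following (spark.driver.host is a standard Spark property; the address below is only a placeholder):

from pyspark import SparkConf

# Placeholder address: advertise a hostname/IP the workers can route to,
# rather than the driver's private, NAT-hidden interface.
conf = SparkConf().set("spark.driver.host", "203.0.113.10")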

I also have the same issue. Below are my observations from when it occurs.

1:17:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I noticed that it only occurs during the first query from the Scala shell, where I run something that fetches data from HDFS.

When the problem occurs, the web UI states that there are no running applications.

URL: spark://spark1:7077
REST URL: spark://spark1:6066 (cluster mode)
Alive Workers: 4
Cores in use: 26 Total, 26 Used
Memory in use: 52.7 GB Total, 4.0 GB Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed 
Status: ALIVE

It seems that something fails to start, but I can't tell exactly what it is.

However, restarting the cluster a second time sets the Applications value to 1, and everything works well.

URL: spark://spark1:7077
REST URL: spark://spark1:6066 (cluster mode)
Alive Workers: 4
Cores in use: 26 Total, 26 Used
Memory in use: 52.7 GB Total, 4.0 GB Used
Applications: 1 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

I'm still investigating; this quick workaround can save time until a final solution is found.

Microgroove answered 11/9, 2016 at 9:35

You can take a look at my answer to a similar question, Apache Spark on Mesos: "Initial job has not accepted any resources":

While most of the other answers focus on resource allocation (cores, memory) on the Spark slaves, I would like to highlight that a firewall can cause exactly the same issue, especially when you are running Spark on a cloud platform.

If you can see the Spark slaves in the web UI, you have probably opened the standard ports 8080, 8081, 7077, and 4040. Nonetheless, when you actually run a job, it uses SPARK_WORKER_PORT, spark.driver.port and spark.blockManager.port, which by default are randomly assigned. If your firewall is blocking these ports, the master cannot retrieve any job-specific responses from the slaves and returns the error.

You can run a quick test by opening all the ports and seeing whether the slaves accept jobs.
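If opening everything permanently is not an option, the normally random ports mentioned above can instead be pinned to fixed values and only those opened. A short sketch with arbitrary port numbers (the property names are standard Spark settings; the values are assumptions):

from pyspark import SparkConf

# Pin the normally random ports to fixed values so the firewall can allow
# exactly these instead of the whole ephemeral range.
conf = (SparkConf()
        .set("spark.driver.port", "51001")        # driver <-> executor RPC
        .set("spark.blockManager.port", "51002")  # block / shuffle transfers
        .set("spark.port.maxRetries", "16"))      # retries above each base port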

Winograd answered 8/12, 2017 at 2:47
