Why does a job fail with "No space left on device", but df says otherwise?

Asked 7/9, 2014 at 6:44 Answered 4/2, 2020 at 14:43

When performing a shuffle my Spark job fails and says "no space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?

Loco answered 7/9, 2014 at 6:44 Comment(1)

Do you have free space left on the drive that Spark is writing tmp files to? – Electrograph 7/9, 2014 at 21:31

You also need to monitor df -i, which shows how many inodes are in use.

on each machine, we create M * R temporary files for shuffle, where M = number of map tasks, R = number of reduce tasks.

https://spark-project.atlassian.net/browse/SPARK-751

If you do see that the disks are running out of inodes, you can fix the problem in a few ways:

Decrease partitions (see coalesce with shuffle = false).
One can drop the number to O(R) by “consolidating files”. As different file-systems behave differently it’s recommended that you read up on spark.shuffle.consolidateFiles and see https://spark-project.atlassian.net/secure/attachment/10600/Consolidating%20Shuffle%20Files%20in%20Spark.pdf.
Sometimes you may simply find that you need your DevOps to increase the number of inodes the FS supports.
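To make the inode point concrete, here is a minimal Python sketch (not part of the original answer) that compares the M * R estimate quoted above with the free inodes on a filesystem. The function names are hypothetical, and os.statvfs is available on Unix-like systems only.

```python
import os


def shuffle_file_estimate(map_tasks: int, reduce_tasks: int) -> int:
    """Temporary shuffle files per machine under the M * R rule quoted above."""
    return map_tasks * reduce_tasks


def disk_headroom(path: str) -> dict:
    """Free bytes and free inodes for the filesystem that holds `path`.

    Free bytes correspond to what `df -h` reports, free inodes to `df -i`.
    Some filesystems report 0 total inodes because they allocate them
    dynamically, so treat a zero there as "unknown", not "exhausted".
    """
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


def inodes_look_sufficient(path: str, map_tasks: int, reduce_tasks: int) -> bool:
    """True if the free inode count covers the estimated number of shuffle files."""
    room = disk_headroom(path)
    if room["total_inodes"] == 0:
        return True  # inode count not reported; cannot judge from statvfs
    return room["free_inodes"] >= shuffle_file_estimate(map_tasks, reduce_tasks)
```

For example, 2000 map tasks and 2000 reduce tasks give an estimate of 4,000,000 files, which can exhaust the inodes of a small volume long before its bytes run out.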

EDIT

Consolidating files was removed from Spark in version 1.6. https://issues.apache.org/jira/browse/SPARK-9808

Loco answered 7/9, 2014 at 6:44 Comment(1)

What worked for me was I created a directory at my root and gave that path of that directory to .config("spark.local.dir",path) – Innocency 5/11, 2019 at 8:3

By default Spark uses the /tmp directory to store intermediate data. If you do have space left on some other device, you can change this by creating the file SPARK_HOME/conf/spark-defaults.conf and adding the line below. Here SPARK_HOME is the root directory of your Spark install.

spark.local.dir                     SOME/DIR/WHERE/YOU/HAVE/SPACE
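If you are not sure which directory has the most room, a small Python sketch like the following can pick one and print the matching config line. The helper names are hypothetical and the candidate paths are only examples.

```python
import shutil


def pick_scratch_dir(candidates):
    """Return the candidate directory whose filesystem has the most free bytes."""
    usable = []
    for path in candidates:
        try:
            usable.append((shutil.disk_usage(path).free, path))
        except OSError:
            continue  # skip paths that do not exist or cannot be read
    if not usable:
        raise ValueError("none of the candidate directories is accessible")
    return max(usable)[1]


def defaults_conf_line(path):
    """Format the line to place in SPARK_HOME/conf/spark-defaults.conf."""
    return f"spark.local.dir {path}"


if __name__ == "__main__":
    # Example candidates only; replace them with directories on your machine.
    best = pick_scratch_dir(["/tmp", "/"])
    print(defaults_conf_line(best))
```

The same path can be passed programmatically with .config("spark.local.dir", path), as the comments in this thread describe.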

Melesa answered 27/8, 2015 at 15:28 Comment(1)

This works for me. In Jupyterhub + Spark, try to add: .config('spark.local.dir', 'SOME/DIR/WHERE/YOU/HAVE/SPACE') in your sparkConf, it works too. – Nessie 30/1, 2019 at 15:17

I encountered a similar problem. By default, Spark uses "/tmp" to save intermediate files. While the job is running, you can run df -h and watch the used space of the filesystem mounted at "/" grow. When the device runs out of space, this exception is thrown. To solve the problem, I pointed Spark at a path on a filesystem with enough free space. Note that SPARK_LOCAL_DIRS is an environment variable (typically set in SPARK_HOME/conf/spark-env.sh); the equivalent property in SPARK_HOME/conf/spark-defaults.conf is spark.local.dir.

Suspend answered 30/12, 2015 at 6:54 Comment(0)

Another scenario for this error:

I have a Spark job that uses two sources of data (~150 GB and ~100 GB) and performs an inner join plus many group-by, filtering, and mapping operations.
I created a 20-node (r3.2xlarge) Spark cluster using the spark-ec2 scripts.

Problem:

My job threw the error "No space left on device". The job requires a lot of shuffling, so to counter the problem I started with 20 nodes and then increased to 40. The problem still occurred. I tried everything else I could think of: changing spark.local.dir, repartitioning, custom partitions, and parameter tuning (compression, spilling, memory, memory fraction, etc.). I was also using the r3.2xlarge instance type, which has 1 x 160 GB SSD, but the problem persisted.

Solution:

I logged into one of the nodes and ran df -h. The node had only one mounted EBS volume (8 GB); the 160 GB SSD was not mounted. Listing /dev/ showed that the SSD was attached, though. This was not the case on every node in the cluster: the "No space left on device" error occurred only on the nodes where the SSD was not mounted, because they had only the 8 GB EBS volume to work with, of which about 4 GB was free.
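The check described above (device attached but not mounted) can be sketched in Python by parsing text in /proc/mounts format. This is an illustration, not the script the answer used; the function names and the sample text are assumptions, and the device name the kernel reports on EC2 may differ from /dev/sdb.

```python
def mounted_devices(mounts_text):
    """Map device -> mount point from text in /proc/mounts format."""
    table = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table


def unmounted(expected_devices, mounts_text):
    """Return the expected devices that do not appear in the mount table."""
    mounted = mounted_devices(mounts_text)
    return [dev for dev in expected_devices if dev not in mounted]


if __name__ == "__main__":
    # Made-up sample: only the root volume is mounted.
    # On a real Linux node, read the text from /proc/mounts instead.
    sample = "/dev/xvda1 / ext4 rw,noatime 0 0\nproc /proc proc rw 0 0\n"
    print(unmounted(["/dev/xvda1", "/dev/sdb"], sample))
```

Running a check like this on every worker would show which nodes are left with only the small root volume.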

I created another bash script that launches the Spark cluster using the spark-ec2 script and then formats and mounts the disk.

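# ec2-script to launch cluster (placeholder: run your spark-ec2 launch command here)
# <ec2-script> is a placeholder for the path to the spark-ec2 script
MASTER_HOST=$(<ec2-script> get-master "$CLUSTER_NAME")
# Format the SSD on every worker, then mount it at /mnt
ssh -o StrictHostKeyChecking=no "root@$MASTER_HOST" "cd /root/spark/sbin/ && ./slaves.sh mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sdb && ./slaves.sh mount -o defaults,noatime,nodiratime /dev/sdb /mnt"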

Margay answered 4/4, 2017 at 19:32 Comment(1)

are these scripts already available on the hosts? – Kepler 27/12, 2022 at 1:54

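On the worker machine, set the environment variable "SPARK_LOCAL_DIRS" to a location that has free space. From Spark 1.0 onward, the configuration variable "spark.local.dir" is overridden by SPARK_LOCAL_DIRS (standalone), MESOS_SANDBOX (Mesos), or LOCAL_DIRS (YARN) when the cluster manager sets them, so setting it alone may have no effect.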
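A simplified model of why the environment variable wins, assuming standalone mode, where the Spark documentation says SPARK_LOCAL_DIRS overrides spark.local.dir. This is a Python sketch of the precedence only, not Spark's actual code, and effective_local_dirs is a hypothetical helper.

```python
import os


def effective_local_dirs(conf, env=None):
    """Resolve scratch directories: environment first, then config, then /tmp.

    `conf` is a plain dict standing in for Spark properties; `env` defaults
    to the current process environment.
    """
    env = os.environ if env is None else env
    raw = env.get("SPARK_LOCAL_DIRS") or conf.get("spark.local.dir") or "/tmp"
    return [d.strip() for d in raw.split(",") if d.strip()]


if __name__ == "__main__":
    conf = {"spark.local.dir": "/data/from-conf"}
    # With the variable set, the value from the config is ignored.
    print(effective_local_dirs(conf, {"SPARK_LOCAL_DIRS": "/mnt/a,/mnt/b"}))
    # Without it, the config value is used.
    print(effective_local_dirs(conf, {}))
```

On YARN and Mesos the cluster manager supplies different variables, so this model does not apply there as written.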

Grassquit answered 22/3, 2018 at 16:10 Comment(0)

Some other workarounds:

Explicitly remove the intermediate shuffle files. If you don't need to keep the RDD for later computation, you can call .unpersist(), which will flag the intermediate shuffle files for removal (you can also re-assign the RDD variable to None).
Use more workers. Adding workers reduces, on average, the number of intermediate shuffle files needed per worker.

More about the "No space left on device" error on this databricks thread: https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html

Shy answered 28/7, 2015 at 15:27 Comment(0)

What space is this?

Spark actually writes temporary output files from “map” tasks and RDDs to external storage called “scratch space”, and by default, “scratch space” is on local machine’s /tmp directory.

/tmp is usually the operating system’s (OS) temporary output directory, accessed by OS users, and /tmp is typically small and on a single disk. So when Spark runs lots of jobs, long jobs, or complex jobs, /tmp can fill up quickly, forcing Spark to throw “No space left on device” exceptions.

Because Spark constantly writes to and reads from its scratch space, disk IO can be heavy and can slow down your workload. The best way to resolve this issue and to boost performance is to give as many disks as possible to handle scratch space disk IO. To achieve both, explicitly define parameter spark.local.dir in spark-defaults.conf configuration file, as follows:

spark.local.dir /data1/tmp,/data2/tmp,/data3/tmp,/data4/tmp,/data5/tmp,/data6/tmp,/data7/tmp,/data8/tmp

The above comma-delimited setting will spread out Spark scratch space onto 8 disks (make sure each /data* directory is configured on a separate physical data disk), and under the /data*/tmp directories. You can create any sub directory names instead of ‘tmp’.
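As a rough aid, the comma-delimited value can be generated and sanity-checked in Python. This is a sketch with hypothetical helper names. The st_dev comparison only detects directories that share a filesystem; it cannot prove that two filesystems sit on separate physical disks.

```python
import os


def local_dir_value(mount_points, subdir="tmp"):
    """Build the comma-delimited spark.local.dir value from mount points."""
    return ",".join(os.path.join(m, subdir) for m in mount_points)


def shared_filesystems(paths):
    """Group existing paths by filesystem ID; return groups with more than one path.

    A non-empty result means several scratch directories share a filesystem,
    so they add neither capacity nor IO bandwidth.
    """
    by_dev = {}
    for p in paths:
        by_dev.setdefault(os.stat(p).st_dev, []).append(p)
    return [group for group in by_dev.values() if len(group) > 1]


if __name__ == "__main__":
    # /data1 and /data2 mirror the example setting above.
    print("spark.local.dir " + local_dir_value(["/data1", "/data2"]))
```

Run shared_filesystems on the mount points before relying on the setting; an empty list is what you want to see.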

Source: https://developer.ibm.com/hadoop/2016/07/18/troubleshooting-and-tuning-spark-for-heavy-workloads/

Condensed answered 4/2, 2020 at 14:43 Comment(0)

Please point SPARK_HOME at a directory on a volume with more free space, so that the job has enough room to run smoothly.

Wealth answered 30/11, 2017 at 10:26 Comment(0)
