spark: java.io.IOException: No space left on device [again!]
I am getting java.io.IOException: No space left on device after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr:

df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)

myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>% 
  arrange(desc(mycount)) %>% head(10)

#this FAILS
get_result <- collect(myquery)

I have set both

  • spark.local.dir <- "/mypath/"
  • spark.worker.dir <- "/mypath/"

using the usual

config <- spark_config()

config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"

Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)

where /mypath has more than 5 TB of disk space (I can see these options in the Environment tab). I tried a similar command in PySpark and it failed the same way (same error).

Looking at the Stages tab in the Spark UI, I see that the error occurs when the shuffle write is about 60 GB (the input is about 200 GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions...

The cluster job is started with Magpie: https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark

Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the size of the files in there is very small (nowhere near the 60 GB shuffle write).

  • There are about 40 parquet files, each about 700 KB (the original CSV files were 100 GB). They essentially contain strings.
  • The cluster has 10 nodes, each with 120 GB RAM and 20 cores.

What is the problem here? Thanks!!

Gnosticize answered 3/7, 2017 at 14:32 Comment(15)
can you provide the spark command you're using, to show what the master and deploy mode are. that could help a lot in your caseIllogic
@Illogic I am using slurm actually. I can see my master node by typing squeue -u myname. is this what you mean?Send
I was looking for the spark-submit command that you issued, but I've seen you're using some other method to start your application. so the question becomes: how did you manage to set both the spark.local.dir and spark.worker.dir properties? do you have access to the spark config files of the cluster?Illogic
I set them as arguments in sparklyr (see updated question). In the Spark UI Environment tab, I can see these options appear. So I guess they are taken into account?Send
@Illogic does that help?Send
can you check with watch "df -u" while your job is running which disk fills up? is it the root volume? Then I'd check what kind of files are filling it upSkiba
@IgorBerman thanks! df -u prints a gazillion disks. Do I need to spam the command, or does the disk usage refresh by itself on the screen?Send
@Noobie, the point is to find which disk is filling up and why. the watch command usually executes the sub-command once in a while (every 2 secs by default, you can control it...)Skiba
@IgorBerman the -u argument does not work for me. Is that the right command? you mean running df and du?Send
@IgorBerman by spamming df -H I first saw /tmp/ filling up to 80% and then no further activity on this disk or on other disks visible by hitting df -H. Meanwhile the shuffle write was going from 10GB to almost 80GB, then the job crashed on no disk space available. After the crash, no disk is actually 100% full. how can that be?Send
@IgorBerman any ideas? thanks!!Send
@Noobie, the only idea that I have - maybe you are using the sparkR shell or something (I haven't used it) and you put your application inside this shell, so what really works is the configuration of the shell and not the spark config that you are providing... you've already got the advice of restarting the machine, but if you have a spark-slave process (CoarseGrained something, try to find it with ps -ef) - you can restart it first. We've talked about the dir - are you using a spark local context? is it the only machine you are using?Skiba
@IgorBerman I start the Spark job using a magpie script. Please see updated question. I set the parameters there basically. Do you see something that might help?Send
@Noobie, have you tried to play with the following settings in the script: export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie" ? also export SPARK_LOCAL_DIR="/tmp/${USER}/spark"Skiba
@Noobie Have you tried to modify the magpie script to provide the java options for the spark job?Illogic

I've had this problem multiple times before. The reason is temporary files: most servers have a very small partition for /tmp/, which is the default temporary directory for Spark.
I usually change that by setting it in the spark-submit command as follows:

$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....

In your case, I think you can provide that to the configuration in R as follows (I have not tested it, but it should work):

config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"

Notice that you have to change this for both the driver and the executors, since you're using a Spark standalone master (as I can see in your question).
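Putting this together with the connection code from the question, a minimal untested sketch would look like the following (/mypath/ and spark://mynode are just the placeholders from the question):

library(sparklyr)

config <- spark_config()
# send the driver's and executors' Java temp files to the big disk instead of /tmp
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
# keep Spark's own scratch space (shuffle spill, etc.) there as well
config$`spark.local.dir` <- "/mypath/"

sc <- spark_connect(master = "spark://mynode", config = config)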

I hope that will help

Illogic answered 10/7, 2017 at 17:12 Comment(9)
still getting the error.... the option appears in the Environment tab as spark.executor.extraJavaOptions -Djava.io.tmpdir=/mypath. is that correct?Send
and what is the difference with the config$`spark.worker.dir` <- "mypath/" I was using? thx!Send
the options are correct, they are for both the driver and the executors, so you should also have spark.driver.extraJavaOptions -Djava.io.tmpdir=/mypath. The difference is that the worker dir is a Spark option, whereas setting java.io.tmpdir acts on the Java process running Spark and will override the Spark propertiesIllogic
damn, do you see another way to make progress here? am I screwing up the other cluster options? the data is huge and cannot fit into RAMSend
also does it matter how the path is written? say \\mypath\myfolder or \\mypath\myfolder\ Send
I was looking at the sparklyr docs; there is no reason the configuration would not be taken into consideration. I'll try to find out how to sort out that problemIllogic
thanks, I'll update the question with more details about the dataSend
thanks man! my problem is not solved but I'm sure the solutions on this page will help other people out!Send
I've just seen your update. since you're using the magpie script there is a chance to add those configs in the script: github.com/LLNL/magpie/blob/master/submission-scripts/… . here you can add export SPARK_JOB_JAVA_OPTS="-Djava.io.tmpdir=/mypath/" . do not forget to uncomment this line by removing the leading #Illogic

Change the following settings in your magpie script

export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie" 
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"

so that they have the /mypath prefix and not /tmp.

Skiba answered 13/7, 2017 at 7:17 Comment(2)
do I need to keep the user/magpie stuff?Send
it's your decision, but I'd keep it as is, so there will be clear differentiationSkiba

Once you set the parameter, you can see the new value of spark.local.dir in the Spark environment UI, but it doesn't take effect.

I faced a similar problem. After setting this parameter, I restarted the machines and then it started working.

Mittel answered 8/7, 2017 at 13:42 Comment(3)
what do you mean you restarted the machines? I cannot restart the machines every time I run a jobSend
set these parameters in the spark-defaults.conf file and restart the servers (see the sketch below); then there is no need to pass these parameters from outside.Mittel
If that is the case, try setting it on the gateway machine and check once.Mittel
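For reference, the spark-defaults.conf entries referred to above would look roughly like this (a sketch, not tested here; /mypath/ stands in for a path with enough free space):

spark.local.dir                  /mypath/
spark.driver.extraJavaOptions    -Djava.io.tmpdir=/mypath/
spark.executor.extraJavaOptions  -Djava.io.tmpdir=/mypath/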

Since you need to set this when the JVM is launched via spark-submit, you need to use the sparklyr java-options, e.g.

config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
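One way to check that the option actually reached the driver JVM, assuming the connection succeeds, is to ask it for the property directly from the sparklyr session:

# should print /mypath once the java option has taken effect
invoke_static(sc, "java.lang.System", "getProperty", "java.io.tmpdir")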

Assiduity answered 20/9, 2017 at 21:34 Comment(0)

I had this very problem this week on a standalone-mode cluster, and after trying different things, like some of the recommendations in this thread, it turned out to be a subfolder called "work" inside the Spark home folder that had grown unchecked for a while, filling up the worker's HDD.
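If that is what is happening, one possible mitigation (not something tried here, and a standalone-mode worker setting rather than an application config) is to let the workers clean up old application folders themselves via SPARK_WORKER_OPTS in conf/spark-env.sh, for example:

# example values only; cleans application dirs older than 7 days (604800 seconds)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"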

Noway answered 23/9, 2017 at 16:5 Comment(0)
