I am getting a java.io.IOException: No space left on device error after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr:
df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)

myquery <- df_new %>%
  group_by(text) %>%
  summarize(mycount = n()) %>%
  arrange(desc(mycount)) %>%
  head(10)

# this FAILS with java.io.IOException: No space left on device
get_result <- collect(myquery)
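For context, the pipeline itself builds fine; only collect() fails, since that is what triggers the group-by shuffle. A small sketch (assuming myquery is the tbl_spark built above) to print the SQL that sparklyr generates without executing it:

# Sketch, assuming myquery is the lazy tbl_spark pipeline above: nothing is
# shuffled until collect(), so this only renders the SQL sparklyr will run.
dplyr::show_query(myquery)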
I have set both

spark.local.dir <- "/mypath/"
spark.worker.dir <- "/mypath/"

using the usual
config <- spark_config()
config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "/mypath/"
config$`spark.cores.max` <- "2000"
config$`spark.default.parallelism` <- "4"
config$`spark.total-executor-cores` <- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"

Sys.setenv(SPARK_HOME = "mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)
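To confirm that these values reach the running SparkContext (and not just the submit-time config shown in the Environment tab), something like the following should print the effective setting. This is only a sketch against the JVM SparkContext/SparkConf API, assuming the sc connection created above:

library(sparklyr)

# Query the runtime SparkConf behind the sparklyr connection
local_dir <- sc %>%
  spark_context() %>%              # JVM SparkContext
  invoke("getConf") %>%            # org.apache.spark.SparkConf
  invoke("get", "spark.local.dir")

print(local_dir)                   # expected: "/mypath/"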
where mypath has more than 5 TB of disk space (I can see these options in the Environment tab of the Spark UI). I tried a similar command in PySpark and it failed the same way (same error).
Looking at the Stages tab of the Spark UI, I see that the error occurs when the shuffle write reaches about 60 GB (the input is about 200 GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions for this error...
The cluster job is started with Magpie (https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark).
Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the size of the files in there is very small, nowhere near the 60 GB shuffle write (a quick way of totalling their size from R is sketched after the list below).

- There are about 40 parquet files, each about 700K (the original csv files were 100 GB); they essentially contain strings.
- The cluster is 10 nodes, each with 120 GB RAM and 20 cores.
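A rough sketch of how one might total the on-disk size of those scratch directories from the driver node (the path and the "spark-" pattern are assumptions based on the directory name mentioned above):

# Sketch: total size in GB of the spark-* scratch directories under /mypath
spark_dirs <- list.files("/mypath", pattern = "^spark-", full.names = TRUE)
all_files  <- list.files(spark_dirs, recursive = TRUE, full.names = TRUE)
sum(file.info(all_files)$size) / 1024^3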
What is the problem here? Thanks!!
Comments:

- slurm, actually. I can see my master node by typing squeue -u myname. Is this what you mean? – Send
- I meant the spark-submit command that you issued, but I see you are using some other method to start your application. So the question becomes: how did you manage to set both the spark.local.dir and spark.worker.dir properties? Do you have access to the Spark config files of the cluster? – Illogic
- Via sparklyr (see updated question). In the Spark UI Environment tab, I can see these options appear, so I guess they are taken into account? – Send
- The -u argument does not work for me. Is that the right command? Do you mean running df and du? – Send
- With df -H, I first saw /tmp/ filling up to 80% and then no further activity on this disk or on any other disk visible via df -H. Meanwhile the shuffle write was going from 10 GB to almost 80 GB, then the job crashed with the no space left on device error. After the crash, no disk is actually 100% full. How can that be? – Send
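Given that /tmp/ (and not /mypath) is what fills up, it may be worth checking which scratch directories the executors actually use: per the Spark configuration docs, spark.local.dir is overridden by a SPARK_LOCAL_DIRS environment variable set on standalone workers (or LOCAL_DIRS under YARN), so a value set only in the driver-side config may never reach the executors. A rough sketch for inspecting this from sparklyr (using sdf_len/spark_apply is just one way of running a small function on several executors):

library(sparklyr)
library(dplyr)

# Run a tiny function across 10 partitions so it lands on several executors,
# and report which scratch locations each worker process actually sees.
sdf_len(sc, 10, repartition = 10) %>%
  spark_apply(function(df) {
    data.frame(
      host       = Sys.info()[["nodename"]],
      local_dirs = Sys.getenv("SPARK_LOCAL_DIRS", unset = "<not set>"),
      r_tempdir  = tempdir(),   # temp dir of the R worker process
      stringsAsFactors = FALSE
    )
  }, packages = FALSE) %>%
  collect() %>%
  distinct()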