Spark saveAsTextFile() results in Mkdirs failed to create for half of the directory

I am currently running a Java Spark application in Tomcat and receiving the following exception:

Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.json/_temporary/0/_temporary/attempt_201603031703_0001_m_000000_5

on the line

text.saveAsTextFile("/opt/folder/tmp/file.json") //where text is a JavaRDD<String>

The odd part is that /opt/folder/tmp/ already exists, and the job successfully creates everything up to /opt/folder/tmp/file.json/_temporary/0/. It then fails on what looks like a permission issue with the remaining part of the path, _temporary/attempt_201603031703_0001_m_000000_5, even though I gave the tomcat user ownership and permissions on the tmp/ directory (chown -R tomcat:tomcat tmp/ and chmod -R 755 tmp/). Does anyone know what could be happening?

Thanks

Edit for @javadba:

[root@ip tmp]# ls -lrta 
total 12
drwxr-xr-x 4 tomcat tomcat 4096 Mar  3 16:44 ..
drwxr-xr-x 3 tomcat tomcat 4096 Mar  7 20:01 file.json
drwxrwxrwx 3 tomcat tomcat 4096 Mar  7 20:01 .

[root@ip tmp]# cd file.json/
[root@ip file.json]# ls -lrta 
total 12
drwxr-xr-x 3 tomcat tomcat 4096 Mar  7 20:01 _temporary
drwxrwxrwx 3 tomcat tomcat 4096 Mar  7 20:01 ..
drwxr-xr-x 3 tomcat tomcat 4096 Mar  7 20:01 .

[root@ip file.json]# cd _temporary/
[root@ip _temporary]# ls -lrta 
total 12
drwxr-xr-x 2 tomcat tomcat 4096 Mar  7 20:01 0
drwxr-xr-x 3 tomcat tomcat 4096 Mar  7 20:01 ..
drwxr-xr-x 3 tomcat tomcat 4096 Mar  7 20:01 .

[root@ip _temporary]# cd 0/
[root@ip 0]# ls -lrta 
total 8
drwxr-xr-x 3 tomcat tomcat 4096 Mar  7 20:01 ..
drwxr-xr-x 2 tomcat tomcat 4096 Mar  7 20:01 .

The exception in catalina.out:

Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.json/_temporary/0/_temporary/attempt_201603072001_0001_m_000000_5
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Aksoyn answered 3/3, 2016 at 17:13 Comment(1)
Can you post how you submit your application? What master? And do you use speculation, by any chance? – Liederkranz

saveAsTextFile is actually processed by the Spark executors. Depending on your Spark setup, the executors may run as a different user than your Spark driver. My guess is that the driver prepares the directory for the job fine, but then the executors, running as a different user, have no rights to write to that directory.

Changing to 777 won't help, because permissions are not inherited by child dirs: any newly created subdirectory would still come out as 755 anyway.

Try running your Spark application as the same user that runs the Spark executors.
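The point about permissions not being inherited can be checked with a small plain-Python sketch, no Spark involved (the umask value 022 here is an assumed typical default, not something from the original post):

```python
import os
import stat
import tempfile

# Sketch of the permission point above: a child directory does NOT
# inherit its parent's 777 mode -- it gets the requested default mode
# masked by the process umask (assumed 022 here).
old_umask = os.umask(0o022)
try:
    parent = tempfile.mkdtemp()
    os.chmod(parent, 0o777)   # parent is now rwxrwxrwx
    child = os.path.join(parent, "child")
    os.mkdir(child)           # default requested mode is 0o777
    mode = stat.S_IMODE(os.stat(child).st_mode)
    print(oct(mode))          # 0o755: masked by the umask, not inherited
finally:
    os.umask(old_umask)
```

This is why the _temporary subdirectories created by the Hadoop output committer end up 755 even under a 777 parent.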

Copestone answered 14/3, 2016 at 12:21 Comment(2)
Most probable answer. I just set up my application to use HDFS/S3 as a workaround and did not run into the permission issues; if I get the chance I'll backtrack and confirm this. – Aksoyn
How do you run the Spark executors under the same user? – Manon

I suggest temporarily changing the permissions to 777 and seeing whether it works at that point. There have been bugs/issues with respect to permissions on the local file system. If that still does not work, let us know whether anything changed or you got precisely the same result.

Cow answered 3/3, 2016 at 18:9 Comment(3)
I tried that as well, before the 755; the result is unfortunately the same. – Aksoyn
Please show us the output of ls -lrta /opt/folder/tmp/file.json/_temporary/0/_temporary – Cow
Sorry for the delay; I added the update. That's as far as it goes: it cannot mkdir past 0/, it would seem. – Aksoyn

I had the same problem, and my issue was resolved by using the full HDFS path:

Error

Caused by: java.io.IOException: Mkdirs failed to create file:/QA/Gajendra/SparkAutomation/Source/_temporary/0/_temporary/attempt_20180616221100_0002_m_000000_0 (exists=false, cwd=file:/home/gajendra/LiClipse Workspace/SpakAggAutomation)

Solution

Use the full HDFS path, hdfs://localhost:54310/<filePath>:

hdfs://localhost:54310/QA/Gajendra/SparkAutomation

Henrik answered 16/6, 2018 at 17:21 Comment(0)

Could it be SELinux/AppArmor playing a trick on you? Check with ls -Z and the system logs.

Constructivism answered 10/3, 2016 at 7:24 Comment(0)

Giving the full path works for me. Example:

 file:/Users/yourname/Documents/electric-chargepoint-2017-data
Unexperienced answered 28/4, 2022 at 20:44 Comment(1)
This works, but the path has to be one that does not exist yet, or an exception will be thrown. I got around this by using a timestamp to name the folder. – Stockholm
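The timestamp trick mentioned in the comment above can be sketched in plain Python (the base path is illustrative, not from the post): build a fresh output directory name per run so the save never finds an existing directory at the target path.

```python
from datetime import datetime

# Hypothetical sketch: append a run timestamp so each save targets a
# directory that does not exist yet.
base = "file:/tmp/spark-output"
out_path = f"{base}-{datetime.now():%Y%m%d-%H%M%S}"
print(out_path)  # e.g. file:/tmp/spark-output-20240101-120000
```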

I've been experiencing the same issue. In my setup there is no HDFS, and Spark is running in standalone mode. I haven't been able to save Spark dataframes to an NFS share using the native Spark methods: the process runs as a local user and tries to write to that user's home folder, and even after creating a subfolder with 777 permissions I cannot write to it.

The workaround is to convert the dataframe with toPandas() and then write it out with to_csv(). This magically works.

Bandoline answered 1/12, 2017 at 10:5 Comment(0)

I had the same issue. I also did not want to write to HDFS, but to a local memory share.

After some research, I found that in my case the reason was that several nodes were executing the job, but some of them had no access to the directory where the data was to be written.

Thus the solution is to make the directory available to all nodes; then it works.

Gantline answered 10/10, 2018 at 8:23 Comment(1)
How do you do that? – Laudianism

We need to run the application in local mode:

 val spark = SparkSession
      .builder()
      .config("spark.master", "local")
      .appName("applicationName")
      .getOrCreate()
Zerla answered 7/4, 2021 at 7:41 Comment(0)

If the transformed data fits into the master node's memory, convert the PySpark dataframe to a pandas dataframe and then save the pandas dataframe to a file.

Chewink answered 13/4, 2023 at 18:2 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Pinna

A solution that works for small files is to convert the PySpark dataframe into a pandas dataframe and then export it as a CSV with df.toPandas().to_csv("myDf.csv"), but this is not feasible for large data because of the memory limitations and performance overhead of pandas dataframes.

Stockholm answered 28/11, 2023 at 6:39 Comment(0)

This is a tricky one, but simple to solve. One must configure the job.local.dir variable to point to the working directory. The following code works fine for writing a CSV file:

import time

from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    df = spark.read.format('com.databricks.spark.xml').options(rowTag='HistoricalTextData').load(
        '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/train/')
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")).groupBy("TimeStamp").pivot("TagName").sum(
        "TagValue").na.fill(0)
    df.repartition(1).write.csv(
        path="/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/result/",
        mode="overwrite",
        header=True,
        sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('job.local.dir', '/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
        .config('spark.driver.memory','64g') \
        .config('spark.debug.maxToStringFields','200') \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()

    print('Session created')

    try:
        xmlConvert(spark)

    finally:
        spark.stop()
Hiddenite answered 31/5, 2019 at 16:31 Comment(0)
