Why are "sc.addFile" and "spark-submit --files" not distributing a local file to all workers?
I have a CSV file "test.csv" that I'm trying to have copied to all nodes on the cluster.

I have a 4-node Apache Spark 1.5.2 standalone cluster. There are 4 workers, where one node also acts as the master/driver in addition to being a worker.

If I run:

$SPARK_HOME/bin/pyspark --files=./test.csv

or, from within the REPL, execute sc.addFile('file://' + '/local/path/to/test.csv'),

I see Spark log the following:

16/05/05 15:26:08 INFO Utils: Copying /local/path/to/test.csv to /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
16/05/05 15:26:08 INFO SparkContext: Added file file:/local/path/to/test.csv at http://192.168.1.4:39578/files/test.csv with timestamp 1462461968158

In a separate shell on the master/driver node, I can easily locate the file with ls, i.e. (ls -al /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv).

However, if I log into the workers, there is no file at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv, and not even a folder at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b.

Meanwhile, the Apache Spark web interface shows a running job with cores allocated on all nodes, and no warnings or errors appear in the console.

Clamworm answered 5/5, 2016 at 15:57 Comment(2)
I believe that each worker manages user files independently. The log line Copying /local/path/to/test.csv (...) happened only on the driver. Each worker then stores the file in a different location based on its own configuration, and resolves the file name to that location.Artema
Ah, I thought it used a deterministic folder structure everywhere, thanksClamworm
As Daniel commented, each worker manages its files differently. To access an added file from within a task, use SparkFiles.get(filename), which resolves the worker-local copy. To see which directory your files are staged in, print the output of SparkFiles.getRootDirectory() (formerly SparkFiles.getDirectory).

Gerdes answered 5/5, 2016 at 16:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.