Why are "sc.addFile" and "spark-submit --files" not distributing a local file to all workers?
I have a CSV file "test.csv" that I'm trying to have copied to all nodes on the cluster.

I have a 4-node Apache Spark 1.5.2 standalone cluster. There are 4 workers, where one node also acts as the master/driver in addition to being a worker.

If I run:

$SPARK_HOME/bin/pyspark --files=./test.csv

or, from within the REPL, execute sc.addFile('file://' + '/local/path/to/test.csv'),

I see Spark log the following:

16/05/05 15:26:08 INFO Utils: Copying /local/path/to/test.csv to /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
16/05/05 15:26:08 INFO SparkContext: Added file file:/local/path/to/test.csv at http://192.168.1.4:39578/files/test.csv with timestamp 1462461968158

In a separate shell on the master/driver node, I can easily locate the file with ls, i.e. (ls -al /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv).

However, if I log into the workers, there is no file at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv, and not even a folder at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b.

Meanwhile, the Apache Spark web interface shows a running job with cores allocated on all nodes, and no warnings or errors appear in the console.

Clamworm answered 5/5, 2016 at 15:57 Comment(2)
I believe that each worker manages user files independently. The log line Copying /local/path/to/test.csv (...) happened only on the driver. Each worker then stores the file in a different location based on its own configuration, and resolves the file name to that location.Artema
Ah, I thought it used a deterministic folder structure everywhere, thanksClamworm
As Daniel commented, each worker manages its files differently. To access an added file from within a task, use SparkFiles.get(filename), which resolves the worker-local copy. To see which directory your files are staged in, print the output of SparkFiles.getRootDirectory() (formerly SparkFiles.getDirectory).

Gerdes answered 5/5, 2016 at 16:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.