I have a CSV file "test.csv" that I'm trying to copy to every node in the cluster.
I have a 4-node Apache Spark 1.5.2 standalone cluster. There are 4 workers, and one node also acts as the master/driver in addition to being a worker.
If I run:
$SPARK_HOME/bin/pyspark --files=./test.csv
or, from within the REPL, execute sc.addFile('file://' + '/local/path/to/test.csv'),
I see Spark log the following:
16/05/05 15:26:08 INFO Utils: Copying /local/path/to/test.csv to /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv
16/05/05 15:26:08 INFO SparkContext: Added file file:/local/path/to/test.csv at http://192.168.1.4:39578/files/test.csv with timestamp 1462461968158
In a separate window on the master/driver node, I can easily locate the file with ls, i.e. ls -al /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv.
However, if I log into the workers, there is no file at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b/userFiles-a4cb1723-e118-4f0b-9f26-04be39e5e28d/test.csv,
and not even a folder at /tmp/spark-5dd7fc83-a3ef-4965-95ba-1b62955fb35b.
Meanwhile, the Apache Spark web interface shows a job running and cores allocated on all nodes, and no warnings or errors appear in the console.
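One way to see where the file actually lands on each machine, rather than ls-ing the path copied from the driver's log, is to ask SparkFiles from inside a task. This is a minimal sketch run in the same pyspark REPL; the hostname lookup is only there to label the output, and it assumes enough partitions that every worker runs at least one task:

import socket
from pyspark import SparkFiles

def where_are_my_files(_):
    # Executed on an executor: report which host we are on and which
    # local directory this executor uses for files added via
    # --files / sc.addFile.
    yield (socket.gethostname(), SparkFiles.getRootDirectory())

# Spread a trivial job over many partitions so every worker is likely
# to run at least one task, then deduplicate the (host, dir) pairs.
sc.parallelize(range(100), 100).mapPartitions(where_are_my_files).distinct().collect()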
Copying /local/path/to/test.csv (...)
happened only on the driver. Each worker then stores the file at a different location, based on its own configuration, and resolves the file name to that location. – Artema
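In other words, the executor-side path should not be hard-coded from the driver's log; inside a task, SparkFiles.get('test.csv') resolves the bare file name to wherever that particular worker keeps its copy. A minimal sketch of reading the distributed file from within tasks, assuming test.csv was added as above:

from pyspark import SparkFiles

def read_header(_):
    # Resolved on the executor, so this points at the node-local copy
    # of test.csv, not the driver's /tmp/spark-.../userFiles-... path.
    with open(SparkFiles.get('test.csv')) as f:
        yield f.readline().strip()

# Each task opens its own worker's copy, regardless of where it runs.
sc.parallelize(range(4), 4).mapPartitions(read_header).collect()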