spark-submit --py-files gives warning RuntimeWarning: Failed to add file <abc.py> speficied in 'spark.submit.pyFiles' to Python path:

We have a PySpark-based application and we do a spark-submit as shown below. The application works as expected, but we see a strange warning message. Why does it appear, and is there a way to handle it?

Note: the cluster is an Azure HDInsight (HDI) cluster.

spark-submit --master yarn  --deploy-mode cluster --jars file:/<localpath>/* --py-files pyFiles/__init__.py,pyFiles/<abc>.py,pyFiles/<abd>.py  --files files/<env>.properties,files/<config>.json main.py

The warning seen is:

/usr/hdp/current/spark3-client/python/pyspark/context.py:256: RuntimeWarning: Failed to add file [file:///home/sshuser/project/pyFiles/abc.py] speficied in 'spark.submit.pyFiles' to Python path:
  /mnt/resource/hadoop/yarn/local/usercache/sshuser/filecache/929
  warnings.warn(

The above warning appears for every file passed to --py-files, i.e. abc.py, abd.py, etc.

Contractor answered 13/7, 2021 at 8:57

Comment (Autostrada): You ever figure this out?

Since Spark is open source, we can check the code that raises the warning: https://github.com/apache/spark/blob/master/python/pyspark/context.py#L350

There we can see that Spark is effectively executing something like this:

import os
import shutil

from pyspark import SparkFiles
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("MyApp").getOrCreate()
    # Take the first entry from 'spark.submit.pyFiles' as an example
    path = spark.conf.get("spark.submit.pyFiles").split(',')[0]
    (dirname, filename) = os.path.split(path)
    # Destination inside the Spark application's root (scratch) directory
    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
    if not os.path.exists(filepath):
        # This is the copy that fails and triggers the warning
        shutil.copyfile(path, filepath)

Basically, Spark tries to copy each file from its original location into the Spark application's root directory so that it can be found on the Python path. If you run this code yourself, you will see the actual exception that Spark hides.
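For reference, in context.py this copy runs inside a broad try/except whose only action is to emit the warning, which is why the underlying exception never surfaces. A rough paraphrase of the linked code (not a verbatim quote; the helper name add_py_file is mine, in Spark this logic runs inline in a loop over 'spark.submit.pyFiles'):

import os
import shutil
import sys
import warnings

from pyspark import SparkFiles

def add_py_file(path):
    # Paraphrased from pyspark/context.py: any failure during the copy is
    # swallowed and replaced by the RuntimeWarning quoted in the question.
    (dirname, filename) = os.path.split(path)
    try:
        filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
        if not os.path.exists(filepath):
            shutil.copyfile(path, filepath)
    except Exception:
        warnings.warn(
            "Failed to add file [%s] speficied in 'spark.submit.pyFiles' to "  # typo is in the emitted message
            "Python path:\n  %s" % (path, "\n  ".join(sys.path)),
            RuntimeWarning,
        )
    else:
        # On success, the copied path is added to sys.path
        sys.path.insert(1, filepath)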

Example: in my case, we use Livy to submit files stored in Azure Blob Storage to YARN, and the exception raised by shutil is FileNotFoundError: [Errno 2] No such file or directory: 'abfss://[email protected]/myappid/28_02_2023_15_33_56_146/pyFiles/imported_file.py', presumably because shutil cannot handle abfss file paths.
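This is easy to confirm in isolation: shutil only knows the local filesystem, so it treats a remote URI as a literal (non-existent) local path. A minimal sketch, using a made-up abfss path:

import shutil

# shutil.copyfile resolves both arguments against the local filesystem, so a
# remote scheme such as 'abfss://' simply looks like a missing local file.
shutil.copyfile("abfss://container@account/pyFiles/imported_file.py", "/tmp/imported_file.py")
# -> FileNotFoundError: [Errno 2] No such file or directory: 'abfss://container@account/...'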

However, YARN already copies the files from their original location into the Livy filecache (as can be seen in hadoop-yarn-nodemanager.log), and I believe this new location is already on our PYTHONPATH. Spark therefore does not need to copy the files itself, and we can safely ignore the warning.
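If you prefer to silence the warning rather than just ignore it, one option is Python's standard warnings filter. A minimal sketch, assuming the warning fires while the SparkContext is being created inside main.py, so the filter must be installed first:

import warnings

from pyspark.sql import SparkSession

# Suppress only this specific RuntimeWarning; 'message' is matched as a regex
# against the start of the warning text emitted by pyspark/context.py.
warnings.filterwarnings("ignore", message="Failed to add file", category=RuntimeWarning)

spark = SparkSession.builder.appName("MyApp").getOrCreate()

It is still worth verifying once that the modules listed in --py-files actually import correctly before suppressing the warning for good.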

Ordway answered 28/2, 2023 at 16:0
