Submitting a PySpark job with multiple Python files and one configuration file
I have 4 Python scripts and one .txt configuration file. Of the 4 Python files, one is the entry point for the Spark application and imports functions from the other Python files. The configuration file, however, is read by another Python file that is not the entry point. I want to write the spark-submit command, but I am not sure how to pass multiple files along with the configuration file when the configuration file is not a Python file but a text or ini file.

For demonstration, the 4 Python files are: file1.py, file2.py, file3.py, file4.py

1 configuration file : conf.txt

file1.py: this file creates the Spark session and calls all the other Python files. file3.py: this file reads conf.txt.

I want to provide all these files with spark-submit but I am not sure about the command. One solution I have identified is:

spark-submit --master local --driver-memory 2g --executor-memory 2g --py-files s3_path/file2.py,s3_path/file3.py,s3_path/file4.py s3_path/file1.py

but with the above spark-submit I am not sure how to pass conf.txt.

Manouch answered 24/9, 2020 at 9:3 Comment(0)

You can use --files to provide a comma-separated list of files to be uploaded with the application. Note that every option must come before the application script: anything placed after file1.py is passed to your application as arguments instead of being interpreted by spark-submit.


For instance,

spark-submit \
    --master local \
    --driver-memory 2g \
    --executor-memory 2g \
    --py-files file2.py,file3.py,file4.py \
    --files conf.txt \
    file1.py
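
Files shipped with --files end up in the working directory of the driver and executor containers (on YARN), so file3.py can usually open conf.txt by name. A minimal sketch of how file3.py could resolve the path, assuming a helper called read_conf (the SparkFiles fallback is an assumption about your runtime layout):

import os
from pyspark import SparkFiles

def read_conf(filename="conf.txt"):
    # Files passed via --files land in the container's working directory on YARN;
    # SparkFiles.get resolves the staged copy when the plain name is not present.
    path = filename if os.path.exists(filename) else SparkFiles.get(filename)
    with open(path) as f:
        return f.read()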

If your files are located in an S3 bucket, you can try the following:

spark-submit \
    --master local \
    --driver-memory 2g \
    --executor-memory 2g \
    --py-files s3://path/to/file2.py,s3://path/to/file3.py,s3://path/to/file4.py \
    --files s3://path/to/conf.txt \
    s3://path/to/file1.py
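
The same ordering rule applies when the job is submitted as an EMR step: --py-files and --files must come before the main script, and --py-files takes a single comma-separated argument. A sketch of the step definition for boto3's add_job_flow_steps, assuming the s3_path variable from your comment:

# Hypothetical EMR step; s3_path is assumed to be defined as in the comments.
step = {
    'Name': 'pyspark-job',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': [
            'spark-submit',
            '--deploy-mode', 'cluster',
            '--master', 'yarn',
            '--py-files', f'{s3_path}/file2.py,{s3_path}/file3.py,{s3_path}/file4.py',
            '--files', f'{s3_path}/conf.txt',
            f'{s3_path}/file1.py',  # the application script comes last
        ],
    },
}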
Crapulous answered 24/9, 2020 at 9:13 Comment(3)
I am running spark-submit on AWS EMR: spark-submit --master local f'{s3_path}/file1.py', '--py-files', f'{s3_path}/file2.py', f'{s3_path}/file3.py', f'{s3_path}/file4.py', '--files', f'{s3_path}/config.txt' but this command is not working and gives a "module not found" error for file2, since I have imported file2 in file1.Manouch
@Manouch I've updated my answer. Let me know if it did the trick for youCrapulous
'Args': ['spark-submit','--deploy-mode', 'cluster','--master', 'yarn','--executor-memory', conf['emr_step_executor_memory'],'--executor-cores', conf['emr_step_executor_cores'],'--conf','spark.yarn.submit.waitAppCompletion=true','--conf','spark.rpc.message.maxSize=1024',f'{s3_path}/file1.py', '--py-files',f'{s3_path}/file2.py',f'{s3_path}/file3.py',f'{s3_path}/file4.py','--files', f'{s3_path}/config.txt' ] : I am running the above command and it does not work; it gives me a "module not found" error.Manouch