If you want a bit more flexibility, i.e. the ability to run files, modules, or even a script specified on the command line, you can use something like the following launcher script:
launcher.py
import runpy
import sys
from argparse import ArgumentParser


def split_passthrough_args():
    """Split sys.argv at the first '--' into (launcher args, child args)."""
    args = sys.argv[1:]
    try:
        sep = args.index('--')
    except ValueError:
        # No separator: everything belongs to the launcher.
        return args, []
    return args[:sep], args[sep + 1:]


def main():
    parser = ArgumentParser(description='Launch a Python module, file or script')
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument('-m', type=str, help='Module to run', dest='module')
    source.add_argument('-f', type=str, help='File to run', dest='file')
    source.add_argument('-c', type=str, help='Script to run', dest='script')
    # Never parsed (everything after '--' is split off first); listed only for --help.
    parser.add_argument('--', nargs='*', help='Arguments', dest='arg')

    self_args, child_args = split_passthrough_args()
    args = parser.parse_args(self_args)

    # Make the child code see only its own arguments.
    sys.argv = [sys.argv[0], *child_args]

    if args.file:
        runpy.run_path(args.file, run_name='__main__')
    elif args.module:
        # For a package, runpy runs its __main__ submodule, just like `python -m`.
        runpy.run_module(args.module, run_name='__main__')
    else:
        # Uses the built-in exec instead of the private runpy._run_code helper.
        exec(compile(args.script, '<string>', 'exec'), {'__name__': '__main__'})


if __name__ == "__main__":
    main()
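To see what the `--` separator does in isolation, here is a minimal sketch of the same splitting logic as a pure function. The name `split_on_separator` is hypothetical; it takes the argument list explicitly instead of reading `sys.argv`, which makes it easy to try out in a REPL.

```python
def split_on_separator(args, sep="--"):
    """Split args at the first `sep` into (launcher args, child args)."""
    try:
        idx = args.index(sep)
    except ValueError:
        # No separator present: nothing is passed through to the child.
        return list(args), []
    return args[:idx], args[idx + 1:]


launcher_args, child_args = split_on_separator(
    ["-m", "mypackage.mymodule", "--", "World"]
)
print(launcher_args)  # ['-m', 'mypackage.mymodule']
print(child_args)     # ['World']
```

Only the first `--` is treated as the separator, so any later `--` is handed to the child unchanged.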
It tries to emulate the Python interpreter's behavior, so when you have a package with the following module hierarchy:
mypackage
    mymodule
        __init__.py
        __main__.py
where __main__.py contains the following:
import sys

if __name__ == "__main__":
    print(f"Hello {sys.argv[1]}!")
which you built and packaged as mypackage.whl, you can run it with:
spark-submit --py-files mypackage.whl launcher.py -m mypackage.mymodule -- World
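If you want to check the module-running behavior without a Spark cluster, the following sketch recreates the layout above in a temporary directory and runs it through `runpy`, which resolves a package name to its `__main__` submodule the same way `python -m` does. The helper name `run_package_locally` and the temporary-directory setup are assumptions made for illustration, not part of the launcher.

```python
import contextlib
import importlib
import io
import runpy
import sys
import tempfile
from pathlib import Path

MAIN_SOURCE = (
    "import sys\n"
    "if __name__ == '__main__':\n"
    "    print(f'Hello {sys.argv[1]}!')\n"
)


def run_package_locally(name):
    """Build a throwaway mypackage/mymodule layout and run it like `-m`."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "mypackage" / "mymodule"
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "__main__.py").write_text(MAIN_SOURCE)

        old_argv, old_path = sys.argv, list(sys.path)
        sys.path.insert(0, tmp)
        sys.argv = ["launcher.py", name]
        importlib.invalidate_caches()
        out = io.StringIO()
        try:
            with contextlib.redirect_stdout(out):
                # Passing the package name is enough; runpy finds __main__.py.
                runpy.run_module("mypackage.mymodule", run_name="__main__")
        finally:
            sys.argv = old_argv
            sys.path[:] = old_path
            # Forget the temporary package so repeated runs start clean.
            for mod in [m for m in sys.modules
                        if m == "mypackage" or m.startswith("mypackage.")]:
                del sys.modules[mod]
        return out.getvalue()


print(run_package_locally("World"), end="")  # Hello World!
```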
Alternatively, if the package is preinstalled on the driver at /my/path/mypackage, you can run the file directly:
spark-submit launcher.py -f /my/path/mypackage/mymodule/__main__.py -- World
You could even submit a script:
spark-submit launcher.py -c "import sys; print(f'Hello {sys.argv[1]}')" -- World
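The inline-script mode can also be tried locally. The sketch below runs a source string with the built-in `exec`, setting `__name__` to `"__main__"` and swapping in the child's `sys.argv` for the duration of the call. The helper name `run_inline` is hypothetical, and using `"launcher.py"` as `sys.argv[0]` is an assumption that mirrors how the script would be invoked.

```python
import contextlib
import io
import sys


def run_inline(source, argv):
    """Execute `source` as __main__ with `argv` as its arguments; return stdout."""
    old_argv = sys.argv
    sys.argv = ["launcher.py", *argv]
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(compile(source, "<string>", "exec"), {"__name__": "__main__"})
    finally:
        # Always restore the caller's argv, even if the script raises.
        sys.argv = old_argv
    return out.getvalue()


print(run_inline("import sys; print(f'Hello {sys.argv[1]}')", ["World"]), end="")
# Hello World
```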