I have a Spark job that runs via a Kubernetes pod. Until now I have been using a YAML file to run my jobs manually. Now I want to schedule my Spark jobs via Airflow. This is the first time I am using Airflow, and I am unable to figure out how I can add my YAML file to Airflow. From what I have read, I can schedule my jobs via a DAG in Airflow. A DAG example is this:
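For context, my manual setup boils down to `kubectl apply -f` on a file roughly like this (simplified sketch; the image, class name, and jar path below are placeholders, not my real values):

```yaml
# spark-job.yaml -- simplified sketch of my current manual submission;
# image, main class, and jar path are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: my-spark-job
spec:
  restartPolicy: Never
  containers:
    - name: spark
      image: my-registry/spark:2.4.4
      command: ["/opt/spark/bin/spark-submit"]
      args:
        - "--master"
        - "k8s://https://kubernetes.default.svc"
        - "--deploy-mode"
        - "cluster"
        - "--class"
        - "com.example.MyApp"
        - "local:///opt/spark/jars/my-app.jar"
```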
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {
    'owner': 'test',
    'start_date': datetime(2019, 4, 3),
    'retries': 2,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test_dag', default_args=args, catchup=False)

def print_text1():
    print("hello-world1")

def print_text():
    print('Hello-World2')

t1 = PythonOperator(task_id='multitask1', python_callable=print_text1, dag=dag)
t2 = PythonOperator(task_id='multitask2', python_callable=print_text, dag=dag)
t1 >> t2
In this case the above methods will get executed one after the other once I trigger the DAG. Now, if I want to run a spark-submit job instead, what should I do? I am using Spark 2.4.4.