Spark job submission from Airflow via a batch POST to Livy, and tracking the job

I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts, and Spark jobs.

For the Spark jobs in particular, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should just run spark-submit.

Also, what is the best way to track a Spark job from Airflow once I have submitted it?

Bertilla asked 17/1, 2019 at 3:36 Comment(0)

My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities (a minimal submission sketch follows this list):

  • Specifying remote master IP: Requires modifying global configurations / environment variables
  • Using SSHOperator: SSH connection might break
  • Using EmrAddStepsOperator: Dependent on EMR
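
For reference, here is a minimal sketch of such a batch submission using Python's requests library against Livy's POST /batches endpoint. The Livy host, JAR path, and main class below are placeholders, not values from the question:

    # Minimal sketch: submit an application JAR through Livy's batch API.
    # The Livy URL, JAR path, and main class are hypothetical placeholders.
    import json
    import requests

    LIVY_URL = "http://livy-host:8998"  # assumed Livy server address

    payload = {
        "file": "hdfs:///jobs/my-spark-app.jar",  # hypothetical JAR location
        "className": "com.example.MySparkJob",    # hypothetical main class
        "args": ["--date", "2019-01-17"],
    }
    resp = requests.post(
        LIVY_URL + "/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    batch = resp.json()
    print(f"Submitted batch id={batch['id']}, state={batch['state']}")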

Regarding tracking

  • Livy only reports state, not progress (% completion of stages)
  • If you're OK with that, you can poll the Livy server via its REST API and keep printing the logs to the console; those will appear in the task logs in the WebUI (View Logs). A rough polling sketch follows this list.
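
A rough polling sketch, assuming the same hypothetical Livy host as above. You could wrap this in an Airflow PythonOperator callable so the printed lines land in the task log:

    # Rough sketch: poll GET /batches/{id}/state until a terminal state,
    # printing new driver log lines so they appear in the Airflow task log.
    import time
    import requests

    LIVY_URL = "http://livy-host:8998"  # assumed Livy server address

    def track_livy_batch(batch_id, poll_seconds=30):
        log_from = 0
        while True:
            state = requests.get(
                f"{LIVY_URL}/batches/{batch_id}/state"
            ).json()["state"]
            # Fetch only the log lines we have not printed yet.
            log = requests.get(
                f"{LIVY_URL}/batches/{batch_id}/log",
                params={"from": log_from, "size": 100},
            ).json()
            for line in log.get("log", []):
                print(line)
            log_from += len(log.get("log", []))
            if state in ("success", "dead", "killed"):
                if state != "success":
                    raise RuntimeError(
                        f"Livy batch {batch_id} ended in state {state}"
                    )
                return
            time.sleep(poll_seconds)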

Other considerations

  • Livy doesn't support reusing a SparkSession across POST /batches requests
  • If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests instead (see the sketch after this list)
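
To illustrate the difference, here is a sketch of the session flow: a single POST /sessions creates a long-lived PySpark session, and every statement submitted to it runs against that session's SparkSession. The host and the statement code are placeholders:

    # Sketch: create a PySpark session once, then run statements that all
    # share the session's SparkSession. Host and code are placeholders.
    import json
    import time
    import requests

    LIVY_URL = "http://livy-host:8998"  # assumed Livy server address
    HEADERS = {"Content-Type": "application/json"}

    # Create the session and wait until it is idle (ready for statements).
    session = requests.post(
        f"{LIVY_URL}/sessions",
        data=json.dumps({"kind": "pyspark"}),
        headers=HEADERS,
    ).json()
    session_id = session["id"]
    while requests.get(
        f"{LIVY_URL}/sessions/{session_id}"
    ).json()["state"] != "idle":
        time.sleep(5)

    # Each statement reuses the same SparkSession ("spark" is predefined
    # inside Livy PySpark sessions).
    requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=json.dumps({"code": "spark.range(100).count()"}),
        headers=HEADERS,
    )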

Enteritis answered 17/1, 2019 at 5:29 Comment(1)
Thank you @Enteritis. My source code is in Scala and I want to run it as an application using a JAR. It looks like Livy is the better option, and I will submit using batches with the POST method. Tracking progress is problematic, but I think tracking status should be good enough for now. – Bertilla
