Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.

What I am looking for is the most efficient way to set up cron jobs on a DataProc Cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of cron job on DataProc or some documentation on DataProc exclusively working together with Scheduler.

Thanks in advance!

Clathrate answered 18/11, 2019 at 11:0 Comment(0)

For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster, all while handling errors), Dataproc's Workflow Templates API is a better choice than trying to orchestrate these steps yourself. A key advantage is that Workflows are fire-and-forget, and any clusters created will also be deleted on completion.
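
For reference, here is a minimal sketch of kicking off an existing template with the google-cloud-dataproc Python client; the project id and template name below are placeholders taken from this example, and a template stored in a non-global region would also need a regional api_endpoint:

    # Minimal sketch: instantiate an existing Dataproc workflow template.
    # Requires `pip install google-cloud-dataproc`; the project, region, and
    # template ids below are placeholders, not values from your project.
    from google.cloud import dataproc_v1

    client = dataproc_v1.WorkflowTemplateServiceClient()  # default (global) endpoint
    name = "projects/example/regions/global/workflowTemplates/terasort-example"

    # Fire-and-forget: the returned long-running operation covers cluster
    # creation, job execution, and cluster deletion. Call .result() only if
    # you actually want to block until the workflow finishes.
    operation = client.instantiate_workflow_template(request={"name": name})
    print("Workflow instantiation started:", operation.operation.name)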

If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule it is to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.

Assuming your use case is simply running the workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:

I created a workflow in my project called terasort-example.

I then created a new Service Account in my project, called [email protected], and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.

After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in Developers Console. I created a job as follows:

Target: HTTP

URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json

HTTP Method: POST

Body: {}

Auth Header: OAuth Token

Service Account: [email protected]

Scope: (left blank)

You can test it by clicking Run Now.
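
If you prefer to create the same Scheduler job programmatically, here is a hedged sketch using the google-cloud-scheduler Python client; the project id, location, job name, and service account email are placeholders standing in for your own values, and the cron expression reflects the asker's 7th-of-the-month requirement:

    # Sketch of creating the equivalent Cloud Scheduler job in code.
    # Requires `pip install google-cloud-scheduler`; all ids and emails below
    # are placeholders, not real project values.
    from google.cloud import scheduler_v1

    client = scheduler_v1.CloudSchedulerClient()
    parent = "projects/example/locations/us-central1"

    job = scheduler_v1.Job(
        name=f"{parent}/jobs/terasort-monthly",
        schedule="0 0 7 * *",  # 00:00 on the 7th of every month
        time_zone="Etc/UTC",
        http_target=scheduler_v1.HttpTarget(
            uri=(
                "https://dataproc.googleapis.com/v1/projects/example/regions/"
                "global/workflowTemplates/terasort-example:instantiate?alt=json"
            ),
            http_method=scheduler_v1.HttpMethod.POST,
            body=b"{}",
            oauth_token=scheduler_v1.OAuthToken(
                service_account_email="workflow@example.iam.gserviceaccount.com"
            ),
        ),
    )

    print(client.create_job(request={"parent": parent, "job": job}).name)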

Note that you can also put the entire workflow content in the Body as a JSON payload. The last part of the URL would then become workflowTemplates:instantiateInline?alt=json
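
A hedged sketch of that inline variant with the Python client; the template skeleton below (bucket path, cluster name, default cluster config) is a hypothetical stand-in for your own workflow, not the terasort-example template:

    # Sketch: instantiate a workflow defined inline instead of a stored template.
    # The template contents are a hypothetical skeleton for a single PySpark step.
    from google.cloud import dataproc_v1

    client = dataproc_v1.WorkflowTemplateServiceClient()
    parent = "projects/example/regions/global"

    template = {
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-pip-cluster",  # created, used, then deleted
                "config": {},  # default cluster config; set workers/machine types as needed
            }
        },
        "jobs": [
            {
                "step_id": "point-in-polygon",
                "pyspark_job": {"main_python_file_uri": "gs://your-bucket/pip_job.py"},
            }
        ],
    }

    client.instantiate_inline_workflow_template(
        request={"parent": parent, "template": template}
    )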

Check out this official doc that discusses other scheduling options.

Bushwhack answered 12/12, 2019 at 1:29 Comment(0)

Please see the other answer for a more comprehensive solution.

What you will have to do is publish an event to a Pub/Sub topic from Cloud Scheduler and then have a Cloud Function react to that event.

Here's a complete example of using a Cloud Function to trigger Dataproc: How can I run create Dataproc cluster, run job, delete cluster from Cloud Function
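
For context, here is a minimal sketch of the Cloud Function side of that wiring, assuming a Pub/Sub-triggered background function and a Scheduler message that carries the workflow template's resource name (both assumptions for illustration, not part of the linked example). Instantiating a workflow template keeps the function itself fast, which avoids the execution-time limit discussed in the comments below:

    # Sketch of a background Cloud Function (Python, Pub/Sub trigger) that
    # reacts to the Scheduler message. The payload is assumed to be JSON like
    # {"template": "projects/.../regions/.../workflowTemplates/..."}.
    import base64
    import json

    from google.cloud import dataproc_v1

    def trigger_dataproc(event, context):
        """Entry point for a Pub/Sub-triggered background Cloud Function."""
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

        client = dataproc_v1.WorkflowTemplateServiceClient()
        # Fire-and-forget: Dataproc manages cluster creation, the job, and
        # teardown, so the function returns well before any timeout.
        client.instantiate_workflow_template(request={"name": payload["template"]})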

Bushwhack answered 18/11, 2019 at 12:34 Comment(2)
Thank you! This is extremely helpful. However, there is a limit on execution time for Cloud Functions. The maximum execution time offered is 9 minutes. If the runtime of creating a cluster, running the job on the cluster and then deleting the cluster exceeds 9 minutes, the whole process may potentially fail. The workaround I can think of is to use multiple Cloud Functions, one at each step (create a cluster, run the job, keep a check on job status and lastly delete the cluster once the job is over). Does that make sense?Clathrate
This is why I suggest using a WorkflowTemplate. Once started, the Dataproc API takes care of submitting jobs and deleting the cluster. It also reacts to any errors along the way, so when it finishes, the resources (clusters) are always cleaned up.Bushwhack
