For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster, all while handling errors), Dataproc's Workflow Templates API is a better choice than orchestrating these steps yourself. A key advantage is that workflows are fire-and-forget: any clusters created are also deleted on completion.
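For example, instantiating a template is a single call. A minimal sketch from the CLI (my-template and the global region are placeholders):

    # Instantiate an existing workflow template and return immediately;
    # the service creates the managed cluster, runs the jobs, and deletes
    # the cluster when the workflow finishes.
    gcloud dataproc workflow-templates instantiate my-template \
        --region=global \
        --async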
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule it is to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
Assuming your use case is the simple one of running the workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
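For reference, a template along these lines can be built from the CLI (a sketch only; the single-node cluster, zone, and teragen job arguments are assumptions, not my exact template):

    # Create the template, attach a managed cluster, and add a Hadoop job step.
    gcloud dataproc workflow-templates create terasort-example --region=global
    gcloud dataproc workflow-templates set-managed-cluster terasort-example \
        --region=global --cluster-name=terasort --zone=us-central1-f --single-node
    gcloud dataproc workflow-templates add-job hadoop \
        --workflow-template=terasort-example --region=global --step-id=teragen \
        --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        -- teragen 1000 hdfs:///tmp/teragen-input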
I then created a new service account in my project, called [email protected], and gave it the Dataproc Editor role; however, something more restricted with just the dataproc.workflows.instantiate permission would also be sufficient.
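The equivalent CLI steps would look roughly like this (the account name workflow and the project example are placeholders I've chosen to match the URL below):

    # Create the service account and grant it the Dataproc Editor role.
    gcloud iam service-accounts create workflow --display-name="Workflow scheduler"
    gcloud projects add-iam-policy-binding example \
        --member="serviceAccount:workflow@example.iam.gserviceaccount.com" \
        --role="roles/dataproc.editor"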
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console and created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: [email protected]
Scope: (left blank)
You can test it by clicking Run Now.
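The same job can also be created from the CLI. A sketch, where the job name, the cron schedule, and the service account email are assumptions:

    # Nightly at 05:00; POSTs an empty body to the template's instantiate endpoint.
    gcloud scheduler jobs create http terasort-example-nightly \
        --schedule="0 5 * * *" \
        --uri="https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json" \
        --http-method=POST \
        --message-body="{}" \
        --oauth-service-account-email="workflow@example.iam.gserviceaccount.com"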
Note that you can also paste the entire workflow template into the Body as a JSON payload; the last part of the URL would then become workflowTemplates:instantiateInline?alt=json.
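As a rough illustration, an inline body might look like the following (the cluster config and job are assumptions, not the real terasort-example template):

    {
      "id": "terasort-example",
      "placement": {
        "managedCluster": {
          "clusterName": "terasort",
          "config": {
            "gceClusterConfig": {"zoneUri": "us-central1-f"}
          }
        }
      },
      "jobs": [
        {
          "stepId": "teragen",
          "hadoopJob": {
            "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
            "args": ["teragen", "1000", "hdfs:///tmp/teragen-input"]
          }
        }
      ]
    }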
Check out the official docs for other scheduling options.