AWS Glue job dependency in Step Functions

3

8

I've created two Glue jobs (gluejob1, gluejob2).

I want to create a dependency so that gluejob2 runs only after gluejob1 has completed.

To orchestrate this, I created a step function with the definition below:

 {
  "gluejob1": {
    "Type": "Task",
    "Resource": "gluejob1.Arn",
    "Comment": "Glue job1.",
    "Next": "gluejob2"
  },

  "gluejob2": {
    "Type": "Task",
    "Resource": "gluejob2.Arn",
    "Comment": "TGlue job2.",
    "Next": "Gluejob2 Finished Loading"
  },
  "Gluejob2 Finished Loading": {
    "Type": "Pass",
    "Result": "",
    "End": true
  }
}

When I execute this step function, it marks the gluejob1 state as successful the moment the job is triggered and moves on to trigger gluejob2.

Is there a way to run gluejob2 only after gluejob1 completes?

Thearchy answered 16/1, 2019 at 1:55 Comment(0)
15

You can invoke a Glue job from Step Functions synchronously, so that the state waits for the job to complete:

{
  "StartAt": "gluejob1",
  "States": {
    "gluejob1": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "ETLJobName1"
      },
      "Next": "gluejob2"
    },
    "gluejob2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "ETLJobName2"
      },
      "Next": "Gluejob2 Finished Loading"
    },
    "Gluejob2 Finished Loading": {
      "Type": "Pass",
      "Result": "",
      "End": true
    }
  }
}
Heliotype answered 16/1, 2019 at 23:19 Comment(5)
My Glue job step still shows as In Progress even after the Glue job has succeeded, so the execution does not progress to the next step. Any idea why? Does it have something to do with CloudWatch Events permissions? – Ibeam
I noticed that it can take some time for a step function to determine that a job has completed (it seems an internal job status poller runs every X minutes). – Heliotype
The step function was never notified that the Glue job succeeded. I also noticed that the CloudWatch events for Glue jobs are not being triggered; I tried to capture them using rules, but couldn't. I have already granted CloudWatchFullAccess to the service role that the Glue job runs with. What am I missing? – Ibeam
@Saud, you might have already solved it, but I wonder whether the step function role has permission to invoke the Glue job. Also, I noticed that services sometimes fail to communicate back to the step function (this happens a lot with my Lambda functions), so I added retry logic. That handles cases where a service does not report back to the step function or fails for some reason. – Thearchy
You may need to make sure the step function's role has all the necessary IAM permissions on the Glue job, namely glue:StartJobRun, glue:GetJobRun, glue:GetJobRuns, and glue:BatchStopJobRun: docs.aws.amazon.com/step-functions/latest/dg/glue-iam.html – Schluter
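
The retry logic mentioned in the comments above could be expressed as a Retry block on the Glue task. This is a minimal sketch rather than a tested configuration: the interval, attempt count, and backoff rate are assumed values, and each retry starts a new run of the Glue job.

```json
{
  "gluejob1": {
    "Type": "Task",
    "Comment": "Retry values below are illustrative assumptions; tune them for your job.",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
      "JobName": "ETLJobName1"
    },
    "Retry": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 60,
        "MaxAttempts": 2,
        "BackoffRate": 2.0
      }
    ],
    "Next": "gluejob2"
  }
}
```

This fragment replaces the gluejob1 entry inside "States"; the rest of the definition stays the same.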
5
1. First, make sure the IAM role of the step function is allowed to start the Glue job and to check on the job run, so that the step function learns when the job has completed or failed. The policy for the step function role should include this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "glue:GetJobRun",
                "glue:GetJobRuns",
                "glue:BatchStopJobRun"
            ],
            "Resource": "*"
        }
    ]
}
2. Now we can add the "StartJobRun" step to trigger our Glue job.
3. We need to enable "Wait for job to complete", which appears in the visual workflow editor. This ensures that the step function waits for this task to complete before going to the next one. If it is not enabled, the step function just triggers the job and immediately moves on to your next Glue job.
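
In the state machine definition itself, the "wait for job to complete" option appears to correspond to the .sync suffix on the task's resource ARN, as used in the answer above. A sketch of the two variants side by side, with hypothetical state names:

```json
{
  "StartAt": "StartWithoutWaiting",
  "States": {
    "StartWithoutWaiting": {
      "Type": "Task",
      "Comment": "Hypothetical state: succeeds as soon as the job run has been started.",
      "Resource": "arn:aws:states:::glue:startJobRun",
      "Parameters": {
        "JobName": "ETLJobName1"
      },
      "Next": "StartAndWait"
    },
    "StartAndWait": {
      "Type": "Task",
      "Comment": "Hypothetical state: succeeds only after the job run has finished.",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "ETLJobName2"
      },
      "End": true
    }
  }
}
```

For the dependency in the question, both Glue states would use the .sync form.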


I made a step-by-step video walk-through showing how to configure a step function to run Glue jobs synchronously and how to set up the IAM policy for the step function role correctly, so that the step function does not hang as others have experienced:

https://youtu.be/-zm-1egM3hY

Phototonus answered 11/7, 2022 at 21:4 Comment(0)
0

Why not use triggers in Glue to handle dependencies?

Ungovernable answered 16/1, 2019 at 16:25 Comment(1)
Keeping the flow logic outside of the Glue job allows you to reuse the job without triggering other components. It's good practice, as you can reuse a job for multiple data sources (you can pass arguments to a Glue job). – Ellis
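
Passing arguments from the step function, as this comment suggests, might look like the sketch below. The argument name --source_path and the S3 path are hypothetical; Arguments is the StartJobRun parameter that carries job arguments.

```json
{
  "gluejob1": {
    "Type": "Task",
    "Comment": "The --source_path argument and its value are hypothetical examples.",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {
      "JobName": "ETLJobName1",
      "Arguments": {
        "--source_path": "s3://example-bucket/input/"
      }
    },
    "Next": "gluejob2"
  }
}
```

The same job could then be reused in another state with a different value for the argument.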
