How to execute spark-submit on Amazon EMR from a Lambda function?
I want to execute a spark-submit job on an AWS EMR cluster based on a file upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to trigger a spark-submit job on the EMR cluster from the Lambda function.

Most of the answers I found talk about adding a step to the EMR cluster, but I do not know whether a step added that way can run "spark-submit --with args".

Epicurus answered 21/8, 2017 at 11:19 Comment(0)

You can; I had to do the same thing last week!

Using boto3 for Python (other languages will have a similar solution), you can either start a cluster with the step already defined, or attach a step to an already running cluster.

Defining the cluster with the step

import boto3

def lambda_handler(event, context):
    # create an EMR client and launch a new cluster with the Spark step attached
    conn = boto3.client("emr")
    response = conn.run_job_flow(
        Name='ClusterName',
        ServiceRole='EMR_DefaultRole',
        JobFlowRole='EMR_EC2_DefaultRole',
        VisibleToAllUsers=True,
        LogUri='s3n://some-log-uri/elasticmapreduce/',
        ReleaseLabel='emr-5.8.0',
        Instances={
            'InstanceGroups': [
                {
                    'Name': 'Master nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': 'Slave nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'Ec2KeyName': 'key-name',
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False
        },
        Applications=[{
            'Name': 'Spark'
        }],
        Configurations=[{
            "Classification":"spark-env",
            "Properties":{},
            "Configurations":[{
                "Classification":"export",
                "Properties":{
                    "PYSPARK_PYTHON":"python35",
                    "PYSPARK_DRIVER_PYTHON":"python35"
                }
            }]
        }],
        BootstrapActions=[{
            'Name': 'Install',
            'ScriptBootstrapAction': {
                'Path': 's3://path/to/bootstrap.script'
            }
        }],
        Steps=[{
            'Name': 'StepName',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': [
                    "/usr/bin/spark-submit", "--deploy-mode", "cluster",
                    's3://path/to/code.file', '-i', 'input_arg', 
                    '-o', 'output_arg'
                ]
            }
        }],
    )
    return "Started cluster {}".format(cluster_id)

Attaching a step to an already running cluster

As per here

import sys
import time

import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")
    # chooses the first cluster which is Running or Waiting
    # possibly can also choose by name or already have the cluster id
    clusters = conn.list_clusters()
    # choose the correct cluster
    clusters = [c["Id"] for c in clusters["Clusters"] 
                if c["Status"]["State"] in ["RUNNING", "WAITING"]]
    if not clusters:
        sys.stderr.write("No valid clusters\n")
        sys.exit(1)
    # take the first relevant cluster
    cluster_id = clusters[0]
    # code location on your emr master node
    CODE_DIR = "/home/hadoop/code/"

    # spark configuration example
    step_args = ["/usr/bin/spark-submit", "--spark-conf", "your-configuration",
                 CODE_DIR + "your_file.py", '--your-parameters', 'parameters']

    step = {"Name": "what_you_do-" + time.strftime("%Y%m%d-%H:%M"),
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': step_args
            }
        }
    action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "Added step: %s"%(action)
Apathy answered 29/8, 2017 at 9:30 Comment(4)
What is this script-runner.jar if I need to add the s3 path for my code file? - Cayes
The s3n://elasticmapreduce bucket is provided by Amazon. You don't have to do anything besides referencing it. - Apathy
Will this spark-submit action be a synchronous call from the Lambda function, or will it just add the job flow step without actually waiting for it? - Thumbsdown
One quick question: is there any way we can kill a running Spark application from a Lambda function? - Merete

Python code for an AWS Lambda function that runs a Spark jar (the spark-submit equivalent), submitted through the cluster's Livy endpoint:

import json

from botocore.vendored import requests

def lambda_handler(event, context):
    headers = {"content-type": "application/json"}
    # Livy batch endpoint on the EMR master node
    url = 'http://ip-address.ec2.internal:8998/batches'
    payload = {
        'file': 's3://Bucket/Orchestration/SparkCode.jar',
        'jars': ['s3://Bucket/Orchestration/RedshiftJDBC41.jar',
                 's3://Bucket/Orchestration/mysql-connector-java-8.0.12.jar'],
        'className': 'Main Class Name',
        'args': [event.get('rootPath')]
    }
    res = requests.post(url, data=json.dumps(payload), headers=headers, verify=False)
    json_data = json.loads(res.text)
    # Livy responds with the id of the submitted batch
    return json_data.get('id')
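
The same Livy endpoint can also be polled for the outcome of the batch. A minimal sketch, reusing the imports above and assuming the same master-node address (Livy exposes each submitted batch at /batches/{id}):

def get_batch_state(batch_id):
    # 'state' is e.g. starting, running, success or dead
    url = 'http://ip-address.ec2.internal:8998/batches/{}'.format(batch_id)
    res = requests.get(url, headers={"content-type": "application/json"})
    return json.loads(res.text).get('state')
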
Sandal answered 30/10, 2018 at 15:33 Comment(2)
Could you improve the English in the first sentence and format the code? - Burglary
This uses Livy to submit the Spark job. While the job can be run this way, many clusters do not have Livy configured, and therefore this method has its limitations. - Duodenary
