AWS Lambda / AWS Batch workflow

I have written a Lambda that is triggered off an S3 bucket to unzip a zip file and process a text document inside. Due to Lambda's memory limitation, I need to move my processing over to something like AWS Batch. Correct me if I am wrong, but my workflow should look something like this:

[workflow diagram]

I believe I need to write a Lambda to put the location of the S3 object on Amazon SQS, where AWS Batch can read the location and do all the unzipping/data processing where there is more memory.

Here is my current Lambda. It takes in the event triggered by the S3 bucket, checks whether it is a zip file, and then pushes the name of that S3 key to SQS. Should I tell AWS Batch to start reading the queue here in my Lambda? I am totally new to AWS in general and not sure where to go from here.

public class dockerEventHandler implements RequestHandler<S3Event, String> {

private static BigData app = new BigData();
private static DomainOfConstants CONST = new DomainOfConstants();
private static Logger log = Logger.getLogger(dockerEventHandler.class);

private static AmazonSQS SQS;
private static CreateQueueRequest createQueueRequest;
private static Matcher matcher;
private static String srcBucket, srcKey, extension, myQueueUrl;

@Override
public String handleRequest(S3Event s3Event, Context context) 
{
    try {
        // initialize the SQS client before any messages are sent
        sQSConnection();
        for (S3EventNotificationRecord record : s3Event.getRecords())
        {
            srcBucket = record.getS3().getBucket().getName();
            srcKey = record.getS3().getObject().getKey().replace('+', ' ');
            srcKey = URLDecoder.decode(srcKey, "UTF-8");
            matcher = Pattern.compile(".*\\.([^\\.]*)").matcher(srcKey);

            if (!matcher.matches()) 
            {
                log.info(CONST.getNoConnectionMessage() + srcKey);
                return "";
            }
            extension = matcher.group(1).toLowerCase();

            if (!"zip".equals(extension)) 
            {
                log.info("Skipping non-zip file " + srcKey + " with extension " + extension);
                return "";
            }
            log.info("Sending object location to key" + srcBucket + "//" + srcKey);

            //pass in only the reference of where the object is located
            createQue(CONST.getQueueName(), srcKey);
        }
    }
    catch (IOException e)
    {
        log.error(e);           
    }
    return "Ok";
} 

/*
 * Set up the connection to Amazon SQS
 * TODO - Find updated API for the SQS connection to eliminate deprecation
 */
@SuppressWarnings("deprecation")
public static void sQSConnection() {
    app.setAwsCredentials(CONST.getAccessKey(), CONST.getSecretKey());       
    try{
        SQS = new AmazonSQSClient(app.getAwsCredentials()); 
        Region usEast1 = Region.getRegion(Regions.US_EAST_1);
        SQS.setRegion(usEast1);
    } 
    catch(Exception e){
        log.error(e);       
    }
}

//Create new Queue
public static void createQue(String queName, String message){
    createQueueRequest = new CreateQueueRequest(queName);
    myQueueUrl = SQS.createQueue(createQueueRequest).getQueueUrl();
    sendMessage(myQueueUrl,message);
}

//Send reference to the s3 objects location to the queue
public static void sendMessage(String SIMPLE_QUE_URL, String S3KeyName){
    SQS.sendMessage(new SendMessageRequest(SIMPLE_QUE_URL, S3KeyName));
}

//Fire AWS batch to pull from que
private static void initializeBatch(){
    //TODO
}

I have set up Docker and understand Docker images. I believe my Docker image should contain all the code to read the queue, unzip, process, and push the file to RDS, all in one Docker image/container.
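For what it's worth, here is a rough, hypothetical sketch of what that container's entry point could look like, assuming the S3 bucket and key are handed to the container as command-line arguments (the queue-reading or job-parameter wiring is left out), and with processAndStore standing in for the existing processing/RDS code:

import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class UnzipJobMain {

    public static void main(String[] args) throws Exception {
        // Hypothetical wiring: the Batch job passes the bucket and key
        // into the container command as arguments.
        String bucket = args[0];
        String key = args[1];

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        S3Object object = s3.getObject(bucket, key);

        // Stream the zip straight out of S3 so the whole file never has to sit in memory.
        try (ZipInputStream zip = new ZipInputStream(object.getObjectContent())) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    processAndStore(entry.getName(), zip);
                }
            }
        }
    }

    // Placeholder for the existing text-processing / RDS code.
    private static void processAndStore(String name, InputStream contents) {
        // parse the text document and write the results to RDS here
    }
}

Streaming the zip this way avoids ever holding a whole 1 GB+ file in memory, which is the limitation that pushed the work off Lambda in the first place.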

I am looking for someone who has done something similar that they could share to help. Something along the lines of:

Mr. S3: Hey Lambda, I have a file.

Mr. Lambda: Okay S3, I see you. Hey AWS Batch, could you unzip and do stuff to this?

Mr. Batch: Gotcha, Mr. Lambda, I'll take care of that and put it in RDS or some database after.

I have not written the class/Docker image yet, but I have all the code done to process/unzip and kick off to RDS. Lambda is just limited by memory, since some of the files are 1 GB or bigger.

Decennial answered 22/6, 2017 at 15:6 Comment(1)
Maybe a different direction: can anyone show me an example of a Lambda that triggers an EMR Spark job or something off an incoming S3 bucket trigger? – Decennial

Okay, so after looking through the AWS docs on Batch, you don't need an SQS queue. Batch has a concept called a Job Queue, which is similar to an SQS FIFO queue but different in that job queues have priorities, and jobs within them can have dependencies on other jobs. The basic process is:

  1. First, the weird part is setting up IAM roles so that the container agents can talk to the container service, and so AWS Batch is able to launch instances when it needs to (there's also a separate role needed if you use Spot Instances). The details on the permissions required can be found in this doc (PDF) at around page 54.
  2. Now, when that's done, you set up a compute environment. These are EC2 On-Demand or Spot Instances that host your containers. Jobs operate at the container level. The idea is that your compute environment is the maximum resource allocation that your job containers can utilize. Once that limit is hit, your jobs have to wait for resources to be freed up.
  3. Now you create a job queue. This associates jobs with the compute environment you created.
  4. Now you create a job definition. Well, technically you don't have to and can do it through Lambda, but this makes things a bit easier. Your job definition indicates what container resources will be needed for your job (you can of course override this in Lambda as well).
  5. Now that this is all done, you'll want to create a Lambda function. This will be triggered by your S3 bucket event. The function will need the necessary IAM permissions to call SubmitJob against the Batch service (as well as any other permissions). Basically, all the Lambda needs to do is call SubmitJob on AWS Batch. The basic parameters you'll want are the job queue and the job definition. You'll also set the S3 key for the zip as a parameter of the job (a sketch of this call is shown after this list).
  6. Now, when the appropriate S3 event is triggered, it calls the Lambda, which then submits the job to the AWS Batch job queue. Then, assuming the setup is all good, Batch will happily spin up resources to process your job. Note that depending on the EC2 instance size and the container resources allocated, this may take a bit (much longer than prepping a Lambda function).
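As a reference, here is a minimal sketch of the Lambda-side SubmitJob call using the AWS SDK for Java v1 (the aws-java-sdk-batch module). The queue name unzip-job-queue, the job definition name unzip-job-def, and the s3bucket/s3key parameter names are placeholders; the job definition is assumed to already exist and to reference those parameters (e.g. Ref::s3key) in its container command:

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.batch.AWSBatch;
import com.amazonaws.services.batch.AWSBatchClientBuilder;
import com.amazonaws.services.batch.model.SubmitJobRequest;
import com.amazonaws.services.batch.model.SubmitJobResult;

public class BatchJobSubmitter {

    // Placeholder names -- replace with the job queue / job definition you created.
    private static final String JOB_QUEUE = "unzip-job-queue";
    private static final String JOB_DEFINITION = "unzip-job-def";

    private final AWSBatch batch = AWSBatchClientBuilder.defaultClient();

    /**
     * Submits one Batch job per uploaded zip, passing the bucket and key as
     * job parameters the container command can reference (Ref::s3bucket, Ref::s3key).
     */
    public String submitUnzipJob(String srcBucket, String srcKey) {
        Map<String, String> params = new HashMap<>();
        params.put("s3bucket", srcBucket);
        params.put("s3key", srcKey);

        SubmitJobRequest request = new SubmitJobRequest()
                .withJobName("unzip-" + System.currentTimeMillis())
                .withJobQueue(JOB_QUEUE)
                .withJobDefinition(JOB_DEFINITION)
                .withParameters(params);

        SubmitJobResult result = batch.submitJob(request);
        return result.getJobId();
    }
}

In the question's handler, a call like this could take the place of createQue(...), with the Lambda's IAM role granted batch:SubmitJob.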
Destiny answered 27/6, 2017 at 3:24 Comment(3)
Thank you, this clears things up a bit along with the documentation. I'm not sure why, but I had a hard time understanding the documentation; I am also pretty new to AWS and cloud computing in general. I opted not to use SQS because I started to notice that it wasn't the same queue as Batch's job queue. Instead I have my Lambda send each new S3 key (zip) that comes in on a PUT trigger to a database to store all the records/S3 keys. From there I'll just read from the data table in the Batch/Docker container and then delete the row when done. Seems to be the simplest solution. I can't find the Java API to submit a job? – Decennial
@JohnHanewich The Java API call for submit job can be found here. Another option is to use SQS to store the S3 key, then have CloudWatch trigger an alarm which spins up an instance. That instance (with the proper IAM role) can use a user-data script that runs the actual data-processing script (either bake it into a custom AMI, or have the user data pull the script from somewhere). – Destiny
Just say no to SQS – Jankowski
