Is AWS Lambda preferred over AWS Glue Job?

Asked 26/8, 2020 at 14:29 Answered 27/9, 2023 at 4:14

In an AWS Glue job, we can write a script and execute it via the job.

In AWS Lambda too, we can write the same script and execute the same logic as in the job above.

So my question is not what the difference between an AWS Glue job and AWS Lambda is. I am trying to understand when an AWS Glue job should be preferred over AWS Lambda, especially when both can do the same job. If both do the same job, then ideally I would just default to AWS Lambda, right?

Please try to understand my query.

Conspicuous answered 26/8, 2020 at 14:29 Comment(2)

Glue is for Spark, not Python. – Corded 26/8, 2020 at 14:33

@Corded Glue also supports Python/pandas/PySpark. – Luellaluelle 1/1, 2022 at 14:36

Additional points:

Per this source, the Lambda FAQ, and the Glue FAQ:

Lambda can use a number of different languages (Node.js, Python, Go, Java, etc.) vs. Glue can only execute jobs using Scala or Python code.

Lambda can execute code from triggers by other services (SQS, Kafka, DynamoDB, Kinesis, CloudWatch, etc.) vs. Glue, which can be triggered by Lambda events, other Glue jobs, manually, or from a schedule.

Lambda runs much faster for smaller tasks, whereas Glue jobs take longer to initialize because they use distributed processing. That said, Glue leverages its parallel processing to run large workloads faster than Lambda. NOTE: Lambda functions are limited to 15 minutes per invocation. For anything longer, you want to use another tool.

Lambda looks to require more complexity/code to integrate with data sources (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.), while Glue can easily integrate with these. However, with the addition of Step Functions, multiple Lambda functions can be written and ordered sequentially to reduce complexity and improve modularity, where each function could integrate with an AWS service (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.).

Glue looks to have a number of additional components, such as Data Catalog which is a central metadata repository to view your data, a flexible scheduler that handles dependency resolution/job monitoring/retries, AWS Glue DataBrew for cleaning and normalizing data with a visual interface, AWS Glue Elastic Views for combining and replicating data across multiple data stores, AWS Glue Schema Registry to validate streaming data schema.
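As a rough illustration of the Step Functions point above, here is a minimal sketch of an Amazon States Language definition that runs three Lambda functions in sequence. The state names, the function ARNs, and the helper `build_etl_state_machine` are hypothetical and only for illustration; the retry settings are arbitrary.

```python
import json


def build_etl_state_machine(extract_arn, transform_arn, load_arn):
    """Return an Amazon States Language definition that runs three Lambdas in order."""
    return {
        "Comment": "Sequential ETL built from three Lambda functions",
        "StartAt": "Extract",
        "States": {
            "Extract": {"Type": "Task", "Resource": extract_arn, "Next": "Transform"},
            "Transform": {
                "Type": "Task",
                "Resource": transform_arn,
                # Retry failed task attempts before failing the whole run.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "Load",
            },
            "Load": {"Type": "Task", "Resource": load_arn, "End": True},
        },
    }


# Hypothetical ARNs, for illustration only.
definition = build_etl_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:extract",
    "arn:aws:lambda:us-east-1:123456789012:function:transform",
    "arn:aws:lambda:us-east-1:123456789012:function:load",
)
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is what you would supply as the state machine definition. Each Lambda stays small and under the 15 minute limit, while the state machine owns ordering and retries.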

There are other examples I am missing, so feel free to comment and I can update.

Vedanta answered 30/6, 2021 at 5:48 Comment(5)

Good list. I would add Step Functions to the list of AWS services that Lambda integrates with as this brings state machine functionality to data processing using Lambda. – Kath 30/6, 2021 at 14:29

Oh wow, I didn't know this integration existed. Pretty cool! So it looks to help with reducing complexity and improving code modularity if customer's existing processes already used lambda functions. @BillWeiner, would you say this helps bridge the gap between lambda and glue? Reading add. documentation here and in terms of ETL functionality, that looks to be the case (aws.amazon.com/step-functions) – Vedanta 30/6, 2021 at 20:29

Absolutely. Step Functions allow for flexibility in overall execution of serverless workflows and enable cost-effective polling processes for Lambda. It is my preferred method for implementing ETL/ELT workflows (data movement orchestration). Glue, while easy to set up, very often breaks down (wrong data types and incorrect expectations of data formats) and is a quagmire to modify. Classic AWS service issue: it solves 70% of the problem easily, but you're SOL if your problem lands in the other 30%. Lambda with Step Functions is easily understandable and flexible enough to meet all needs. – Kath 30/6, 2021 at 21:32

Good to know Bill, thanks for sharing this info! So Lambda can integrate with Redshift and other DBs, it just requires a bit more set up but worth it compared to the complexities of Glue? – Vedanta 30/6, 2021 at 22:7

The complexities that arise with Glue when you need anything outside of "normal". Yes, I'd rather spend some bounded upfront time and have a flexible, extensible solution than start out easy and have to reset late. – Kath 30/6, 2021 at 22:36

Lambda has a maximum execution time of fifteen minutes. It can be used to trigger a Glue job as an event-based activity. For example, when a file lands in S3, an event trigger can invoke a Lambda that starts a Glue job. Glue is a managed service for data processing.

If the data volume is very low, you can probably do the processing in Lambda itself, but if for some reason the process runs beyond fifteen minutes, the data processing will fail.
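A minimal sketch of that trigger pattern, assuming a Python Lambda subscribed to S3 object-created notifications. The job name `my-etl-job` and the argument names `--source_bucket` and `--source_key` are hypothetical; the `glue_client` parameter exists only so the function can be exercised without AWS access.

```python
import urllib.parse

JOB_NAME = "my-etl-job"  # hypothetical Glue job name


def handler(event, context, glue_client=None):
    """Start a Glue job run for every object reported in an S3 event notification."""
    if glue_client is None:
        import boto3  # bundled with the Lambda Python runtime

        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=JOB_NAME,
            # Hypothetical job arguments; the Glue script decides how to read them.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

The Lambda returns as soon as the job run is started, so the long-running work happens in Glue and the fifteen minute limit never comes into play.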

Erogenous answered 26/8, 2020 at 17:35 Comment(0)

Simple: don't think only in terms of serverless execution, where you run a piece of code in the cloud. There is more to it than that.

Here are the differences:

Difference	AWS Lambda	AWS Glue
Execution Duration	15 minutes	48 hours
Use Case	Event-driven, serverless compute	ETL (Extract, Transform, Load) data processing
Programming Languages	Supports multiple languages (e.g., Python, Node.js, Java)	Python or Scala (ETL scripts)
Scaling	Automatic scaling based on demand (invocations)	Horizontal scaling for ETL jobs (Spark distributed processing)
Distributed Processing	No built-in support; custom threading code is possible but not recommended	Out-of-the-box Apache Spark support
Execution Model	Short-lived, event-driven	Long-running batch processing
Cost Model	Based on invocations and duration	Based on Data Processing Units (DPUs) and duration
Integration	Integrates with various AWS services and triggers	Specialized for AWS data sources and data stores
Latency	Low latency for handling events in real-time	Typically higher latency for batch processing
Execution Control	Triggered by events or schedules	Scheduled, event-driven, or on-demand
Complexity	More flexible but requires explicit coding	Simplified ETL tasks with built-in connectors, requires additional setup for custom libraries
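
To make the Cost Model row concrete, here is a rough sketch of the two billing formulas. The function names and all rates are assumptions: the rates are passed in as parameters on purpose, and the numbers in the usage lines are placeholders, not current AWS pricing. The sketch also ignores free tiers, minimum billing durations, and storage or data transfer charges.

```python
def lambda_cost(invocations, avg_seconds, memory_gb, price_per_gb_second, price_per_request):
    """Estimate Lambda cost: billed on invocations plus duration scaled by memory."""
    compute = invocations * avg_seconds * memory_gb * price_per_gb_second
    requests = invocations * price_per_request
    return compute + requests


def glue_cost(dpus, runtime_hours, price_per_dpu_hour):
    """Estimate Glue cost: billed on DPUs multiplied by job duration."""
    return dpus * runtime_hours * price_per_dpu_hour


# Placeholder rates only; look up the current prices for your region.
one_million_small_events = lambda_cost(
    invocations=1_000_000,
    avg_seconds=0.2,
    memory_gb=0.5,
    price_per_gb_second=0.00002,
    price_per_request=0.0000002,
)
one_nightly_batch = glue_cost(dpus=10, runtime_hours=0.5, price_per_dpu_hour=0.5)
```

Which side comes out cheaper depends entirely on the workload shape, which is why a direct comparison needs real numbers for invocation count, duration, and DPU usage.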

Karlsruhe answered 27/9, 2023 at 4:14 Comment(0)

The answer to this can involve some foundational design decisions. What is this job doing? What kind of data are you dealing with? Is there a decision to be made whether the task should be executed in a batch or event oriented paradigm?

Batch

This may be necessary or desirable because the task:

Operates on large monolithic data (e.g., binary).
Relies on the context of multiple records in a dataset, such that they must be loaded into a single job.
Depends on the order of the records.

I feel like just as often I see batch handling chosen by default because "this is the way we've always done it" but breaking from this approach could be worth consideration.

Glue is built for batch operations. With a current maximum execution time of 15 minutes and maximum memory of 10 GB, Lambda has become capable of processing fairly large datasets in a single execution as well. It can be difficult to pin down a direct cost comparison without specifics of the workload. When it comes to development, I feel that Lambda has the edge in tooling to build, test, and deploy.

Event

In the case where your data consists of a set of records, it might behoove you to parse and "stream" them into Lambda. Consider a flow like:

CSV lands in S3.
S3 event triggers Lambda.
Lambda reads and parses CSV into discrete events, submits to another Lambda or publishes to SNS for downstream processing. Concurrent instances of this Lambda can be employed to speed up ingest, where each instance is responsible for certain lines of the S3 object.

This pushes all logic and error handling, as well as the resources required, down to the level of the individual event/record. Mechanisms such as dead-letter queues are often employed for remediation. While the context of a given container persists across invocations (assuming the container has not been idle and torn down), Lambda should generally be considered stateless, such that the processing of an event/record is thought of as occurring within its own scope, outside that of others in the dataset.
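
The flow above can be sketched roughly as follows. This is a simplified sketch under assumptions: the CSV text is taken to be already read from S3, and `publish` is a hypothetical callable standing in for whatever sends one message downstream (SNS, SQS, or another Lambda). The function names are illustrative.

```python
import csv
import io
import json


def csv_to_events(csv_text):
    """Yield one discrete event per CSV row, tagged with its source line number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {"line": reader.line_num, "record": row}


def fan_out(csv_text, publish):
    """Publish each row as its own message; collect failures instead of aborting."""
    sent = 0
    failed = []
    for event in csv_to_events(csv_text):
        try:
            publish(json.dumps(event))
            sent += 1
        except Exception as exc:
            # In a real system this record would go to a dead-letter queue.
            failed.append({"event": event, "error": str(exc)})
    return sent, failed
```

In a deployed function, `publish` could wrap a boto3 SNS client, for example `lambda message: sns.publish(TopicArn=topic_arn, Message=message)`, where `topic_arn` is your own topic. Because each row becomes an independent message, one bad record does not fail the whole file.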

Radioactivity answered 10/2, 2022 at 19:39 Comment(2)

Great analysis, but also... I/O is freaking expensive. We do batches not because we've always done it this way, but because it limits the number of I/Os. If I have a million events to process, I would definitely be able to do it much faster and much cheaper by grabbing a few batches rather than invoking a million Lambda instances. Each Lambda invocation is internally an HTTP request, and that takes time. And then what? You put that data into a DB? A million inserts in parallel will just kill your DB, but well-managed batches will be handled very quickly without any hiccups. – Swath 22/7, 2022 at 9:21

Good points, Kamil. A few thoughts: - HTTP: in the era of SOA, this in and of itself is not a negative. Loose coupling brings inefficiencies but also benefits. - "faster and cheaper": Anecdotally, I see mixed opinions online about this. I may run some tests. Will report back if I do. - DB: This depends on data domains. If I have a domain for "orders," I may not want an ad hoc Glue job writing directly to my orders table. I would likely push ETL'd records to an SQS queue for the designated service/Lambda that owns the table to insert them (yes, optionally in batch off the queue and into the DB). – Radioactivity 14/9, 2022 at 18:22
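
As a rough sketch of the "in batch off the queue and into the DB" idea from the last comment: a Lambda that receives a batch of SQS messages and writes them with a single batched insert. This is only an illustration; `sqlite3` stands in for the real database, the `orders` table and the `order_id`/`amount` fields are hypothetical, and passing `connection` as a parameter is a simplification to keep the example self-contained.

```python
import json
import sqlite3


def handler(event, context, connection):
    """Insert every order in an SQS batch event with one batched statement."""
    rows = []
    for record in event["Records"]:
        order = json.loads(record["body"])  # SQS delivers the message text in "body"
        rows.append((order["order_id"], order["amount"]))

    # The connection context manager commits on success and rolls back on error,
    # so the batch is written as a single transaction.
    with connection:
        connection.executemany(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows
        )
    return {"inserted": len(rows)}
```

With this shape the database sees one transaction per SQS batch rather than one per record, which addresses the concern about a million parallel inserts while keeping the event-driven design.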

Lambda has some limitations, which you can find here. Glue also has limitations, listed here, but it is much more powerful than Lambda. You can compare the limitations and decide when to use Glue.

Avitzur answered 7/8, 2023 at 22:27 Comment(0)
