AWS Glue vs EMR Serverless
A

2

17

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/, a very promising new service.

From my understanding, AWS Glue is a managed service on top of Apache Spark (for the transformation layer). AWS EMR is mostly used for Apache Spark as well. So EMR Serverless (for Apache Spark) looks pretty much similar to AWS Glue.

Right now I have one question in mind: what is the core difference between EMR Serverless and AWS Glue, and when should I choose EMR Serverless over Glue?

Could EMR Serverless even become part of the AWS Glue ecosystem as its transformation layer? Maybe AWS is going to replace the transformation layer in AWS Glue with EMR Serverless, which would make sense: AWS Glue would then play the role of ETL overlay and metastore, with EMR Serverless as the processing layer.

Azaleah answered 12/12, 2021 at 8:10 Comment(5)
Don't you mean the difference between Athena and EMR?Riddle
No, I mean AWS Glue vs EMR Serverless. AWS Glue is a managed service on top of Apache Spark (for the transformation layer). AWS EMR is mostly used for Apache Spark as well. So EMR Serverless (for Apache Spark) looks pretty much similar to AWS Glue. And this is what my question is about.Azaleah
Now I see what is confusing you. Both services may be built on top of similar technology/components (PySpark), but they operate at different levels and serve different use cases. I don't think the services will be merged or replaced. As an analogy, you can compare services like ECS and RDS. You can run a database on ECS with some effort and maintenance, but that's not its purpose or use case.Riddle
@Riddle thanks for your answer, but please carefully read my question. Skip metastore and other Glue features and be focused only on the processing layer.Azaleah
If you talk about "Glue Jobs" you might be more precise about what exactly we are talking about. I see a lot of confusion with Glue Tables, Glue Crawlers etc.Perpend
B
8

I'll give you my two cents about this because I've been wondering the same thing.

Glue

As per the AWS documentation, AWS Glue is "Simple, scalable, and serverless data integration". Glue can be used for a variety of things: as a metadata repository, for automatic schema discovery, for code generation, and for running ETL pipelines to prepare data. Glue is serverless: it provisions and manages the compute resources needed to run your data pipelines, so you don't need to create and manage the infrastructure yourself.

If we focus only on the processing feature and discard the Glue-specific features (schema discovery, code generation, etc.), then EMR Serverless and Glue look almost identical. One of the key advantages of both services is the ability to run Spark applications serverlessly; EMR Serverless can also run Hive.

What advantage will EMR Serverless have over Glue Spark jobs?

To run a Glue job, you must specify either MaxCapacity (for Glue version 1.0 or earlier jobs) or a worker type and the number of workers (for Glue version 2.0 or later jobs). Both options assume, first, that you understand your data and workload well enough to size the job up front, and second, that the workload will be uniform during job execution, i.e., that the provisioned resources will be neither over- nor under-utilized.
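To make the two sizing modes concrete, here is a minimal sketch of the arguments one might pass to boto3's `glue.create_job`. The helper name `glue_job_args`, the job names, the role ARN, the script path, and the worker counts are all hypothetical placeholders. The helper only builds the argument dictionary and never calls AWS, and it follows the split described above (MaxCapacity for pre-2.0 jobs, worker type and count for 2.0 and later).

```python
def glue_job_args(name, role_arn, script_s3_path, glue_version,
                  max_capacity=None, worker_type=None, number_of_workers=None):
    """Build keyword arguments for glue.create_job (illustrative sketch only)."""
    args = {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_s3_path},
        "GlueVersion": glue_version,
    }
    major = int(glue_version.split(".")[0])
    if major >= 2:
        # Glue 2.0+ jobs are sized by worker type and worker count.
        if max_capacity is not None:
            raise ValueError("MaxCapacity is not used for Glue 2.0+ jobs")
        if worker_type is None or number_of_workers is None:
            raise ValueError("Glue 2.0+ jobs need worker_type and number_of_workers")
        args["WorkerType"] = worker_type
        args["NumberOfWorkers"] = number_of_workers
    else:
        # In this sketch, older jobs are sized by a fixed number of DPUs.
        if max_capacity is None:
            raise ValueError("Glue 1.0 or earlier jobs need max_capacity here")
        args["MaxCapacity"] = float(max_capacity)
    return args


# Hypothetical placeholder values, not real resources.
ROLE = "arn:aws:iam::123456789012:role/ExampleGlueRole"
SCRIPT = "s3://example-bucket/scripts/job.py"

legacy = glue_job_args("demo-legacy", ROLE, SCRIPT, "1.0", max_capacity=10)
modern = glue_job_args("demo-modern", ROLE, SCRIPT, "3.0",
                       worker_type="G.1X", number_of_workers=10)

# With credentials configured, the job could then be created with:
#   import boto3
#   boto3.client("glue").create_job(**modern)
```

Either way, the capacity is fixed before the job starts, which is the limitation this answer is pointing at.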

EMR Serverless

EMR Serverless is a new deployment option for AWS EMR. With EMR Serverless, you don't need to configure, optimize, secure, or manage clusters to run applications on frameworks such as Spark and Hive. EMR Serverless helps you avoid over- or under-allocating resources by sizing them at the level of individual job stages.

EMR Serverless automatically identifies the resources needed by jobs, provisions those resources to run the jobs, and releases them when the jobs are completed. In cases where applications require a response within seconds, such as interactive data analysis, the engineer can pre-initialize the necessary resources during application creation. This provides easy initialization, fast job startup, automatic capacity management, and simple cost control.
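As a rough illustration of pre-initialized capacity, the sketch below builds the arguments one might pass to boto3's `emr-serverless` `create_application` call. The helper name `emr_serverless_app_args`, the application name, the release label, and the worker sizes are assumptions chosen for the example. Nothing here contacts AWS.

```python
def emr_serverless_app_args(name, release_label, driver_count, executor_count,
                            idle_timeout_minutes=15):
    """Build keyword arguments for create_application (illustrative sketch only)."""
    # Assumed worker size; real values depend on the workload.
    worker = {"cpu": "4vCPU", "memory": "16GB"}
    return {
        "name": name,
        "releaseLabel": release_label,
        "type": "SPARK",
        # Workers kept warm so that jobs can start within seconds.
        "initialCapacity": {
            "DRIVER": {"workerCount": driver_count,
                       "workerConfiguration": dict(worker)},
            "EXECUTOR": {"workerCount": executor_count,
                         "workerConfiguration": dict(worker)},
        },
        # Release the warm workers after the application has been idle.
        "autoStopConfiguration": {"enabled": True,
                                  "idleTimeoutMinutes": idle_timeout_minutes},
    }


# Hypothetical application name and release label.
app_args = emr_serverless_app_args("demo-interactive", "emr-6.6.0",
                                   driver_count=1, executor_count=4)

# With credentials configured, the application could then be created with:
#   import boto3
#   boto3.client("emr-serverless").create_application(**app_args)
```

Leaving out `initialCapacity` gives the fully on-demand behaviour described above, at the cost of a slower start for each job.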

More info: https://luminousmen.com/post/emr-serverless-a-400level-guide

Boardman answered 24/5, 2022 at 5:17 Comment(1)
Great article too, for further details. From the article: "Keep in mind that AWS Glue is more expensive than EMR Serverless for similar compute resources."Spacious
R
-1

AWS Glue is a data integration and ETL service. It is a completely different service from EMR, which is an analytics service.

AWS Glue can be used as a metadata store (table schemas) for EMR and to run integration jobs that prepare data (e.g. for EMR). These are data integration jobs and workflows. At least, the intention is to keep the jobs limited but simpler to manage.

EMR is much more (and very different). In theory, EMR could just as well run the Python data integration jobs in batch on top of a Spark cluster, but you can run any job inside a Spark cluster. EMR is more of an analytics and processing tool. It is not limited to Spark processing of Python batch jobs; you can use different frameworks. Though the EMR Serverless docs mention only Spark and Hive queries, you have much more control over the processing job.

If anything compares to the EMR service, it's Athena, which is something like a serverless EMR with Spark and Presto, running on its own network.

Riddle answered 12/12, 2021 at 9:23 Comment(6)
I do not agree with this answer. AWS Glue is a data integration service, correct, but one of its key benefits is the serverless Spark service (or Python Shell jobs). The ETL component of AWS Glue, together with its unified data catalog, is one of its key selling points. With EMR Serverless, the ETL part fits exactly the same bill. What benefit will EMR Serverless give over Glue Spark jobs?Itacolumite
@Itacolumite EMR Serverless gives more run-time options (Hive queries, Java jobs, Presto, ...), sizing options, ... Glue jobs should be limited to ETL (though in theory you may write anything in them). I wouldn't use Glue jobs to create a response for a map-reduce result set. Yes, the technology core may be the same or similar, but the use case is different. If you don't agree with or like the answer, feel free to write a better one.Riddle
I don't have an answer yet, as this is a question I am still searching for an answer to. You are right that EMR has more runtime options, but what I am not clear about with Serverless is whether it targets ETL as a service or is something more like an autoscaling EMR cluster that automatically scales up and down with your workload. If it is the latter, it makes sense. If it is the former, it is still more or less moving towards what Glue does.Itacolumite
@Itacolumite EMR (serverless or not) is supposed to be an analytics tool (it technically can execute an ETL task, but that's not its purpose). So yes, I see it as an auto-scaling transient EMR. The serverless option cannot support persistent tasks, so streaming analytics or HBase cannot be supported.Riddle
auto-scale transient EMR and not supporting persistent tasks -> Reminds me of Glue. Also, why cannot streaming analytics be managed? Having continuous streaming events sent as Spark Streaming tasks to the serverless EMR server would actually make it a good candidate for streaming jobs in my opinion.Itacolumite
AWS Glue is a managed Spark cluster service. So is Amazon EMR Serverless (at least at the moment when this comment was written). So Glue and EMR Serverless very much compare to each other and not to Athena.Burgle
