What is the difference between AWS Glue ETL Job and AWS EMR?

If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both an AWS Glue ETL job and AWS EMR steps could be used. How is AWS Glue different from AWS EMR, and which is the better solution in this case?

Circassia answered 7/6, 2020 at 20:19 Comment(0)

Most of the differences are already listed, so I'll focus on use-case-specific points.

When to choose AWS Glue

  1. The data is huge but structured, i.e. it is in table form and of a known format (CSV, Parquet, ORC, JSON).
  2. Lineage is required: if you need the data lineage graph while developing your ETL job, prefer building it with Glue-native libraries (a minimal job sketch follows this list).
  3. The developers don't need to tweak performance parameters such as the number of executors, per-executor memory, and so on.
  4. You don't want the overhead of managing a large cluster and want to pay only for what you use.
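
For illustration, here is a minimal Glue PySpark job sketch of the kind item 2 refers to. The catalog database, table name, and S3 output path are hypothetical, and the script only runs inside a Glue job environment:

```python
# A minimal Glue PySpark job sketch. The catalog database, table name,
# and S3 output path are hypothetical; this only runs inside a Glue job.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read through the Glue Data Catalog rather than raw S3 paths, so the
# schema is managed and lineage is tracked by Glue.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",      # hypothetical catalog database
    table_name="raw_orders",  # hypothetical crawled table
)

# Rename and cast columns declaratively with a Glue-native transform.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```

Note that there is no cluster configuration anywhere in the script; that is exactly the overhead points 3 and 4 say Glue removes.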

When to use EMR

  1. The data is huge but semi-structured or unstructured, so you can't benefit from the Glue Data Catalog.
  2. You only care about the outputs; lineage is not required.
  3. You need to give each executor more memory, depending on the type and requirements of your job (see the sketch after this list).
  4. You can manage the cluster easily, or you have many jobs that can run concurrently on the cluster, saving you money.
  5. For structured data, use EMR when you want more Hadoop capabilities, like Hive or Presto, for further analytics.
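
Since the question mentions EMR steps, here is a hedged sketch of submitting a Spark step with explicit executor sizing via boto3. The cluster ID, script path, and the sizing numbers are placeholders, not recommendations:

```python
# Hedged sketch: submit a Spark step to an already-running EMR cluster
# with explicit executor sizing. The cluster ID, script path, and the
# numbers themselves are placeholders, not recommendations.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "etl-1tb-csv",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    # The tuning knobs Glue abstracts away:
                    "--num-executors", "50",
                    "--executor-memory", "8g",
                    "--executor-cores", "4",
                    "s3://my-bucket/scripts/etl_job.py",  # hypothetical script
                ],
            },
        }
    ],
)
print(response["StepIds"])
```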

So it depends on what your use case is. Both are great services.

Showroom answered 8/6, 2020 at 16:1 Comment(0)

Glue allows you to submit ETL scripts directly in PySpark/Python/Scala, without the need to manage an EMR cluster. All infrastructure setup and tear-down is managed for you.

There are also a few other managed components, like Crawlers and the Glue Data Catalog, which make it easier to work with your data.
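
As a hedged illustration of that Crawler/Catalog workflow, the sketch below defines and starts a crawler over S3 data via boto3; the crawler name, IAM role, database, and S3 path are all hypothetical:

```python
# Hedged sketch: define and start a Glue crawler so the schema of CSVs
# in S3 lands in the Data Catalog. The crawler name, IAM role, database,
# and S3 path are all hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="orders-csv-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
)
glue.start_crawler(Name="orders-csv-crawler")
```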

You could use either for your use case. Glue would be faster to get going, but you may not have the flexibility you get with EMR.

Anabaptist answered 7/6, 2020 at 20:29 Comment(0)

Glue uses EMR under the hood. This is evident when you SSH into the driver of your Glue dev endpoint.

Now, since Glue is a managed Spark environment, or say a managed EMR environment, it comes with reduced flexibility. The type of workers you can choose is limited. The number of language libraries you can use in your Spark code is limited. Glue did not support packages like pandas and numpy until recently. Apps like Presto can't be integrated with Glue, although Athena is a good alternative to a separate Presto installation.
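
On recent Glue versions, extra Python packages such as pandas can be requested when the job starts via the --additional-python-modules job argument. A hedged sketch, with a hypothetical job name:

```python
# Hedged sketch: on Glue 2.0+, extra Python packages such as pandas can
# be requested at job start with --additional-python-modules. The job
# name is hypothetical; pin versions you have actually tested.
import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="orders-etl",  # hypothetical Glue job
    Arguments={"--additional-python-modules": "pandas==1.5.3,pyarrow==12.0.1"},
)
```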

The main issue, however, is that Glue jobs have a cold-start time of anywhere from 1 minute to 15 minutes.

EMR is a good choice for exploratory data analysis, but for a production environment with CI/CD, Glue seems to be the better choice.

EDIT: Glue jobs no longer have a cold-start wait time.

Letourneau answered 9/6, 2020 at 14:2 Comment(0)

From the AWS Glue FAQ:

AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.

Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.

Source: https://aws.amazon.com/glue/faqs/

Enlargement answered 7/6, 2020 at 20:30 Comment(2)
Which is better for my use case? Say I need to do ETL, and some SQL queries as well?Circassia
That's really hard to say without knowing your specific use case. But if you don't want to take care of managing your own cluster and don't need to use specific custom tools like Hive, AWS Glue is a great service. In addition to running predefined or custom ETL jobs, you can also use the Glue Crawler to derive the schema from your data and query it with SQL using Amazon Athena.Enlargement
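
To make that last comment concrete, here is a hedged sketch of querying a crawler-derived table with SQL through Athena; the database, table, and results location are hypothetical:

```python
# Hedged sketch: run SQL over a crawler-derived table through Athena.
# The database, table, and results location are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

athena.start_query_execution(
    QueryString=(
        "SELECT order_id, SUM(amount) AS total "
        "FROM raw_orders GROUP BY order_id"
    ),
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```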

AWS Glue is an ETL service from AWS. AWS Glue can generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target.

AWS EMR is a service where you can process large amounts of data; it's a supporting big data platform. It supports Hadoop, Spark, Flink, Presto, Hive, etc. You can spin up EC2 instances with the above-listed software and build a similar ecosystem.

In your case, you want to process 1 TB of data. If you want to do computations on the same data, you can use EMR, and if you want to run analytics on the transformed data, use Glue.

Casias answered 7/6, 2020 at 20:31 Comment(1)
The first part of your answer seems to confuse services. The AWS Data Migration Service (DMS) is used for that purpose, not AWS Glue.Enlargement

The following is something that I compiled after working on analytics projects (though a lot of it depends on the use case); generally speaking:

Criteria: Glue vs. EMR

Costs
  Glue: comparatively costlier.
  EMR: much cheaper, due to spot instance functionality; there have been cases of savings of up to 50% over Glue costs, even more depending on the use case.

Orchestration
  Glue: inbuilt (Glue Workflows & Triggers; a scheduling sketch follows this table).
  EMR: through CloudWatch triggers & Step Functions.

Infra work required
  Glue: no infra setup, just select the worker type; however, roles and permissions are needed.
  EMR: identify the type of node needed, set up autoscaling rules, etc.

Cluster resiliency & robustness
  Glue: highly resilient (AWS managed).
  EMR: if spot instances are used, interruptions might occur with a 2-minute notification (the system recovers automatically, but job times might stretch, for example).

Skill sets needed
  Glue: PySpark and intermediate AWS knowledge.
  EMR: DevOps to set up and manage EMR, intermediate knowledge of orchestration via CloudWatch and Step Functions, and PySpark.

Applicable use cases
  Glue is an attractive option when:
  1. You are not worried about costs but need highly resilient infra.
  2. You have batch setups where the job completes in a fixed time.
  3. You have short real-time streaming jobs which need to run for, say, a few hours a day.
  EMR is an attractive option when:
  1. The use case involves volatile clusters, mostly for batch processing (day-minus scenarios), making it a cost-effective solution for batch jobs.
  2. You run 24/7 Spark streaming programs.
  3. You need a Hadoop ecosystem and related tools (like HDFS, Hive, Hue, Impala, etc.).
  4. You need to run Flink programs, etc.
  5. You need control over the infra and its tuning parameters.
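
As a hedged illustration of the inbuilt orchestration row, the sketch below schedules a Glue job with a native trigger via boto3; the trigger name, job name, and schedule are hypothetical:

```python
# Hedged sketch: schedule a Glue job with a native trigger, the kind of
# inbuilt orchestration the table contrasts with Step Functions on EMR.
# The trigger name, job name, and schedule are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="nightly-orders-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",  # 02:00 UTC daily
    Actions=[{"JobName": "orders-etl"}],
    StartOnCreation=True,
)
```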

Also, going back to the OP's use case of processing 1 TB of data: if it's one-time processing, Glue should suffice; if it's a once-daily batch, EMR and Glue will both be good (depending on how the job is tuned, Glue can be an attractive option); if the job runs multiple times a day, then EMR is the better option (considering the balance of performance and cost).

Tweeter answered 25/11, 2022 at 8:37 Comment(0)

Use EMR for big data and Hadoop clusters, while with Glue you're talking about ETL in general; AWS Glue is a good replacement for SSIS and other ETL tools if your infrastructure is moving to AWS.

AWS EMR:

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyse vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html

AWS Glue:

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data ops tooling for authoring, running jobs, and implementing business workflows.

With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Laverne answered 13/9, 2024 at 6:39 Comment(0)
