Can AWS Glue crawl Delta Lake table data?
A

6

10

According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether this is also possible outside the Databricks platform. Has anyone done that? Also, is it possible to add Delta Lake metadata using Glue crawlers?

Availability answered 2/10, 2019 at 6:0 Comment(2)
Did you get this resolved yet? - Tressatressia
@Availability sorry, I may be missing context. Why do you need to crawl a Delta table? IMHO crawlers are needed for schemaless formats, while Delta Lake has a built-in Parquet schema plus its history and some advanced schema evolution options. - Acreage
V
3

It is finally possible to use AWS Glue Crawlers to detect and catalog Delta Tables.

Here is a blog post explaining how to do it.

Votyak answered 14/9, 2022 at 14:5 Comment(2)
Yes, you are correct, I came across this article as well. - Availability
Yea, this is really legit! - Tripp
T
4

This is not possible. You can crawl the Delta files on S3 outside the Databricks platform, but you won't find the correct data in the resulting tables.

The documentation gives this warning:

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
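
To illustrate the warning: a Delta reader decides which data files belong to the current version by replaying the JSON commits in _delta_log, where each line is an action such as add or remove. The sketch below is a simplified illustration of that replay, not Delta Lake's actual reader; it ignores checkpoints, and the file names are made up.

```python
import json

def live_files(commits):
    """Replay commits (oldest first) and return the data files of the latest version.

    Each commit is a list of JSON lines, as found in a _delta_log/*.json file.
    Simplified: checkpoints and all other action types are ignored.
    """
    active = set()
    for commit in commits:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active

# Hypothetical log: version 0 writes two files, version 1 rewrites one of them.
commits = [
    ['{"add": {"path": "part-000.parquet"}}', '{"add": {"path": "part-001.parquet"}}'],
    ['{"remove": {"path": "part-001.parquet"}}', '{"add": {"path": "part-002.parquet"}}'],
]

current = live_files(commits)
all_on_disk = {"part-000.parquet", "part-001.parquet", "part-002.parquet"}
stale = all_on_disk - current  # files a plain crawl of the folder would still read
```

A crawler that treats the folder as ordinary Parquet reads all three files, including the stale one, which is how rows end up duplicated or outdated.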

Tressatressia answered 6/9, 2020 at 6:41 Comment(0)
C
2

I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).

I generate a manifest file for each Delta Table using:

from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")  # spark: active SparkSession
deltaTable.generate("symlink_format_manifest")

Then I create the table using the DDL below. The DDL also registers the table in the Glue Data Catalog, so you can then access the data from AWS Glue through the catalog.

CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of the generated manifest
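
If you have many tables, the DDL template can be filled in programmatically. This is only a sketch: the helper name `manifest_table_ddl`, the table name, the columns and the bucket path are all hypothetical, and it does no escaping or validation of identifiers.

```python
def manifest_table_ddl(table, columns, table_path, partitions=()):
    """Render the DDL for an external table that reads a Delta symlink manifest."""
    def cols(pairs):
        return ", ".join(f"{name} {dtype}" for name, dtype in pairs)

    # The manifest lives in a fixed subfolder of the Delta table path.
    location = table_path.rstrip("/") + "/_symlink_format_manifest/"
    lines = [f"CREATE EXTERNAL TABLE {table} ({cols(columns)})"]
    if partitions:
        lines.append(f"PARTITIONED BY ({cols(partitions)})")
    lines += [
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'",
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'",
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'",
        f"LOCATION '{location}'",
    ]
    return "\n".join(lines)

# Hypothetical example table.
ddl = manifest_table_ddl(
    "events",
    [("id", "bigint"), ("payload", "string")],
    "s3://my-bucket/delta/events",
    partitions=[("dt", "string")],
)
print(ddl)
```

The resulting string can be run wherever you would run the DDL by hand. Keep in mind that the manifest has to be regenerated after the Delta table changes, otherwise queries keep reading the old set of files.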
Caste answered 15/3, 2022 at 15:5 Comment(0)
S
0

It would help if you could clarify what you mean by "integrate delta lake with AWS Glue".

At the moment there is no direct Glue API for Delta Lake support. However, you could write custom code using the Delta Lake library to save the output as a Delta Lake table.

To use a crawler to add Delta Lake metadata to the Catalog, here is a workaround. It is not pretty and has two major parts.

1) Get the manifest of the files referenced by the Delta Lake table. You could refer to the Delta Lake source code, or work with the logs in _delta_log, or use a brute-force method such as

import org.apache.spark.sql.functions.input_file_name

// List the distinct data files backing the current version of the table
spark.read.format("delta")
  .load("<path-to-delta-lake>")
  .select(input_file_name())
  .distinct

2) Use the Scala or Python Glue API and the manifest to create or update the table in the Catalog.
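
A rough sketch of step 2 in Python with boto3 could look like the following. It rests on assumptions that are not part of the answer above: that the file list from step 1 has been written out as a symlink-style manifest readable by SymlinkTextInputFormat, and that the database, table, column and bucket names (all hypothetical here) match your setup. `create_table` is a standard boto3 Glue client call; the helper names are mine.

```python
def symlink_table_input(name, columns, manifest_location, partition_keys=()):
    """Build a Glue TableInput dict for a table that reads a symlink manifest."""
    def as_cols(pairs):
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": as_cols(partition_keys),
        "StorageDescriptor": {
            "Columns": as_cols(columns),
            "Location": manifest_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }

def register_table(database, table_input):
    """Create the table in the Glue Data Catalog (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3 installed
    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)

# Hypothetical names; nothing is sent to AWS until register_table is called.
table_input = symlink_table_input(
    "events",
    [("id", "bigint"), ("payload", "string")],
    "s3://my-bucket/delta/events/_symlink_format_manifest/",
)
# register_table("my_database", table_input)
```

For an existing table, the same dict can be passed to `update_table` instead of `create_table`.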

Scurry answered 9/10, 2019 at 20:48 Comment(0)
P
0

AWS Glue crawlers can read Delta table transaction logs and update the metadata in the Glue metastore. Ref: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-delta-lake

But there are a few downsides:

  • It creates a symlink table in the Glue metastore
  • The symlink-based approach doesn't work well with multiple versions of the table, since the manifest file only points to the latest version
  • Nothing in the Glue metadata identifies a given table as Delta, which matters if you have different types of tables in your metastore
  • Any execution engine that accesses the Delta table via manifest files won't use the other auxiliary data in the transaction log, such as column stats
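
For reference, a crawler with a Delta target can be defined through boto3 roughly as follows. Treat this as a sketch rather than a tested setup: the crawler name, role ARN, database and S3 path are hypothetical, and the `DeltaTargets` fields (`DeltaTables`, `WriteManifest`) are used as I understand them from the Glue API reference.

```python
def delta_crawler_config(name, role_arn, database, table_paths):
    """Build keyword arguments for glue.create_crawler with a Delta Lake target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "DeltaTargets": [
                {
                    "DeltaTables": list(table_paths),  # S3 paths of the Delta tables
                    "WriteManifest": True,  # let the crawler write the symlink manifest
                }
            ]
        },
    }

def create_crawler(config):
    """Create the crawler (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the config builder works without boto3 installed
    boto3.client("glue").create_crawler(**config)

# Hypothetical values; nothing is sent to AWS until create_crawler is called.
config = delta_crawler_config(
    "delta-events-crawler",
    "arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    "my_database",
    ["s3://my-bucket/delta/events/"],
)
# create_crawler(config)
```

Once created, the crawler still has to be started (for example with `start_crawler`) or scheduled before any table appears in the catalog.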
Pustulate answered 30/11, 2022 at 5:28 Comment(0)
A
0

Yes, it is possible, but only since recently.

See the AWS blog post below for details on this newly announced capability.

https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/

Adulate answered 21/12, 2022 at 19:48 Comment(0)
