Specify a SerDe serialization lib with AWS Glue Crawler
Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. quoted fields containing commas get split).


I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.

I've tried making my own CSV classifier, but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?

Obovoid answered 14/8, 2019 at 16:4 Comment(4)
Can you confirm whether your data has double quotes in it? If yes, the crawler should populate the table with OpenCSVSerde when a custom CSV classifier is used. Can you share the custom CSV classifier config that you used? In my case the crawler populated the table with OpenCSVSerde. (Grandchild)
@Prabhakar The data does have double quotes for some rows, but not all; the ones with newlines in them are double-quoted. Do you think that editing the data so that all lines are double-quoted is a solution? The custom classifier was just created with the Glue web interface; there don't seem to be many customization options there. (Obovoid)
@RhysJones a.k.a. Luigi Did you find a solution for this, other than updating the table definition either manually or programmatically? (Powers)
@Powers Not a solution, but we turned off the Glue crawlers from auto-running, since schema changes are rare. If you find a solution, let me know! (Obovoid)
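
For reference, the custom CSV classifier discussed in these comments can also be created programmatically rather than through the web interface. A minimal sketch with boto3; the classifier name and settings are illustrative, since the thread doesn't show the actual config used:

```python
# Sketch of a custom CSV classifier that tells Glue fields may be
# double-quoted. All names here are illustrative, not from the thread.
csv_classifier = {
    "Name": "quoted-csv",         # hypothetical classifier name
    "Delimiter": ",",
    "QuoteSymbol": '"',           # treat double quotes as field quoting
    "ContainsHeader": "PRESENT",  # first row is a header
}

# With boto3 (not executed here):
# import boto3
# glue = boto3.client("glue")
# glue.create_classifier(CsvClassifier=csv_classifier)
# ...then reference it when creating the crawler:
# glue.create_crawler(..., Classifiers=["quoted-csv"], ...)
```

As the answers below note, even with a matching classifier the crawler may still write LazySimpleSerDe into the table definition.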

You can't specify the SerDe in the Glue Crawler at this time, but here is a workaround:

  1. Create a Glue Crawler with the following configuration:

     Enable 'Add new columns only' - this adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.

     Enable 'Update all new and existing partitions with metadata from the table' - with this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table, and any changes to these properties in the table are propagated to its partitions.

  2. Run the crawler to create the table. It will be created with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  3. Re-run the crawler.

  4. If a new partition is added on the crawler re-run, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  5. You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.
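
To confirm step 5 held (i.e. the SerDe didn't reset on a later crawler run), the table definition can be inspected via the GetTable API. A small sketch; the boto3 call is left commented since it needs AWS credentials, and the database/table names are placeholders:

```python
def serde_lib(table):
    """Return the SerDe serialization library recorded in a Glue table
    definition (the dict GetTable returns under the "Table" key)."""
    return (table.get("StorageDescriptor", {})
                 .get("SerdeInfo", {})
                 .get("SerializationLibrary"))

# Trimmed example of a GetTable response body:
table = {"StorageDescriptor": {"SerdeInfo": {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"}}}
print(serde_lib(table))  # org.apache.hadoop.hive.serde2.OpenCSVSerde

# Against a real catalog ("mydb"/"mytable" are placeholders):
# import boto3
# glue = boto3.client("glue")
# t = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
# assert serde_lib(t) == "org.apache.hadoop.hive.serde2.OpenCSVSerde"
```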

Cure answered 13/8, 2020 at 15:52 Comment(6)
Thank you! In my case I also had to delete all table fields between steps 2 and 3 for it to detect the fields properly, for some reason. (Repetition)
I ran into this problem: the source has double quotes around all text in a CSV file. The crawler gives me an appropriate table, but queries from both Athena and Redshift show the double quotes in strings. The workaround solves the problem: I can edit the table and set the SerDe type, and the queries stop showing the double quotes. But I need to make this a property of the Glue Crawler, since the table is dynamically created and all of my code is deployed via Terraform. Does Glue have a solution yet so that it will create tables with the correct formatting (i.e. drop the double quotes)? (Mask)
As of 2021/06, the option to specify the serialization lib at the crawler level does not appear to exist. (Cure)
This workaround works for me (I didn't have to delete table fields as mentioned by @crea1). Some more notes: in step 2, don't forget to add "separatorChar" under "Serde parameters" if you are using a delimiter other than ",". The purpose of step 3 is to update all existing partitions so that they all follow the SerDe setup in the table definition. As of 2021-11-06, the option to specify the serialization lib at the crawler level still does not appear to exist. (Passed)
Two and a half years later, the option still does not exist, even in Glue 3. The workaround isn't suitable for automation: I can't edit the dozens of crawlers manually in production. (Leu)
@Jérémy That's exactly why I decided to stop using crawlers. It's basically an unmaintainable, inconsistent black box. (Passed)

AWS documentation suggests programmatically modifying the table by using the UpdateTable API. That is still a wonky workaround for a Glue crawler using a CSV classifier. To add insult to injury, the crawler also appears to overwrite the table's SerDe serialization lib back to "LazySimpleSerDe" when it is rerun.
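
A sketch of that UpdateTable approach with boto3: fetch the table, swap the SerDe, and write the definition back. The list of read-only fields to strip and the database/table names are assumptions on my part, not from the answer:

```python
OPEN_CSV = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

def with_open_csv_serde(table):
    """Turn a GetTable response ("Table" dict) into an UpdateTable
    TableInput with the SerDe switched to OpenCSVSerde."""
    # GetTable returns read-only fields that UpdateTable rejects; drop them.
    # (This field list is a best-effort assumption; check the Glue API docs.)
    read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                 "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
    table_input = {k: v for k, v in table.items() if k not in read_only}
    serde = (table_input.setdefault("StorageDescriptor", {})
                        .setdefault("SerdeInfo", {}))
    serde["SerializationLibrary"] = OPEN_CSV
    # As noted in the comments above, also set "separatorChar" in the SerDe
    # parameters if your delimiter isn't a comma.
    return table_input

# With boto3 (not executed here; "mydb"/"mytable" are placeholders):
# import boto3
# glue = boto3.client("glue")
# t = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
# glue.update_table(DatabaseName="mydb", TableInput=with_open_csv_serde(t))
```

Since the crawler resets the SerDe on each rerun, one option is to run this after every crawl, e.g. from a scheduled job or one triggered by the crawler finishing.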

Insalubrious answered 4/4, 2022 at 19:46 Comment(0)
  1. Run the crawler.
  2. Go to Tables in the Glue Catalog.
  3. Select the table and click on "Edit table details".
  4. There you'll find the SerDe lib option; change it to org.apache.hadoop.hive.serde2.OpenCSVSerde, then refresh your Athena screen and the new SerDe will be applied.
Cherimoya answered 14/7, 2022 at 17:30 Comment(1)
If you look at the question, that's exactly what I already did. (Obovoid)
