Specify a SerDe serialization lib with AWS Glue Crawler
Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. quoted fields containing commas get split).


I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.

I've tried making my own CSV classifier, but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?

Obovoid answered 14/8, 2019 at 16:4 Comment(4)
Can you confirm whether your data has double quotes in it? If yes, the crawler should populate the table with OpenCSVSerde when a custom CSV classifier is used. Can you share the custom CSV classifier config that you used? In my case the crawler populated the table with OpenCSVSerde. (Grandchild)
@Prabhakar The data does have double quotes for some rows, but not all; the ones with newlines in them are double-quoted. Do you think that editing the data so that all lines are double-quoted is a solution? The custom classifier was just created with the Glue web interface; there don't seem to be many customization options there. (Obovoid)
@RhysJones a.k.a. Luigi Did you find a solution for this, other than updating the table definition either manually or programmatically? (Powers)
@Powers Not a solution, but we turned off the Glue crawlers from auto-running, since schema changes are rare. If you find a solution, let me know! (Obovoid)
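
For reference, the custom CSV classifier discussed in these comments can also be created programmatically rather than through the web interface. A minimal sketch with boto3; the classifier name and settings are illustrative, since the thread doesn't show the actual config used:

```python
# Sketch of a custom CSV classifier that tells Glue fields may be
# double-quoted. All names here are illustrative, not from the thread.
csv_classifier = {
    "Name": "quoted-csv",         # hypothetical classifier name
    "Delimiter": ",",
    "QuoteSymbol": '"',           # treat double quotes as field quoting
    "ContainsHeader": "PRESENT",  # first row is a header
}

# With boto3 (not executed here):
# import boto3
# glue = boto3.client("glue")
# glue.create_classifier(CsvClassifier=csv_classifier)
# ...then reference it when creating the crawler:
# glue.create_crawler(..., Classifiers=["quoted-csv"], ...)
```

As the answers below note, even with a matching classifier the crawler may still write LazySimpleSerDe into the table definition.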

You can't specify the SerDe in the Glue Crawler at this time, but here is a workaround:

  1. Create a Glue Crawler with the following configuration:

     Enable 'Add new columns only' - this adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.

     Enable 'Update all new and existing partitions with metadata from the table' - with this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table, and any changes to these properties in the table are propagated to its partitions.

  2. Run the crawler to create the table. It will be created with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  3. Re-run the crawler.

  4. If a new partition is added on the crawler re-run, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  5. You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.
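
To confirm step 5 held (i.e. the SerDe didn't reset on a later crawler run), the table definition can be inspected via the GetTable API. A small sketch; the boto3 call is left commented since it needs AWS credentials, and the database/table names are placeholders:

```python
def serde_lib(table):
    """Return the SerDe serialization library recorded in a Glue table
    definition (the dict GetTable returns under the "Table" key)."""
    return (table.get("StorageDescriptor", {})
                 .get("SerdeInfo", {})
                 .get("SerializationLibrary"))

# Trimmed example of a GetTable response body:
table = {"StorageDescriptor": {"SerdeInfo": {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"}}}
print(serde_lib(table))  # org.apache.hadoop.hive.serde2.OpenCSVSerde

# Against a real catalog ("mydb"/"mytable" are placeholders):
# import boto3
# glue = boto3.client("glue")
# t = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
# assert serde_lib(t) == "org.apache.hadoop.hive.serde2.OpenCSVSerde"
```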

Cure answered 13/8, 2020 at 15:52 Comment(6)
Thank you! In my case I also had to delete all table fields between steps 2 and 3 for it to detect the fields properly, for some reason. (Repetition)
I ran into this problem: the source has double quotes around all text in a CSV file. The crawler gives me an appropriate table, but queries from both Athena and Redshift show the double quotes in strings. The workaround solves the problem: I can edit the table and set the SerDe type, and the queries stop showing the double quotes. But I need to make this a property of the Glue Crawler, since the table is dynamically created and all of my code is deployed via Terraform. Does Glue have a solution yet so that it will create tables with the correct formatting (i.e. drop the double quotes)? (Mask)
As of 2021/06, the option to specify the serialization lib at the crawler level does not appear to exist. (Cure)
This workaround works for me (I didn't have to delete table fields as mentioned by @crea1). Some more notes: in step 2, don't forget to add "separatorChar" under "Serde parameters" if you are using a delimiter other than ",". The purpose of step 3 is to update all existing partitions so that they all follow the SerDe setup in the table definition. As of 2021-11-06, the option to specify the serialization lib at the crawler level still does not appear to exist. (Passed)
Two and a half years later, the option still does not exist, even in Glue 3. The workaround isn't suitable for automation: I can't edit the dozens of crawlers manually in production. (Leu)
@Jérémy That's exactly why I decided to stop using crawlers. It's basically an unmaintainable, inconsistent black box. (Passed)

AWS documentation suggests programmatically modifying the table by using the UpdateTable API. That is still a wonky workaround for a Glue crawler using a CSV classifier. To add insult to injury, the crawler also appears to overwrite the table's SerDe serialization lib back to "LazySimpleSerDe" when it is rerun.
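
A sketch of that UpdateTable approach with boto3: fetch the table, swap the SerDe, and write the definition back. The list of read-only fields to strip and the database/table names are assumptions on my part, not from the answer:

```python
OPEN_CSV = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

def with_open_csv_serde(table):
    """Turn a GetTable response ("Table" dict) into an UpdateTable
    TableInput with the SerDe switched to OpenCSVSerde."""
    # GetTable returns read-only fields that UpdateTable rejects; drop them.
    # (This field list is a best-effort assumption; check the Glue API docs.)
    read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                 "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
    table_input = {k: v for k, v in table.items() if k not in read_only}
    serde = (table_input.setdefault("StorageDescriptor", {})
                        .setdefault("SerdeInfo", {}))
    serde["SerializationLibrary"] = OPEN_CSV
    # As noted in the comments above, also set "separatorChar" in the SerDe
    # parameters if your delimiter isn't a comma.
    return table_input

# With boto3 (not executed here; "mydb"/"mytable" are placeholders):
# import boto3
# glue = boto3.client("glue")
# t = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
# glue.update_table(DatabaseName="mydb", TableInput=with_open_csv_serde(t))
```

Since the crawler resets the SerDe on each rerun, one option is to run this after every crawl, e.g. from a scheduled job or one triggered by the crawler finishing.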

Insalubrious answered 4/4, 2022 at 19:46 Comment(0)
  1. Run the crawler.
  2. Go to Tables in the Glue Catalog.
  3. Select the table and click on "Edit table details".
  4. There you'll find the SerDe lib option; change it to org.apache.hadoop.hive.serde2.OpenCSVSerde, then refresh your Athena screen and the new SerDe will be applied.
Cherimoya answered 14/7, 2022 at 17:30 Comment(1)
If you look at the question, that's exactly what I already did. (Obovoid)
