How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')
Asked Answered
H

5

43

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?

At the moment, when I run the crawler over this data and then make a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'

My use case is:

  • Partitions represent days
  • Files represent events
  • Each event is a json blob in a single s3 file
  • An event contains a subset of columns (dependent on the type of event)
  • The 'schema' of the entire table is the full set of columns for all the event types (this is correctly put together by Glue crawler)
  • The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
  • This inconsistency causes the error in Athena I think

If I were to manually write a schema I could do this fine as there would just be one table schema, and keys which are missing in the JSON file would be treated as Nulls.

Thanks in advance!

Hymn answered 15/9, 2017 at 13:44 Comment(3)
I have the same issue where Partition schemas are different in column order. Its the same data but he is not smart enough to just apply the table schema correctly. Did you solve the issue?Literatim
No, unfortunately, I didn't solve this one. In terms of column order, you could try writing as newline separated JSON. It's not ideal but I think if you gzip the files it won't make a huge size difference to size. Good luck and let me know if you learn anything!Hymn
Defining several crawlers (one per partition) is not a solution?Catenate
D
69

I had the same issue, solved it by configuring crawler to update table metadata for preexisting partitions:

enter image description here

Deliladelilah answered 7/2, 2018 at 13:16 Comment(2)
I spent 15 minutes typing a question and just before sending it this one came as recommended to check. This saved me hours! Thank youTrapezohedron
Helped me as well. Thanks. I was messing with this issue for one whole day... Didn't realise that we have this under Advance Options.Cynic
D
5

This helped me. Posting the image for others in case the link is lost enter image description here

Darnell answered 4/5, 2018 at 14:42 Comment(0)
E
5

It also fixed my issue! If somebody need to provision This Configuration Crawler with Terraform so here is how I did it:

resource "aws_glue_crawler" "crawler-s3-rawdata" {
  database_name = "my_glue_database"
  name          = "my_crawler"
  role          = "my_iam_role.arn"

  configuration = <<EOF
{
   "Version": 1.0,
   "CrawlerOutput": {
      "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
   }
}
EOF
  s3_target {
    path = "s3://mybucket"
  }
}
Edlun answered 21/6, 2019 at 0:41 Comment(0)
R
0

Despite selecting Update all new and existing partitions with metadata from the table. in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (specifically jsonPath wasn't inherited from the table's properties in my case).

As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, "to drop the partition that is causing the error and recreate it" helped

After dropping the problematic partitions, glue crawler re-created them correctly on the following run

Rolo answered 6/11, 2020 at 18:37 Comment(0)
W
0

If you want to resolve this issue with CDK code, here's the example:

const crawler = new CfnCrawler(scope, name, {
    name: name,
    description: "Glue crawler to fetch CloudWatch metrics data",
    role: role.roleArn,
    targets: {s3Targets: [{path: 's3://' + service.bucket + '/'},],},
    schedule: {scheduleExpression: 'cron(0 * * * ? *)'},
    databaseName: this.databaseName,
    recrawlPolicy: {recrawlBehavior: 'CRAWL_NEW_FOLDERS_ONLY',},
    // Prevent the crawler from changing an existing schema
    // https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent
    configuration: '{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }',
    schemaChangePolicy: {deleteBehavior: 'LOG', updateBehavior: 'LOG'},
});

When you configure the crawler using the API, set the following parameters:

  • Set the UpdateBehavior field in SchemaChangePolicy structure to LOG.
  • Set the Configuration field with a string representation of the following JSON object in the crawler API; for example.

Reference: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent

Waaf answered 26/2 at 9:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.