AWS Athena: HIVE_BAD_DATA ERROR: Field type DOUBLE in parquet is incompatible with type defined in table schema
I use AWS Athena to query some data stored in S3, namely partitioned Parquet files written with pyarrow.

I have three columns with string values, one column called "key" with int values, and one column called "result" which has both double and int values.

With those columns, I created a schema like:

create external table (
    key int,
    result double,
    location string,
    vehicle_name string,
    filename string
)

When I queried the table, I would get:

HIVE_BAD_DATA: Field results type INT64 in parquet is incompatible with type DOUBLE defined in table schema

So I modified the schema, changing the result column's datatype to INT.

Then I queried the table again and got:

HIVE_BAD_DATA: Field results type DOUBLE in parquet is incompatible with type INT defined in table schema

I've looked around to try to understand why this might happen but found no solution.

Any suggestion is much appreciated.

Leonardo answered 22/5, 2020 at 6:09 Comment(2)
Did you try defining them as string and casting them to a number at query time? – Piecrust
You can check the schema of the parquet files and use the same data types when you create the Athena table. – Unpolled
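A minimal sketch of that schema check, assuming the files were written with pyarrow; the bucket, prefix, and the column name "result" are placeholders taken from the question:

# Inspect the declared type of the "result" column in every Parquet
# file under an S3 prefix, reading only the file footers.
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
for path in fs.glob("my-bucket/my-prefix/**/*.parquet"):
    with fs.open(path, "rb") as f:
        print(path, pq.read_schema(f).field("result").type)

Any file that prints int64 where the others print double is a file Athena will refuse to read against a DOUBLE column.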

It sounds to me like you have some files where the column is typed as double and some where it is typed as int. When you type the table column as double, Athena will eventually read a file where the corresponding column is int and throw this error; the reverse happens when you type the table column as int.

Athena doesn't do type coercion as far as I can tell, but even if it did, the types are not compatible: a DOUBLE column in Athena can't represent all possible values of a Parquet INT64 column, and an INT column in Athena can't represent a floating point number (and a BIGINT column is required in Athena for a Parquet INT64).

The solution is to make sure all your files have the same schema. You probably need to be explicit in the code that produces the files about which schema to use (e.g. make the result column always DOUBLE).
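For instance, with pyarrow (which the question uses to write the files) you can pass an explicit schema instead of letting types be inferred per batch. A minimal sketch, with the column names taken from the question and the row data and output path as hypothetical placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

# Fix the schema up front so "result" is always written as double,
# even for batches where every value happens to be a whole number.
schema = pa.schema([
    ("key", pa.int32()),
    ("result", pa.float64()),
    ("location", pa.string()),
    ("vehicle_name", pa.string()),
    ("filename", pa.string()),
])

table = pa.table(
    {"key": [1], "result": [7.0], "location": ["NYC"],
     "vehicle_name": ["truck_a"], "filename": ["batch_001"]},
    schema=schema,
)
pq.write_table(table, "part-00000.parquet")

Writing every file through the same fixed schema guarantees Athena never sees a mix of INT64 and DOUBLE for the same column.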

Youngling answered 27/5, 2020 at 9:32 Comment(2)
I wonder if there's a way to salvage data that has been saved in multiple formats like this. Maybe some way on the query side of things to make it more permissive, or to consider all versions of the Glue table before saying a file can't be read. Any information about how to still get use from data like this would be great. – Notate
@Notate you would have to read it file by file using Spark and cast it to the type you want. It's a bit of programming, but it's not that bad. You can use the S3 SDK to list objects, load them, and transform them one by one. – Explanation
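A minimal sketch of that file-by-file repair, assuming PySpark plus boto3 for the listing; the bucket name and prefixes are hypothetical placeholders:

import boto3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
s3 = boto3.client("s3")

# List every Parquet object under the prefix, then repair them one by one.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".parquet"):
            continue
        df = spark.read.parquet(f"s3a://my-bucket/{key}")
        # Cast the inconsistently typed column to the type the table expects.
        df = df.withColumn("result", col("result").cast("double"))
        df.write.mode("overwrite").parquet(f"s3a://my-bucket/fixed/{key}")

Once every file is rewritten with a consistent double column, point the Athena table (or a new one) at the fixed prefix.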
