I have a Kinesis Firehose configuration in Terraform that reads JSON data from a Kinesis stream, converts it to Parquet using a Glue table schema, and writes the result to S3. Something is wrong with the data format conversion, and I am getting the error below (with some details redacted):
{"attemptsMade":1,"arrivalTimestamp":1624541721545,"lastErrorCode":"DataFormatConversion.InvalidSchema","lastErrorMessage":"The schema is invalid. The specified table has no columns.","attemptEndingTimestamp":1624542026951,"rawData":"xx","sequenceNumber":"xx","subSequenceNumber":null,"dataCatalogTable":{"catalogId":null,"databaseName":"db_name","tableName":"table_name","region":null,"versionId":"LATEST","roleArn":"xx"}}
The Terraform configuration for the Glue table is as follows:
resource "aws_glue_catalog_table" "stream_format_conversion_table" {
name = "${var.resource_prefix}-parquet-conversion-table"
database_name = aws_glue_catalog_database.stream_format_conversion_db.name
table_type = "EXTERNAL_TABLE"
parameters = {
EXTERNAL = "TRUE"
"parquet.compression" = "SNAPPY"
}
storage_descriptor {
location = "s3://${element(split(":", var.bucket_arn), 5)}/"
input_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
ser_de_info {
name = "my-stream"
serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
parameters = {
"serialization.format" = 1
}
}
columns {
name = "metadata"
type = "struct<tenantId:string,env:string,eventType:string,eventTimeStamp:timestamp>"
}
columns {
name = "eventpayload"
type = "struct<operation:string,timestamp:timestamp,user_name:string,user_id:int,user_email:string,batch_id:string,initiator_id:string,initiator_email:string,payload:string>"
}
}
}
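For completeness, a record on the source stream is shaped like this (values invented here purely to illustrate the column types above):

{
  "metadata": {
    "tenantId": "t-123",
    "env": "dev",
    "eventType": "user.created",
    "eventTimeStamp": "2021-06-24 13:35:21"
  },
  "eventpayload": {
    "operation": "CREATE",
    "timestamp": "2021-06-24 13:35:21",
    "user_name": "jdoe",
    "user_id": 42,
    "user_email": "jdoe@example.com",
    "batch_id": "b-1",
    "initiator_id": "i-1",
    "initiator_email": "admin@example.com",
    "payload": "{}"
  }
}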
What needs to change here?