How to define nested array to ingest data and convert?

I am using Firehose and Glue to ingest data and convert JSON to Parquet files in S3.

I got this working with flat JSON (no nesting or arrays), but it fails for a nested JSON array. Here is what I have done:

The JSON structure:

{
    "class_id": "test0001",
    "students": [{
        "student_id": "xxxx",
        "student_name": "AAAABBBCCC",
        "student_gpa": 123
    }]
}

The Glue schema:

  1. class_id : string
  2. students : array ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>

I receive this error:

The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>' but 'ARRAY' is found.

Any suggestion is appreciated.

Minutiae answered 8/11, 2019 at 14:45 Comment(2)
Write a custom classifier for JSON. Check docs.aws.amazon.com/glue/latest/dg/… for details. – Confidante
Any solution, @franco phong? – Kannada

I ran into this because I created the schema manually in the AWS console. The problem is that the help text shown next to the form for entering nested types capitalizes everything, but the Parquet conversion only works with lowercase type definitions.

Despite the example given by AWS, write the type in lowercase:

array<struct<student_id:string,student_name:string,student_gpa:int>>
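
If you generate schemas from a script rather than typing them into the console, a small helper can normalize the type strings first. This is only a minimal sketch: the function names `lowercase_type_keywords` and `build_columns` are made up for illustration, and it assumes that any identifier not followed by a colon is a type keyword rather than a field name.

```python
import re
from typing import Dict, List

# Matches an identifier that is NOT followed by ':'. In a Hive-style type
# string, field names are followed by ':', so whatever matches here is
# assumed to be a type keyword (array, struct, string, int, ...).
_TYPE_KEYWORD = re.compile(r"\b[A-Za-z_]\w*\b(?!\s*:)")


def lowercase_type_keywords(type_string: str) -> str:
    """Lowercase type keywords and leave field names untouched."""
    return _TYPE_KEYWORD.sub(lambda m: m.group(0).lower(), type_string)


def build_columns() -> List[Dict[str, str]]:
    """Column definitions for the class/students example from the question.

    A list in this Name/Type shape is what a Glue table definition uses for
    its columns, e.g. under StorageDescriptor["Columns"] of the TableInput
    passed to boto3's glue create_table. Database and table names are not
    shown here because they depend on your setup.
    """
    students_type = (
        "ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>"
    )
    return [
        {"Name": "class_id", "Type": "string"},
        {"Name": "students", "Type": lowercase_type_keywords(students_type)},
    ]
```

For the type string from the question, `lowercase_type_keywords` returns `array<struct<student_id:string,student_name:string,student_gpa:int>>`, the same definition as above.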
Sawfly answered 17/12, 2019 at 18:28 Comment(1)
I just spent hours banging my head because of this... thanks! – Xavier
