How to get a list of all columns from a Parquet file using S3 Select?

I have a Parquet file stored in an S3 bucket. I want to get the list of all columns of the Parquet file. I am using S3 Select, but it just gives me a list of all rows without any column headers.

Is there any way to get all column names from this Parquet file without downloading it completely? Since a Parquet file can be very large, I do not want to download the entire file, which is why I am using S3 Select to pick the first few rows using

select * from S3Object LIMIT 10

I tried to fetch column names explicitly by doing

SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'S3Object'

but this does not work, as AWS S3 doesn't support it yet.

Is there any other way to achieve the same?

Slider answered 11/8, 2019 at 16:4 Comment(2)
You can achieve the Parquet select with the AWS S3 Select feature. For Java usage, here is the link: link - Ciracirca
That's what I am doing. This by itself does not return the header if I use a Parquet file. - Slider

I have the same problem but unfortunately, my Google-Kung-Fu was not strong enough this time.

I found the following workaround which I don't really like but it works for me:

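import json
import boto3

s3 = boto3.client('s3')

# Ask S3 Select for a single row, serialized as JSON so the keys carry the column names.
r = s3.select_object_content(Bucket='...your bucket...',
                             Key='...your key...',
                             ExpressionType='SQL',
                             Expression="select s.* from S3Object s limit 1",
                             InputSerialization={'Parquet': {}},
                             OutputSerialization={'JSON': {}})
# The response is an event stream; join the payload of every 'Records' event before parsing.
payload = b''.join(rec['Records']['Payload'] for rec in r['Payload'] if 'Records' in rec)
row = json.loads(payload.decode('utf-8'))

print("Columns: ", list(row.keys()))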

That is, the code requests the first row of the data, extracts the payload, and loads the returned JSON object. The JSON object has the structure {"Column name": "value", ...}, so one only has to extract its keys (last line).
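The parsing step can be tried without touching AWS. The following is a minimal sketch, assuming the event stream yields dicts shaped like the ones the code above iterates over; columns_from_events and fake_events are hypothetical names used only for illustration and are not part of boto3. It joins the payload bytes first, since a record is not guaranteed to arrive in a single event.

```python
import json


def columns_from_events(events):
    """Return the column names of the first record in an S3 Select event stream.

    `events` is any iterable of event dicts shaped like boto3's
    response['Payload'] (hypothetical helper, not part of boto3).
    """
    # A record may be split across several 'Records' events, so join the raw
    # bytes before decoding instead of parsing each chunk on its own.
    payload = b"".join(
        event["Records"]["Payload"] for event in events if "Records" in event
    )
    # JSON output is newline-delimited; the first non-empty line is enough.
    lines = [line for line in payload.decode("utf-8").split("\n") if line.strip()]
    if not lines:
        return []
    return list(json.loads(lines[0]).keys())


# Stand-in for r['Payload'], with one record split across two events.
fake_events = [
    {"Records": {"Payload": b'{"id": 1, "na'}},
    {"Records": {"Payload": b'me": "a"}\n'}},
    {"Stats": {"Details": {}}},
    {"End": {}},
]
print(columns_from_events(fake_events))  # ['id', 'name']
```

With a real response, the same helper would be called as columns_from_events(r['Payload']).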

One additional problem is that this does not return the types of columns. This is something that I could not solve yet.

UPDATE: I observed that in some situations, some column names were not reported correctly. Instead of the real name, something like _18 or _19 was returned. I have no idea how to deal with it.

Pogey answered 4/12, 2019 at 20:35 Comment(0)

You could retrieve just one row by selecting JSON as the output format instead of CSV; you will obtain the column names as keys in the JSON document, and you can then use them to query the object.
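As a rough sketch of the second step, the column names obtained from the JSON row could be turned into an explicit projection. The helper build_projection is a hypothetical name, and the sketch assumes that S3 Select accepts double-quoted column identifiers; it refuses names containing a double quote rather than guessing at an escaping rule.

```python
def build_projection(columns, limit=10):
    """Build an S3 Select expression that selects the given columns by name.

    Hypothetical helper; assumes double-quoted identifiers are accepted.
    """
    if not columns:
        raise ValueError("at least one column name is required")
    quoted = []
    for name in columns:
        # No escaping rule is assumed here, so reject embedded double quotes.
        if '"' in name:
            raise ValueError(f"unsupported column name: {name!r}")
        quoted.append(f's."{name}"')
    return f"select {', '.join(quoted)} from S3Object s limit {int(limit)}"


print(build_projection(["id", "name"], limit=5))
# select s."id", s."name" from S3Object s limit 5
```

The resulting string would be passed as the Expression argument of select_object_content.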

Repartition answered 6/6, 2023 at 10:28 Comment(0)
