How to read the arrow parquet key value metadata?

When I save a Parquet file from R or Python (using pyarrow), an Arrow schema string is saved in the file's key-value metadata.

How do I read this metadata? Is it Flatbuffer-encoded data? Where is the schema format defined? I can't find it on the Arrow documentation site.

The metadata is a key-value pair that looks like this:

key: "ARROW:schema"

value: "/////5AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAEAAAAyP///wQAAAABAAAAFAAAABAAGAAIAAYABwAMABAAFAAQAAAAAAABBUAAAAA4AAAAEAAAACgAAAAIAAgAAAAEAAgAAAAMAAAACAAMAAgABwA…

It is the result of writing this in R:

df = data.frame(a = factor(c(1, 2)))
arrow::write_parquet(df, "c:/scratch/abc.parquet")
Haematozoon answered 10/5, 2020 at 4:26

The value is a base64-encoded Arrow IPC schema message, which is Flatbuffer data. You can read the schema in Python using the following code:

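import base64
import pyarrow as pa
import pyarrow.parquet as pq

filename = "c:/scratch/abc.parquet"  # the file written in the question

# Read only the Parquet footer; the key-value metadata has bytes keys.
meta = pq.read_metadata(filename)
# The stored value is base64 text, so decode it to the raw IPC message.
decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
# Parse the Flatbuffer-encoded schema message into a pyarrow.Schema.
schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))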
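The "base64-encoded Flatbuffer" claim can be checked by hand on the value from the question, using only the standard library. The sketch below decodes just the first 16 characters of that value (the rest is truncated in the question) and reads the framing of an Arrow IPC encapsulated message: a continuation marker, a metadata length, and then the start of the Flatbuffer itself. The variable names are illustrative, not part of any API.

```python
import base64
import struct

# First 16 base64 characters of the ARROW:schema value shown in the question.
prefix_b64 = "/////5AAAAAQAAAA"
prefix = base64.b64decode(prefix_b64)  # 12 raw bytes

# An encapsulated IPC message starts with two little-endian 32-bit fields:
# the continuation marker 0xFFFFFFFF, then the length of the metadata.
continuation, metadata_size = struct.unpack_from("<Ii", prefix, 0)

# The Flatbuffer payload follows; it begins with a uint32 offset to its root table.
(root_offset,) = struct.unpack_from("<I", prefix, 8)

print(hex(continuation))  # 0xffffffff
print(metadata_size)      # 144
print(root_offset)        # 16
```

The leading `/////` in the base64 text is simply the `0xFFFFFFFF` continuation marker, which is why these stored schema strings tend to start the same way.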
Machicolate answered 11/5, 2020 at 13:7
