How to define Parquet and/or Arrow schemas?

2

5

Is there a language-agnostic way of representing a Parquet or Arrow schema, similar to what Avro offers? For example, an Avro schema might look like this:

{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "first", "type": "string" },
       { "name": "last", "type": "string" }
     ]
} 

Is there something similar for Parquet and/or Arrow?

(I’d write this as 2 separate questions but Stack Overflow flags them as dupes, sorry).

Osmunda answered 4/1 at 23:48 Comment(0)
5

If you are looking for a canonical text representation of the schemas that you can pass to a library's parse function to get back the corresponding object, then there is no such universal representation.

For more context, Arrow schemas are serialized using FlatBuffers. While FlatBuffers does have a plain-text representation, to my knowledge no Arrow implementation can parse it or serialize to it out of the box (there was an effort to support this in C++/Python a while ago, but I believe it stalled). For testing purposes, Arrow has also defined a JSON representation of schemas, but this is not considered canonical.

Parquet's schemas are serialized as Thrift in a depth-first traversal of the schema. As with Arrow, I'm not aware of bindings that will take a Parquet schema in Thrift's JSON protocol and convert it to library objects in the target language. Also, because the representation is a depth-first traversal of the schema, it would be fairly unfriendly for humans to read or write.

Seafowl answered 5/1 at 17:58 Comment(0)
4

If it's OK for the schema to be represented in a non-human-readable binary format, then yes it is possible to do this with the Arrow IPC encapsulated message format. For example:

In Python, create and serialize an Arrow schema and write it to a file:

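import pyarrow as pa

# Define the schema as a list of (name, type) pairs.
schema = pa.schema([
    ('field1', pa.int32()),
    ('field2', pa.float64()),
    ('field3', pa.string())
])

# serialize() returns a buffer holding the schema as an Arrow IPC message.
with open('schema.arrow', 'wb') as file:
    file.write(schema.serialize())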

In R, read the file and deserialize the schema:

library(arrow)

buf <- ReadableFile$create("schema.arrow")$Read()
msg <- read_message(buf)
sch <- read_schema(msg)

print(sch)
## Schema
## field1: int32
## field2: double
## field3: string
Parvenu answered 5/1 at 21:56 Comment(0)
