The pyarrow documentation repeatedly uses a custom UUID type as its extension-type example, defined like this:
import pyarrow as pa

class UuidType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        return UuidType, ()
Coincidentally(?), there is a UUID logical type in Parquet. I would like to take a pyarrow table with UUIDs, write it to Parquet, and have the column annotated with the UUID logical type.
This was asked and answered before in "How to specify logical types when writing Parquet files from PyArrow?", in the context of fixed-length integers. The solution there was to update the version number, after which the problem solved itself. I would like to know how to specify the logical type manually, but I would also accept an answer that just makes it work for UUIDs.
Using the above class, I can try to build a table and write it.
from uuid import uuid4
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
uuid_bytes = [u.bytes for u in [uuid4() for _ in range(100)]]
df = pd.DataFrame({'uuid_column': uuid_bytes})
schema = pa.schema([('uuid_column', UuidType())])
table = pa.Table.from_pandas(df, schema=schema)
with pq.ParquetWriter('uuid_data.parquet', schema) as writer:
    writer.write_table(table)
Printing the file schema with ParquetFile or inspecting it with parquet-tools shows that I have a file where the column is fixed_len_byte_array(16), but it does not have the UUID logical type.
UuidType in the pyarrow documentation has changed due to a security issue. – Cigarillo