How to store custom Parquet Dataset metadata with pyarrow? - McMap



Asked 10/9, 2021 at 11:10 Answered 12/10, 2021 at 16:5

python parquet pyarrow


How do I store custom metadata to a ParquetDataset using pyarrow?

For example, if I create a Parquet dataset using Dask:

import dask
dask.datasets.timeseries().to_parquet('temp.parq')

I can then read it using pyarrow:

import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')

However, the method I would use to write metadata for a single Parquet file (outlined in "How to write Parquet metadata with pyarrow?") does not work for a ParquetDataset, since it has no replace_schema_metadata function or similar.

I think I would like to write a custom _common_metadata file, since the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something like:

meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')

Huggins asked 10/9, 2021 at 11:10 Comment(2)

You can convert the Parquet schema to an Arrow schema (dataset.schema.to_arrow_schema()), and pass that to pq.write_metadata? Any metadata set in the Arrow schema will be preserved in the Parquet FileMetaData. – Criollo 10/9, 2021 at 14:14

@Criollo thank you, that was indeed helpful, however, I think my original question was a bit misleading. I have now updated it with a hopefully clearer description of my problem. – Huggins 21/9, 2021 at 7:32


One possibility (that does not directly answer the question) is to use Dask, whose to_parquet accepts a custom_metadata argument:

import dask

# Sample data
df = dask.datasets.timeseries()

df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})

Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.

from pathlib import Path
import pyarrow.parquet as pq

files = Path('test.parq').glob('*')

all([b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files])
# True

Huggins answered 12/10, 2021 at 16:5 Comment(0)
