How to specify logical types when writing Parquet files from PyArrow?

I'm using PyArrow to write Parquet files from some Pandas dataframes in Python.

Is there a way that I can specify the logical types that are written to the parquet file?

For example, writing an np.uint32 column with PyArrow results in an INT64 column in the Parquet file, whereas writing the same column with the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow).

E.g.:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
import numpy as np

df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)

# write parquet file using PyArrow
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet')

# write parquet file using fastparquet
fp.write('fastparquet.parquet', df)

# print schemas of both written files
print('PyArrow:', pq.ParquetFile('pyarrow.parquet').schema)
print('fastparquet:', pq.ParquetFile('fastparquet.parquet').schema)

This outputs:

PyArrow: <pyarrow._parquet.ParquetSchema object at 0x10ecf9048>
id: INT64
name: BYTE_ARRAY UTF8

fastparquet: <pyarrow._parquet.ParquetSchema object at 0x10f322848>
id: INT32 UINT_32
name: BYTE_ARRAY UTF8

I'm having similar issues with other column types, so I'm really looking for a generic way to specify the logical types that are used when writing with PyArrow.
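
(For reference, a hedged sketch: the Arrow-side column types can be pinned with an explicit pa.schema, which controls the in-memory table types, although the logical types written to the file still depend on the writer settings, as the answer below shows.)

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)

# Pin the Arrow-level types explicitly rather than relying on dtype inference.
schema = pa.schema([
    ('id', pa.uint32()),
    ('name', pa.string()),
])

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
print(table.schema)  # fields: id uint32, name string (plus pandas metadata)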

Patina answered 8/3, 2018 at 11:51
Just out of interest, is there a benefit to writing files directly through pyarrow (i.e. rather than using pd.to_parquet)? – Reggiereggis
@Reggiereggis Not that I know of; I just happen to be doing other lower-level things with pyarrow already, so it was easier to do all the writing via that rather than through Pandas. (See the sketch after these comments for the pd.to_parquet route.) – Patina
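
(On the pd.to_parquet point, a hedged sketch: pandas forwards extra keyword arguments to the selected engine, so the pyarrow writer option version from the answer below can be passed through pandas as well; the file name here is just illustrative.)

import numpy as np
import pandas as pd

df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)

# Extra keyword arguments are forwarded to the engine, so the pyarrow
# writer option 'version' can be supplied via pandas as well.
df.to_parquet('via_pandas.parquet', engine='pyarrow', index=False, version='2.0')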

PyArrow writes Parquet format version 1.0 files by default, and format version 2.0 is needed to use the UINT_32 logical type.

The solution is to specify the version when writing the table, i.e.

pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet', version='2.0')

This then results in the expected parquet schema being written.
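
As a quick check (a minimal sketch reusing the file name from the code above; the expected output is based on the fastparquet schema shown in the question), re-reading the file should now report the unsigned logical type:

import pyarrow.parquet as pq

# Inspect the schema of the file written with version='2.0'.
print(pq.ParquetFile('pyarrow.parquet').schema)
# Expected (abridged):
#   id: INT32 UINT_32
#   name: BYTE_ARRAY UTF8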

Patina answered 8/3, 2018 at 16:48
