I'm using PyArrow to write Parquet files from some Pandas dataframes in Python.
Is there a way that I can specify the logical types that are written to the parquet file?
For example, writing an np.uint32 column with PyArrow results in an INT64 column in the parquet file, whereas writing the same column using the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow).
E.g.:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
import numpy as np
df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)
# write parquet file using PyArrow
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet')
# write parquet file using fastparquet
fp.write('fastparquet.parquet', df)
# print schemas of both written files
print('PyArrow:', pq.ParquetFile('pyarrow.parquet').schema)
print('fastparquet:', pq.ParquetFile('fastparquet.parquet').schema)
this outputs:
PyArrow: <pyarrow._parquet.ParquetSchema object at 0x10ecf9048>
id: INT64
name: BYTE_ARRAY UTF8
fastparquet: <pyarrow._parquet.ParquetSchema object at 0x10f322848>
id: INT32 UINT_32
name: BYTE_ARRAY UTF8
I'm having similar issues with other column types, so I'm really looking for a generic way to specify the logical types that are used when writing with PyArrow.
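For reference, this is the kind of thing I mean by "specifying the types": constructing the Arrow table from an explicit schema (the pa.uint32() below is just for illustration). I'm not sure whether this is the right hook, or whether there's a writer option that controls the logical types in the file instead:
# sketch: build the table with an explicit Arrow schema before writing
schema = pa.schema([('id', pa.uint32()), ('name', pa.string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'pyarrow_explicit_schema.parquet')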