PyArrow: Store list of dicts in parquet using nested types

I want to store the following pandas data frame in a parquet file using PyArrow:

import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})

The type of the field column is list of dicts:

      field
0  [{}, {}]

I first define the corresponding PyArrow schema:

import pyarrow as pa
schema = pa.schema([pa.field('field', pa.list_(pa.struct([])))])

Then I use from_pandas():

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

This throws the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 930, in pyarrow.lib.Table.from_pandas
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 371, in dataframe_to_arrays
    convert_types)]
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in <listcomp>
    for c, t in zip(columns_to_convert,
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 366, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unknown list item type: struct<>

Am I doing something wrong or is this not supported by PyArrow?

I use pyarrow 0.9.0, pandas 0.23.4, and Python 3.6.

Yurt answered 21/2, 2019 at 22:7

According to this Jira issue, reading and writing nested Parquet data with a mix of struct and list nesting levels was implemented in version 2.0.0.

The following example demonstrates the implemented functionality by doing a round trip: pandas data frame -> parquet file -> pandas data frame. The PyArrow version used here is 3.0.0.

The initial pandas data frame has one field of type list of dicts and a single entry:

                  field
0  [{'a': 1}, {'a': 2}]

Example code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})
schema = pa.schema(
    [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])
table_write = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table_write, 'test.parquet')
table_read = pyarrow.parquet.read_table('test.parquet')
table_read.to_pandas()

The output data frame is the same as the input data frame, as it should be:

                  field
0  [{'a': 1}, {'a': 2}]
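
As an extra check, the nested type can also be confirmed by reading back just the Parquet schema (a minimal sketch, assuming the test.parquet file written above):

import pyarrow.parquet

# The schema should show `field` as a list of struct<a: int64>.
print(pyarrow.parquet.read_schema('test.parquet'))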
Yurt answered 23/10, 2019 at 13:12
There were some bugs in 2.0.0 for writing; no writing bugs have been reported against 3.0.0. – Rodin

Here is a snippet to reproduce this bug:

#!/usr/bin/env python3
import pandas as pd  # type: ignore


def main():
    """Main function"""
    df = pd.DataFrame()
    df["nested"] = [[dict()] for i in range(10)]

    df.to_feather("test.feather")
    print("Success once")
    df = pd.read_feather("test.feather")
    df.to_feather("test.feather")


if __name__ == "__main__":
    main()

Note that writing from pandas to feather does not break; but once the dataframe is loaded back from feather and written out again, it does break.

To solve this, just update to pyarrow 2.0.0:

pip3 install pyarrow==2.0.0
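
To confirm which pyarrow version is actually installed before re-running the snippet, a minimal check:

import pyarrow as pa

# The feather round trip above should work once this prints 2.0.0 or later.
print(pa.__version__)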

Available pyarrow versions as of 2020-11-16:

0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.17.1, 1.0.0, 1.0.1, 2.0.0

Potiche answered 16/11, 2020 at 19:50

I've been able to save pandas dataframes that have arrays in columns as parquet, and read them back from parquet into dataframes, by converting the object dtypes to str.

import awswrangler as wr

# `df`, `timezone`, `path` and `table` are assumed to be defined elsewhere.
def mapTypes(x):
    # map a numpy dtype name to the dtype to cast to; str is the default
    return {'object': 'str', 'int64': 'int64', 'float64': 'float64', 'bool': 'bool',
            'datetime64[ns, ' + timezone + ']': 'datetime64[ns, ' + timezone + ']'}.get(x, "str")

table_names = list(df.columns)
table_types = [mapTypes(t.name) for t in df.dtypes]
parquet_table = dict(zip(table_names, table_types))
df_pq = df.astype(parquet_table)

wr.s3.to_parquet(df=df_pq, path=path, dataset=True, database='test', mode='overwrite',
                 table=table.lower(), partition_cols=['realmid'], sanitize_columns=True)

The (omitted) screenshot showed the parquet file stored in S3 being read back into a dataframe with the AWS Data Wrangler library; I've also done this with pyarrow.
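
For comparison, a minimal sketch of the same write done locally with pyarrow instead of AWS Data Wrangler (the output path test.parquet is just an example):

import pyarrow as pa
import pyarrow.parquet

# Write the converted dataframe (df_pq from above) to a local parquet file.
pa_table = pa.Table.from_pandas(df_pq, preserve_index=False)
pyarrow.parquet.write_table(pa_table, 'test.parquet')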

Moorish answered 23/6, 2020 at 18:41
