How to save a pandas DataFrame with custom types using pyarrow and parquet
I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).

Throughout the examples we use:

import pandas as pd
import pyarrow as pa
from bson import ObjectId

Here's a minimal example to show the situation:

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob',   'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)

df.to_parquet('some_path')

And we get:

ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')

I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html

So I wrote the following extension type:

class ObjectIdType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # Since this is not a parametrized type, we don't need any
        # extra metadata to be deserialized.
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Return an instance of this subclass given the serialized metadata.
        return cls()
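
One detail from the linked docs, if I read them right: the type must also be registered, otherwise pyarrow cannot reconstruct it when reading data back. I register a single instance, which I also reuse below:

# Register the extension type once, so pyarrow can rebuild it on read.
objectid_type = ObjectIdType()
pa.register_extension_type(objectid_type)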

And was able to get a working ExtensionArray for my oid column:

values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
pa.ExtensionArray.from_storage(objectid_type, storage_array)

Where I'm now stuck, and cannot find any good solution for online, is how to save my df to parquet while letting it work out which column needs which extension type. I might change columns in the future, and I have several different types that need this treatment.

How can I simply create parquet files from DataFrames and restore them, transparently converting the types?

I tried to create a pyarrow.Table object and append columns to it after preprocessing, but that doesn't work: table.append_column takes binary columns rather than pyarrow.Arrays, and the whole isinstance chain looks like a terrible solution anyway.

table = pa.Table.from_pandas(pd.DataFrame())
for col, values in df.items():

    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )

    elif isinstance(values.iloc[0], ...):
        ...

    else:
        arr = pa.array(values)

    table.append_column(arr, col)  # FAILS (wrong type)

Pseudocode of the ideal solution:

parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)

assert df.equals(new_df)  # types have been correctly restored
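
To make the pseudocode concrete, here is roughly the shape I have in mind, as a minimal sketch (the registry layout and all names are invented). It sidesteps extension types entirely by recording which columns were encoded in the file-level schema metadata:

import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from bson import ObjectId

# Invented registry layout: Python type -> (tag, storage type, encode, decode).
MY_CUSTOM_TYPES_CONVERSIONS = {
    ObjectId: ('objectid', pa.binary(12), lambda oid: oid.binary, ObjectId),
}

def parquetize(df, path, conversions):
    arrays, converted = [], {}
    for col in df.columns:
        sample = df[col].iloc[0]
        for py_type, (tag, storage, encode, _) in conversions.items():
            if isinstance(sample, py_type):
                # Encode the column down to its Arrow storage representation.
                arrays.append(pa.array(df[col].map(encode), type=storage))
                converted[col] = tag
                break
        else:
            arrays.append(pa.array(df[col]))
    # Note: the pandas index is not preserved by from_arrays.
    table = pa.Table.from_arrays(arrays, names=list(df.columns))
    # Remember which columns were encoded, in the file-level metadata.
    table = table.replace_schema_metadata({'converted': json.dumps(converted)})
    pq.write_table(table, path)

def unparquetize(path, conversions):
    table = pq.read_table(path)
    converted = json.loads(table.schema.metadata[b'converted'])
    decoders = {tag: decode for tag, _, _, decode in conversions.values()}
    df = table.to_pandas()
    for col, tag in converted.items():
        df[col] = df[col].map(decoders[tag])
    return df

With the example df from above, parquetize(df, 'oids.parquet', MY_CUSTOM_TYPES_CONVERSIONS) followed by unparquetize('oids.parquet', MY_CUSTOM_TYPES_CONVERSIONS) should pass the equality assertion.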

I'm getting lost in pyarrow's docs as to whether I should use ExtensionType, serialization, or something else to write these functions. Any pointer would be appreciated.

Side note: I do not need parquet at all costs; the main issue is being able to save and restore DataFrames with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the DataFrame, but it was too slow.

Facet answered 17/4, 2020 at 12:8
I can contribute that, at the least, your extension class is missing a function required for conversion from pandas, to_pandas_dtype; see this section in detail: arrow.apache.org/docs/python/…. I still didn't manage to get to a working solution, though (I was attempting this for UUIDs). – Carmine
Why not pickle the DataFrame and store it? You can unpickle it to get back an exact copy. – Lesbos
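
For reference, the pickle suggestion maps to pandas' built-in round trip, which keeps ObjectIds intact (a quick sketch):

df.to_pickle('df.pkl.gz')             # compression inferred from the extension
new_df = pd.read_pickle('df.pkl.gz')
assert df.equals(new_df)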

I think it is probably because ObjectId is not a type that pyarrow knows how to convert, hence the exception during type inference.

I tried the example you provided, casting the oid values to strings during DataFrame creation, and it worked.

Here are the steps:

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob',   'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)

df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1

output:

    name    oid
0   alice   ObjectId('5e9992543bfddb58073803e7')
1   bob     ObjectId('5e9992543bfddb58073803e8')
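
If actual ObjectId instances are needed after reading, a variant of this idea (assuming bson is installed) is to store the plain hex string and map it back on load, for example:

from bson import ObjectId

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': str(ObjectId('5e9992543bfddb58073803e7'))},
        {'name': 'bob',   'oid': str(ObjectId('5e9992543bfddb58073803e8'))},
    ]
)
df.to_parquet('parquet_file.parquet')

df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1['oid'] = df1['oid'].map(ObjectId)  # hex string -> ObjectId instance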

Voltmer answered 28/6, 2021 at 13:47

You could write a method that reads the column names and types and outputs a new DataFrame with the columns converted to compatible types, using a switch-case pattern to choose which type to convert each column to (or whether to leave it as is).
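
A minimal sketch of that idea (the function name and the single case are illustrative):

from bson import ObjectId

def to_compatible(df):
    # Return a copy of df with parquet-incompatible columns converted,
    # switch-casing on each column's Python type; extend with more cases.
    out = df.copy()
    for col in out.columns:
        sample = out[col].iloc[0]
        if isinstance(sample, ObjectId):
            out[col] = out[col].map(lambda oid: oid.binary)  # 12-byte bytes
        # elif isinstance(sample, SomeOtherType): ...
    return out

to_compatible(df).to_parquet('some_path')

Restoring the original types on read would need the inverse mapping, which this pattern does not record.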

Marshy answered 6/4, 2021 at 11:1
