I am trying to store a pandas DataFrame as a Parquet file, but I am running into an issue. One of the columns of my DataFrame contains dictionaries, like this:
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": "Value3"},
    ],
})
df.to_parquet("test.parquet")
Now, that works perfectly fine; the problem arises when one of the nested values of the dictionary has a different type than the rest. For instance:
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": ["Value3"]},
    ],
})
df.to_parquet("test.parquet")
This throws the following error:
ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')
Notice how, for the last row of the DF, the Field property of the ColC dictionary is a list instead of a string.
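For comparison, the conversion appears to succeed when every Field value has the same type (e.g. all lists), since PyArrow can then infer a single nested type for the column; a minimal sketch illustrating this:

import pandas as pd

# Every "Field" value is a list, so PyArrow can infer one
# consistent struct type (Field: list of string) for ColC.
df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": ["Value"]},
        {"Field": ["Value2"]},
        {"Field": ["Value3"]},
    ],
})
df.to_parquet("test.parquet")  # no ArrowInvalid here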
Is there any workaround to be able to store this DF as a Parquet file?
Store ColC as a consistent str type. In OP's original code, ColC is a mix of different types, so PyArrow cannot figure out what type it is. – Febrific
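Following the comment's suggestion, one possible workaround (a sketch, not from the original post) is to serialize ColC to JSON strings before writing, so the column has a single consistent type (str), and to parse the strings back after reading. This trades the nested Parquet representation for a plain string column:

import json

import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": ["Value3"]},
    ],
})

# Serialize the dicts to JSON strings so ColC is uniformly str.
df["ColC"] = df["ColC"].apply(json.dumps)
df.to_parquet("test.parquet")

# Reading back: parse the JSON strings into dicts again.
df2 = pd.read_parquet("test.parquet")
df2["ColC"] = df2["ColC"].apply(json.loads)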