I am trying to store a pandas DataFrame as a Parquet file, but I am running into an issue. One of the columns of my DataFrame contains dictionaries, like this:
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": "Value3"},
    ],
})
df.to_parquet("test.parquet")
Now, that works perfectly fine; the problem arises when one of the nested values of the dictionary has a different type than the rest. For instance:
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": ["Value3"]},
    ],
})
df.to_parquet("test.parquet")
This throws the following error:
ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')
Notice how, for the last row of the DF, the Field property of the ColC dictionary is a list instead of a string.
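For comparison, the conversion appears to succeed when every Field value has the same type (e.g. all lists), since PyArrow can then infer a single nested type for the column; a minimal sketch illustrating this:

import pandas as pd

# Every "Field" value is a list, so PyArrow can infer one
# consistent struct type (Field: list of string) for ColC.
df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": ["Value"]},
        {"Field": ["Value2"]},
        {"Field": ["Value3"]},
    ],
})
df.to_parquet("test.parquet")  # no ArrowInvalid here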
Is there any workaround to be able to store this DF as a Parquet file?
Store ColC as a consistent str type. In OP's original code, ColC is a mix of different types, so PyArrow cannot figure out what type it is. – Febrific
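Following the comment's suggestion, one possible workaround (a sketch, not from the original post) is to serialize ColC to JSON strings before writing, so the column has a single consistent type (str), and to parse the strings back after reading. This trades the nested Parquet representation for a plain string column:

import json

import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": ["Value3"]},
    ],
})

# Serialize the dicts to JSON strings so ColC is uniformly str.
df["ColC"] = df["ColC"].apply(json.dumps)
df.to_parquet("test.parquet")

# Reading back: parse the JSON strings into dicts again.
df2 = pd.read_parquet("test.parquet")
df2["ColC"] = df2["ColC"].apply(json.loads)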