I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a (JSON) string.
So for the columns I've identified I am doing:
df[column] = df[column].astype(str)
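Roughly, the full conversion step looks like this (nested_columns is a hand-maintained list of the column names I've identified so far; the names are just for illustration):

```python
# Columns I've identified by hand as containing nested objects (names are illustrative)
nested_columns = ["geometry", "properties"]

for column in nested_columns:
    df[column] = df[column].astype(str)  # str() repr; .apply(json.dumps) would give strict JSON instead
```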
However, I'm not sure which columns are nested and which are not. When I write with parquet, I see this message:
<stack trace redacted>
File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>
This signals that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How do I find out?
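For reference, the write step that triggers this looks roughly like the following (the output path and the direct pyarrow calls are illustrative; I believe going through pandas' to_parquet with the pyarrow engine hits the same code path):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)        # conversion to an Arrow table succeeds
pq.write_table(table, "data.parquet")   # ArrowInvalid is raised here for the first nested column
```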
When I print the .dtypes of my pandas dataframe, I can't differentiate between string and nested values, because both show up as object.
EDIT: the error does give a hint as to the nested column by showing the struct details, but this is quite time consuming to debug. It also only reports the first offending column, so if you have multiple nested columns this gets quite annoying.
Comments:

So some of your columns contain nested Python objects (list, dict, etc.), and you'd like to convert these to strings? – Umble

I assume the error comes from pyarrow.parquet.write_table. "Nested column" is a term from parquet only and doesn't make much sense for a pandas dataframe. Please define those terms clearly. – Beaumont

You could use df.applymap(type) to get the type of each cell in your dataframe. df.applymap(type).eq(dict).any() returns True for every column that has a dict in any cell, so you could use it to filter the columns. – Vagabond
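A minimal sketch of the detection approach from the last comment, checking for both dict and list cells (the variable names are mine):

```python
# Map every cell to its Python type, then flag columns containing dicts or lists
cell_types = df.applymap(type)
is_nested = cell_types.eq(dict).any() | cell_types.eq(list).any()

nested_columns = df.columns[is_nested].tolist()
print(nested_columns)  # e.g. ['geometry', 'properties']
```

Since applymap touches every cell, on a large dataframe it may be worth running this on df.head() only, assuming the nested columns are nested in every row.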