finding nested columns in pandas dataframe
I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a (JSON) string.

So for the columns I've identified I am doing:

df[column] = df[column].astype(str)

However, I'm not sure which columns are nested and which are not. When I write with parquet, I see this message:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>

This signals that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How do I find out?

When I print the .dtypes of my pandas dataframe, I can't differentiate between string and nested values because both show up as object.
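To illustrate (column names hypothetical): a plain string column and a nested column are indistinguishable by dtype alone.

```python
import pandas as pd

# A plain string column and a nested (dict-valued) column:
# in classic pandas both report dtype "object", so .dtypes cannot tell them apart.
df = pd.DataFrame({
    "plain": ["a", "b"],
    "nested": [{"type": "Point"}, {"type": "Line"}],
})
print(df.dtypes)
```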

EDIT: the error gives a hint as to the nested column by showing struct details, but this is quite time-consuming to debug. Also, it only reports the first error, so if you have multiple nested columns this can get quite annoying.

Hedwig answered 13/4, 2020 at 20:24 Comment(4)
When you say nested column, do you mean any column containing Python objects (list, dict, etc.)? And you'd like to convert these to strings?Umbles
It seems that some columns in your dataframe contain C objects that cannot be handled by pyarrow.parquet.write_table. "Nested column" is a term in parquet only and doesn't make much sense in "pandas dataframe". Please define those terms clearly.Beaumont
Maybe df.applymap(type) to get the type of each cell in your dataframe... df.applymap(type).eq(dict).any() returns True for each column that contains a dict in any cell, so we can use it to filter the columns.Vagabond
@Vagabond I had to do this with a streaming dataset, with data coming from the Outlook API, which always changed and sometimes came with nested & unnested columns. Your approach is very similar to my ownUdale

Casting nested structure to string

If I understand your question correctly, you want to serialize those nested Python objects (list, dict) within df to JSON strings and leave other elements unchanged. It is better to write your own casting method:

import json

def json_serializer(obj):
    if isinstance(obj, (list, dict)):  # add any other types you consider nested structures to this tuple
        return json.dumps(obj)
    return obj

df = df.applymap(json_serializer)

In case the dataframe is huge, using astype(str) on only the nested columns would be faster:

nested_cols = []
for c in df:
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    df[c] = df[c].astype(str)  # converts every element in the column regardless of its type

This approach has a performance benefit thanks to the short-circuit evaluation of any(...): it returns as soon as it hits the first nested object in a column and does not waste time checking the rest. If any of the "Dtype Introspection" methods fits your data, using that would be even faster.

Check the latest version of pyarrow

I assume those nested structures need to be converted to strings only because they cause an error in pyarrow.parquet.write_table. Maybe you don't need to convert them at all: support for nested columns in pyarrow was reportedly added recently (29th Mar 2020, ver 0.17.0), though that support may still be problematic and is under active discussion.
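As a sketch, you could gate the string conversion on the installed pyarrow version; note that at_least is a hypothetical helper of mine, not part of pyarrow's API:

```python
# at_least is a hypothetical helper, not a pyarrow function.
def at_least(version: str, required=(0, 17, 0)) -> bool:
    """Return True if a dotted version string is >= required."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= required

# Usage sketch (assuming pyarrow is installed):
# import pyarrow as pa
# if at_least(pa.__version__):
#     pass  # try writing the nested columns as-is before converting to str
```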

Beaumont answered 23/4, 2020 at 3:57 Comment(3)
This does not answer the question. I’m attempting to find which columns in the data frame contain data such as dictionaries or lists rather than strings. It is not my intention to drop those values. And as I showed in my question I can convert those columns to string but don’t want to convert every column since other columns have integer or Boolean valuesHedwig
@DanielKats Could you at least provide a minimum reproducible example? I really have trouble understanding your questions. What does your input look like and what is the expected output?Beaumont
I misunderstood your sentence "For now I want to ignore this structure...". Now I get that you want to serialize those structures (list, dict) to json strings and leave other columns unchanged.Beaumont

Using a general utility function like infer_dtype() in pandas, you can determine whether a column is nested or not.

from pandas.api.types import infer_dtype

for col in df.columns:
    # 'mixed' is the catchall for anything that is not otherwise specialized
    if infer_dtype(df[col]) == 'mixed':
        df[col] = df[col].astype('str')

If you are targeting specific data types, then see Dtype Introspection
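For example (hypothetical data), infer_dtype distinguishes plain strings from columns holding nested Python objects:

```python
import pandas as pd
from pandas.api.types import infer_dtype

s_str = pd.Series(["a", "b"])             # plain strings
s_nested = pd.Series([{"k": 1}, [1, 2]])  # nested objects

print(infer_dtype(s_str))     # "string"
print(infer_dtype(s_nested))  # "mixed"
```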

Edom answered 25/4, 2020 at 3:10 Comment(0)

I had a similar problem when working with PySpark and a streaming dataset; some columns were nested and some were not.

Given that your dataframe may look like this:

df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
                   'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
                   'C' : [((15,25,87),(22,91))],
                   'D' : 15,
                   'E' : 'A'
                  })


print(df)

                                         A  \
0  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B                         C   D  E  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  ((15, 25, 87), (22, 91))  15  A  

We can stack the first row of your dataframe and use apply with type to get the type of each column as a dictionary.

df.head(1).stack().apply(type).reset_index(0, drop=True).to_dict()
out:
{'A': <class 'dict'>, 'B': <class 'list'>, 'C': <class 'tuple'>, 'D': <class 'int'>, 'E': <class 'str'>}

With this we can use a function to return a tuple of the nested & unnested columns.


Function

def find_types(dataframe):
    col_dict = dataframe.head(1).stack().apply(type).reset_index(0, drop=True).to_dict()
    unnested_columns = [k for (k, v) in col_dict.items() if v not in (dict, set, list, tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns, unnested_columns

In Action.

nested, unnested = find_types(df)

df[unnested]

   D  E
0  15  A

print(df[nested])

                          C                                        A  \
0  ((15, 25, 87), (22, 91))  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  
Udale answered 29/4, 2020 at 18:13 Comment(0)

If you just want to find out which columns are the culprits, write a loop that writes one column at a time and records which ones fail:

bad_cols = []
for i in range(df.shape[1]):
    try:
        df.iloc[:, [i]].to_parquet(...)
    except KeyboardInterrupt:
        raise
    except Exception:  # you may want to catch ArrowInvalid exceptions instead
        bad_cols.append(i)
print(bad_cols)
Il answered 27/4, 2020 at 16:17 Comment(0)
