Split / Explode a column of dictionaries into separate columns with pandas
Asked Answered
I

13

384

I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary of values inside it. The DataFrame df looks like this:

Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame `df2 looks like this:

Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I'm having is that the lists are not the same lengths. But all of the lists only contain up to the same 3 values: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).

The following code USED to work and return exactly what I wanted (df2).

objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the postgreSQL database in this format. Any help or ideas with this issue? is there a way to convert the Unicode?

Ibert answered 6/7, 2016 at 18:47 Comment(0)
E
356

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.


Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0
Elysia answered 6/7, 2016 at 18:51 Comment(9)
@Ibert If it is a string, you can convert that to an actual dict with df[col].map(eval) before converting it to a DataFrameElysia
Yes it seems to be faster but I think the intention is clearer with apply(pd.Series) because apply is flexible.Kill
Is .map(eval) a security risk if the strings may come at some point from unsanitised user input?Burka
Yes, but you can eg use literal_eval (docs.python.org/3/library/ast.html#ast.literal_eval) to limit parsingElysia
Works perfect, but is (much) slower than the new solution (2019) contributed by Lech Birek https://mcmap.net/q/82768/-split-explode-a-column-of-dictionaries-into-separate-columns-with-pandasElectioneer
could you add this https://mcmap.net/q/83107/-split-dictionary-into-individual-columns-in-a-df-duplicate in your answer, which is actually way more faster?Roomful
using apply(pd.Series) is insanely slow! IManic
Hi for pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1), this will spilt all the keys in the dictionaries, is there a way to only include a selection of them so I don't have to get a lot of unneeded columns?Callimachus
Do not ever use eval (or exec) on data that could possibly come from outside the program in any form. It is a critical security risk. You allow the author of the data to run arbitrary code on your computer. It cannot easily be sandboxed, and proper sandboxing is harder than using a proper tool for the job.Freudberg
C
347

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...

Corrincorrina answered 26/3, 2019 at 11:21 Comment(7)
My test shows this is indeed much faster than the .apply() approach in the accepted answerTse
pd.DataFrame(df['b'].tolist()) is better than this for flat structures, normalize does a lot under the hood for deeply nested dicts and will be slower. Recommend Trenton McKinney's answer to this question if you have a column of dicts.Strangulation
sounds interesting but in my case, when the values aren't uniform and there might be less keys in some than others, this didn't work.Ivelisseivens
I have a string of list of dictionaries. Any chances i could make it work?Sample
For my data set, this method is about 40x faster (0.7 vs. 29.2 seconds)! A great option!Blooded
Want to add that if each entry is a string and not a dict, you can use pd.json_normalize(df.b.apply(eval)). The apply(eval) converts the str to a dict in each row.Humanitarian
What will the function to revert operation by pd.json_normalize() ?Sawmill
R
102
  • The fastest method to normalize a column of flat, one-level dicts, as per the timing analysis performed by Shijith in this answer:
    • df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
    • It will not resolve other issues, with columns of list or dicts, that are addressed below, such as rows with NaN, or nested dicts.
  1. pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series)
    • See the %%timeit below. For 1M rows, .json_normalize is 47 times faster than .apply.
  2. Whether reading data from a file, or from an object returned by a database, or API, it may not be clear if the dict column has dict or str type.
    • If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval, or json.loads(…).
  3. Use pd.json_normalize to convert the dicts, with keys as headers and values for rows.
    • There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
  4. Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize
    • If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
    • pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later, using pandas.DataFrame.drop.
  • As a note, if the column has any NaN, they must be filled with an empty dict
import pandas as pd
from ast import literal_eval
import numpy as np

data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}

df = pd.DataFrame(data)

# display(df)
   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}
5        8814                               NaN

# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>

# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}')  # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column is not strings

# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)

# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True)  # uncomment if needed

# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))

# display(df)
   Station ID    a    b    c
0        8809   46    3   12
1        8810   36    5    8
2        8811  NaN    2    7
3        8812  NaN  NaN   11
4        8813   82  NaN   15
5        8814  NaN  NaN  NaN

%%timeit

# dataframe with 1M rows
dfb = pd.concat([df]*20000).reset_index(drop=True)

%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Recognizee answered 8/8, 2020 at 2:49 Comment(0)
D
27

Try this: The data returned from SQL has to converted into a Dict. or could it be "Pollutant Levels" is now Pollutants'

   StationID                   Pollutants
0       8809  {"a":"46","b":"3","c":"12"}
1       8810   {"a":"36","b":"5","c":"8"}
2       8811            {"b":"2","c":"7"}
3       8812                   {"c":"11"}
4       8813          {"a":"82","c":"15"}


df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )

    a    b   c
0   46    3  12
1   36    5   8
2  NaN    2   7
3  NaN  NaN  11
4   82  NaN  15


result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result

   StationID    a    b   c
0       8809   46    3  12
1       8810   36    5   8
2       8811  NaN    2   7
3       8812  NaN  NaN  11
4       8813   82  NaN  15
Destinydestitute answered 6/7, 2016 at 20:47 Comment(0)
F
19

Note : for dictionary with depth=1 (one-level)

>>> df

   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}

speed comparison for a large dataset of 10 million rows

>>> df = pd.concat([df]*2000000).reset_index(drop=True)
>>> print(df.shape)
(10000000, 2)
def apply_drop(df):
    return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)  

def json_normalise_drop(df):
    return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)  

def tolist_drop(df):
    return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)  

def vlues_tolist_drop(df):
    return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)  

def pop_tolist(df):
    return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))  

def pop_values_tolist(df):
    return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))

>>> %timeit apply_drop(df.copy())
1 loop, best of 3: 53min 20s per loop
>>> %timeit json_normalise_drop(df.copy())
1 loop, best of 3: 54.9 s per loop
>>> %timeit tolist_drop(df.copy())
1 loop, best of 3: 6.62 s per loop
>>> %timeit vlues_tolist_drop(df.copy())
1 loop, best of 3: 6.63 s per loop
>>> %timeit pop_tolist(df.copy())
1 loop, best of 3: 5.99 s per loop
>>> %timeit pop_values_tolist(df.copy())
1 loop, best of 3: 5.94 s per loop
+---------------------+-----------+
| apply_drop          | 53min 20s |
| json_normalise_drop |    54.9 s |
| tolist_drop         |    6.62 s |
| vlues_tolist_drop   |    6.63 s |
| pop_tolist          |    5.99 s |
| pop_values_tolist   |    5.94 s |
+---------------------+-----------+

df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())) is the fastest

Frohman answered 27/10, 2020 at 22:45 Comment(0)
S
19

How do I split a column of dictionaries into separate columns with pandas?

pd.DataFrame(df['val'].tolist()) is the canonical method for exploding a column of dictionaries

Here's your proof using a colorful graph.

enter image description here

Benchmarking code for reference.

Note that I am only timing the explosion since that's the most interesting part of answering this question - other aspects of result construction (such as whether to use pop or drop) are tangential to the discussion and can be ignored (it should be noted however that using pop avoids the followup drop call, so the final solution is a bit more performant, but we are still listifying the column and passing it to pd.DataFrame either way).

Additionally, pop destructively mutates the input DataFrame, making it harder to run in benchmarking code which assumes the input is not changed across test runs.


Critique of other solutions

  • df['val'].apply(pd.Series) is extremely slow for large N as pandas constructs Series objects for each row, then proceeds to construct a DataFrame from them. For larger N the performance dips to the order of minutes or hours.

  • pd.json_normalize(df['val'])) is slower simply because json_normalize is meant to work with a much more complex input data - particularly deeply nested JSON with multiple record paths and metadata. We have a simple flat dict for which pd.DataFrame suffices, so use that if your dicts are flat.

  • Some answers suggest df.pop('val').values.tolist() or df.pop('val').to_numpy().tolist(). I don't think it makes much of a difference whether you listify the series or the numpy array. It's one operation less to listify the series directly and really isn't slower so I'd recommend avoiding generating the numpy array in the intermediate step.

Strangulation answered 16/2, 2021 at 9:42 Comment(2)
This comparison is useful, though it's not clear to me if it would work for dictionaries where the shapes are different. It certainly works fine if all the dictionaries have the same keys.Rehearing
fwiw, df['col'].tolist() does not preserve the index while df['col'].apply(np.Series) does.Ailin
M
17

I strongly recommend the method extract the column 'Pollutants':

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

it's much faster than

df_pollutants = df['Pollutants'].apply(pd.Series)

when the size of df is giant.

Mancy answered 10/5, 2019 at 8:31 Comment(2)
would be great if you could explain how/why this works and is so much better! for me it's always faster, and ~200 times faster once you get more than ~1000 rowsLabrador
@SamMason when you do apply the entire data frame is managed by pandas, but when it comes to values it only plays with the numpy ndarrays which is intrincicly faster due to the fact that it has pure c implementations.Ka
S
14

Merlin's answer is better and super easy, but we don't need a lambda function. The evaluation of dictionary can be safely ignored by either of the following two ways as illustrated below:

Way 1: Two steps

# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)

df_pol_ps:
    a   b   c
0   46  3   12
1   36  5   8
2   NaN 2   7
3   NaN NaN 11
4   82  NaN 15

# step 2: concat columns `a, b, c` and drop/remove the `Pollutants` 
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Way 2: The above two steps can be combined in one go:

df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15
Synagogue answered 16/3, 2018 at 3:47 Comment(0)
W
11

You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:

res = df.join(pd.DataFrame(df.pop('b').tolist()))

Benchmarking with other methods:

df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop
Waine answered 22/8, 2018 at 10:25 Comment(0)
T
5

One line solution is following:

>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15
Traitorous answered 7/10, 2019 at 14:12 Comment(0)
D
2
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)
Duero answered 7/7, 2016 at 23:33 Comment(0)
O
1

I've concatenated those steps in a method, you have to pass only the dataframe and the column which contains the dict to expand:

def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with some column which contain a dict to expand
        in columns
    column_to_expand: String with column name of dw
    """
    import pandas as pd

    def convert_to_dict(sequence: str) -> Dict:
        import json
        s = sequence
        json_acceptable_string = s.replace("'", "\"")
        d = json.loads(json_acceptable_string)
        return d    

    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                    axis=1)
    return expanded_dataframe
Ondine answered 26/3, 2020 at 14:17 Comment(0)
W
1

my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

.. would have parsed the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts would not get squashed into a single column in the first place.

Whiplash answered 26/6, 2020 at 10:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.