Replace NaN with empty list in a pandas dataframe
I'm trying to replace some NaN values in my data with an empty list []. However, the list ends up represented as a str, so len() does not return the length I expect. Is there any way to replace a NaN value with an actual empty list in pandas?

In [28]: d = pd.DataFrame({'x' : [[1,2,3], [1,2], np.NaN, np.NaN], 'y' : [1,2,3,4]})

In [29]: d
Out[29]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2        NaN  3
3        NaN  4

In [32]: d.x.replace(np.NaN, '[]', inplace=True)

In [33]: d
Out[33]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [34]: d.x.apply(len)
Out[34]:
0    3
1    2
2    2
3    2
Name: x, dtype: int64
Cantal answered 22/7, 2015 at 15:14 Comment(0)
A
44

This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to use apply here so that the list object is not interpreted as an array to assign back to the df; otherwise pandas will try to align its shape with the original series.
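
For anyone who wants to run the masking approach end to end, here is a minimal self-contained sketch. The names s and mask are illustrative and not from the post above, and the exact assignment behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Object Series mixing lists and missing values (illustrative data).
s = pd.Series([[1, 2, 3], [1, 2], np.nan, np.nan], dtype=object)

# Boolean mask of the missing entries.
mask = s.isna()

# apply() returns a Series holding one fresh list per masked row,
# so the assignment aligns on the index instead of unpacking a list.
s.loc[mask] = s.loc[mask].apply(lambda _: [])

lengths = s.apply(len).tolist()
print(lengths)
```

Because the lambda runs once per masked row, every row gets its own list object rather than a shared one.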

EDIT

Using your updated sample the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:    
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64
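
If several columns need the same treatment, one possible sketch is to loop over them. Note that this rebuilds each column instead of doing a masked assignment, and that the frame, the column names, and list_columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two list-valued columns and one plain column.
df = pd.DataFrame({
    "a": [[1], np.nan, [2, 3]],
    "b": [np.nan, [4], np.nan],
    "y": [1, 2, 3],
})

list_columns = ["a", "b"]  # assumption: the columns that should hold lists

for col in list_columns:
    mask = df[col].isna()
    # Rebuild the column with a new empty list wherever the value is missing.
    df[col] = pd.Series(
        [[] if missing else value for value, missing in zip(df[col], mask)],
        index=df.index,
        dtype=object,
    )

print(df)
```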
Andrea answered 22/7, 2015 at 15:18 Comment(1)
What if we want to extend this to multiple columns of the df? - Ionone
O
12

To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
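
One caveat worth knowing with [[]] * n: in plain Python, list multiplication repeats references to the same inner list, so all n entries are one shared object. The short sketch below (variable names are illustrative) shows the difference from a comprehension, which builds independent lists. This only matters if the lists are mutated in place afterwards.

```python
n = 3

# Multiplication copies the reference, not the list itself.
shared = [[]] * n
shared[0].append(1)  # every entry sees this change

# A comprehension creates a distinct list per position.
independent = [[] for _ in range(n)]
independent[0].append(1)  # only the first entry changes

print(shared)
print(independent)
```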

A quick timing comparison:

def empty_assign_1(s):
    s[s.isna()].apply(lambda x: [])

def empty_assign_2(s):
    [[]] * s.isna().sum()

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 61 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 2.17 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly 28 times faster on these timings (61 ms vs 2.17 ms)!

EDIT: Fixed a bug pointed out by @valentin

You have to be somewhat careful with data types when performing assignment in this case. In the example above the test series is float, but adding [] elements coerces the entire series to object. Pandas will handle that for you if you do something like

idx = series.isna()
series[idx] = series[idx].apply(lambda x: [])

This works because the output of apply is itself a Series. You can test live performance with assignment overhead like so (I've added a string value so the series will be of object dtype; you could instead use a number as the replacement value rather than an empty list to avoid coercion).

def empty_assign_1(s):
    idx = s.isna()
    s[idx] = s[idx].apply(lambda x: [])

def empty_assign_2(s):
    idx = s.isna()
    s.loc[idx] = [[]] * idx.sum()

series = pd.Series(np.random.choice([1, 2, np.nan, '2'], 1000000))

%timeit empty_assign_1(series.copy())
>>> 45.1 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series.copy())
>>> 24 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

About 4 ms of each timing is spent on the copy. Once assignment overhead is included the speedup shrinks to roughly 2x, which is still pretty great.
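
To make the dtype point concrete, here is a small sketch that casts a float series to object explicitly before assigning lists, instead of relying on implicit coercion (which, as far as I know, newer pandas versions warn about). The names s, s_obj and mask are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, np.nan])  # starts out as float64

# Cast up front so the Series can hold arbitrary Python objects.
s_obj = s.astype(object)

mask = s_obj.isna()
s_obj.loc[mask] = s_obj.loc[mask].apply(lambda _: [])

print(s.dtype, s_obj.dtype)
print(s_obj.tolist())
```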

Oriana answered 21/5, 2020 at 21:30 Comment(2)
This answer is misleading since the implementation of the first function empty_assign_1() seems incorrect. It applies the lambda function on every element in the series instead of only on those where the value is actually NaN. It should be s[s.isna()].apply(...). Performing the timing comparison after this fix actually reverses the results so that the first function becomes faster. - Mcavoy
Hah! You actually did catch an error, I seem to have forgotten that isna is not the reciprocal of dropna. That being said, the original post is still correct. The reversal you're observing comes from the unnecessary constructor call to pd.Series (which is also quite slow). Just use [[]]*s.isna().sum() and you'll be back in business. The context of this specific question is complicated by replacing NaNs with a list, because of the way pandas interprets list inputs, so you'll need to create the series with dtype='object' and use .loc for assignment (or replace with a non-list value). - Oriana
F
9

You can also use a list comprehension for this:

d['x'] = [ [] if x is np.NaN else x for x in d['x'] ]
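
A note on the identity check: x is np.NaN only matches the exact np.nan object, so a NaN that came from elsewhere (float('nan'), for instance) slips through, and as far as I know the np.NaN alias is gone in NumPy 2.0 (np.nan still works). A hedged variant, assuming every non-missing entry in the column is a list, tests the type instead:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    "x": [[1, 2, 3], [1, 2], np.nan, float("nan")],  # two different NaN objects
    "y": [1, 2, 3, 4],
})

# Keep real lists; treat anything else in the column as missing.
# Assumption: non-missing entries of 'x' are always lists.
d["x"] = pd.Series(
    [v if isinstance(v, list) else [] for v in d["x"]],
    index=d.index,
    dtype=object,
)

x_lengths = d["x"].apply(len).tolist()
print(x_lengths)
```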
Fabozzi answered 4/8, 2020 at 13:13 Comment(0)
F
0
import pandas as pd
import numpy as np

data = {'column1': [[1, 2], [2, 3], np.nan, [4, 5], np.nan],
        'column2': [np.nan, "Hi", "Hello", np.nan, "H"]}

df = pd.DataFrame(data)

def replace_none_with_empty_list(x):
    if x is np.nan:
        return []
    else:
        return x

df = df.applymap(replace_none_with_empty_list)

print(df)

Wherever a value is NaN, this replaces it with an empty list; otherwise it returns the value unchanged.

  column1 column2
0  [1, 2]      []
1  [2, 3]      Hi
2      []   Hello
3  [4, 5]      []
4      []       H
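
On recent pandas (2.1 and later, as far as I know) applymap is deprecated in favour of DataFrame.map. The sketch below picks whichever method is available and uses pd.isna instead of an identity check; nan_to_empty_list and mapper are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

data = {'column1': [[1, 2], [2, 3], np.nan, [4, 5], np.nan],
        'column2': [np.nan, "Hi", "Hello", np.nan, "H"]}
df = pd.DataFrame(data)

def nan_to_empty_list(value):
    # Lists are returned untouched; pd.isna on a list would return an array.
    if isinstance(value, list):
        return value
    return [] if pd.isna(value) else value

# Use DataFrame.map where it exists, otherwise fall back to applymap.
mapper = df.map if hasattr(df, "map") else df.applymap
result = mapper(nan_to_empty_list)

print(result)
```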
Foetus answered 8/9, 2023 at 10:42 Comment(0)
