How to speed up pandas apply for string matching
I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this.

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'], 'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B'],})

    A   B   C   D
0   A   B   A   A
1   B   C   B   C
2   A   D   D   C
3   B   A   D   B

I have to create new columns containing the number of occurrences of certain strings in each row. I currently do it like this:

for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)

   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1

However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?

Atkins answered 12/8, 2020 at 13:57 Comment(2)
The column names and the values are the same for a reason...? – Deshawndesi
No, I just did it like this for simplicity. – Atkins

Rather than use apply to loop over each row, I looped over each column to compute the sum for each letter:

for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)

This is an improvement in this example, but (from quick testing not shown) it can be roughly equal or worse depending on the shape and size of the data, and probably on how many strings you are looking for.

Some time comparisons:

%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)    
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And for other answers here:

%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df1=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
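
As a side note (not part of the original answer), the same column-wise counts can be built in a single pass and joined once, which avoids inserting columns into df one at a time inside the loop. A minimal sketch, assuming the df and the target strings from the question:

import pandas as pd

# One per-row count Series for each target string, concatenated and joined in one go
targets = ['A', 'B', 'C', 'D']
counts = pd.concat({f'n_{s}': (df == s).sum(axis=1) for s in targets}, axis=1)
df = df.join(counts)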
Vernellvernen answered 12/8, 2020 at 14:13 Comment(1)
Thank you, this method is the fastest, also on the actual data. I have to do this for 250 string values (creating 250 new columns), checking strings in 10 columns. My way was 757 seconds, yours 5.9 seconds. Massive improvement! – Atkins

Use stack + str.get_dummies and then sum on level=0 and join it with df:

df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))

Result:

print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
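
A note not from the original answer: recent pandas versions have removed the level argument from sum, so sum(level=0) no longer works there. A sketch of the same idea using groupby on the index level instead:

import pandas as pd

# Same approach, grouping by the row index rather than passing level= to sum
df1 = df.join(
    df.stack()             # long Series indexed by (row, original column)
      .str.get_dummies()   # one 0/1 indicator column per distinct string
      .groupby(level=0)    # collapse back to one row per original index
      .sum()
      .add_prefix('n_')
)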
Ephemeron answered 12/8, 2020 at 14:2 Comment(0)

Try get_dummies and sum with level; here we do not need stack :-)

df = df.join(pd.get_dummies(df, prefix='n', prefix_sep='_').sum(1, level=0))
Out[57]: 
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
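
Another side note, not from the answer: sum(1, level=0) relies on summing duplicate column labels, which recent pandas versions no longer support through sum. A hedged sketch of an equivalent using transpose plus groupby:

import pandas as pd

# get_dummies with a shared prefix yields duplicate 'n_A', 'n_B', ... labels,
# one set per original column; transpose, group the labels, sum, transpose back
dummies = pd.get_dummies(df, prefix='n', prefix_sep='_')
counts = dummies.T.groupby(level=0).sum().T
df1 = df.join(counts)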
Bunnybunow answered 12/8, 2020 at 14:3 Comment(0)

You could do:

counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
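
A small follow-up, not part of the original answer: value_counts leaves NaN for letters that do not appear in a row, so fillna(0) produces float columns; an astype(int) restores integer counts. A sketch:

import pandas as pd

# Per-row value counts, missing letters filled with 0, cast back to integers
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0).astype(int)
counts.columns = [f'n_{col}' for col in counts.columns]
df1 = df.join(counts)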
Adversative answered 12/8, 2020 at 14:4 Comment(0)
