I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this.
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'], 'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B']})
   A  B  C  D
0  A  B  A  A
1  B  C  B  C
2  A  D  D  C
3  B  A  D  B
I have to create new columns containing the number of occurrences of certain strings in each row. My current approach is:
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?
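One direction that may be worth trying, as a sketch only: `apply(..., axis=1)` calls the Python lambda once per row, whereas comparing the whole block of columns at once with `DataFrame.eq` and summing across `axis=1` keeps the work inside pandas. The names `cols` and `values` below are illustrative, and the actual speedup on the real files is untested here.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'B', 'A', 'B'],
    'B': ['B', 'C', 'D', 'A'],
    'C': ['A', 'B', 'D', 'D'],
    'D': ['A', 'C', 'C', 'B'],
})

# Illustrative names: select the string columns once, before any
# count columns are added, so the counts only look at the original data.
cols = ['A', 'B', 'C', 'D']
values = df[cols]

for elem in ['A', 'B', 'C', 'D']:
    # eq() compares every cell to elem in one vectorized step;
    # sum(axis=1) then counts the True values in each row.
    df['n_{}'.format(elem)] = values.eq(elem).sum(axis=1)

print(df)
```

On the small example this should give the same `n_A` to `n_D` columns as the loop with `apply`; timing both versions on one of the real files would show whether it helps enough.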