I am trying to create a new column in a dataframe based on pairs of information and its previous values. Although the code that I run is correct, and gives the results I need, it is very slow when I run it on a large dataframe. So I susspect I am not using all of the Python power for this task. Is there a more efficient and faster way of doing this in Python?.
To put you in context, let me explain to you a little about what I am looking for:
I have a dataframe, which describes competitions results, where for each 'date' you can see the 'type' who competed and its score called 'xx'.
What my code does is to obtain the difference of score 'xx' between 'type' for each 'date' and then get the sum of difference of the results of the previous competitions that all the types competing with each other had in the past ('win_comp_past_difs').
Below you can see the data and the model with its output.
## I. DATA AND MODEL ##
I.1. Data
import pandas as pd
import numpy as np
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])
df['xx'] = df.xx.astype('float')
Which looks like this:
xx
date type
2018-01-01 A 1.0
B 5.0
2018-02-01 B 3.0
2018-03-01 A 2.0
B 7.0
C 3.0
D 1.0
E 6.0
2018-05-01 B 3.0
2018-06-01 A 5.0
B 2.0
C 3.0
2018-07-01 A 1.0
2018-08-01 B 9.0
C 3.0
2018-09-01 A 2.0
B 7.0
2018-10-01 C 3.0
A 6.0
B 8.0
2018-11-01 A 2.0
2018-12-01 B 7.0
C 9.0
I.2. Model (very slow in a large dataframe)
# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
teams = x.index.get_level_values(1)
tmp = pd.DataFrame(x[:,None]-x[None,:],columns = teams.values,index=teams.values).stack()
return tmp[tmp.index.get_level_values(0)!=tmp.index.get_level_values(1)]
new_df = df.groupby('date').xx.apply(get_diff).to_frame()
# group by players
groups = new_df.groupby(level=[1,2])
# sum function
def cumsum_shift(x):
return x.cumsum().shift()
# assign new values
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0,1])
Below you can see how the output of the model looks like:
xx win_comp_past_difs
date type
2018-01-01 A 1.0 0.0
B 5.0 0.0
2018-02-01 B 3.0 NaN
2018-03-01 A 2.0 -4.0
B 7.0 4.0
C 3.0 0.0
D 1.0 0.0
E 6.0 0.0
2018-05-01 B 3.0 NaN
2018-06-01 A 5.0 -10.0
B 2.0 13.0
C 3.0 -3.0
2018-07-01 A 1.0 NaN
2018-08-01 B 9.0 3.0
C 3.0 -3.0
2018-09-01 A 2.0 -6.0
B 7.0 6.0
2018-10-01 C 3.0 -10.0
A 6.0 -10.0
B 8.0 20.0
2018-11-01 A 2.0 NaN
2018-12-01 B 7.0 14.0
C 9.0 -14.0
Just in case it is difficult for you to understand what does the User-Defined function (def) do, let me explain it to you below.
For this porpouse I will work with one group of the groupby of the dataframe.
Below you will see an explanation of how the User-Defines function work.
## II. EXPLANATION OF THE USER-DEFINED FUNCTION ##
So, for you to see how the User-defined function work let me select an specific group of the groupby.
II.1 Choosing an specific group
gb = df.groupby('date')
gb2 = gb.get_group((list(gb.groups)[2]))
Which looks like this:
xx
date type
2018-03-01 A 2.0
B 7.0
C 3.0
D 1.0
E 6.0
II.2 Creating a list of competitors (teams)'
teams = gb2.index.get_level_values(1)
II.3 Creating a dataframe of the difference of 'xx' between 'type'
df_comp= pd.DataFrame(gb2.xx[:,None]-gb2.xx[None,:],columns = teams.values,index=teams.values)
Which looks like this:
A B C D E
A 0.0 -5.0 -1.0 1.0 -4.0
B 5.0 0.0 4.0 6.0 1.0
C 1.0 -4.0 0.0 2.0 -3.0
D -1.0 -6.0 -2.0 0.0 -5.0
E 4.0 -1.0 3.0 5.0 0.0
From this point I use the stack() function as an intermediate step to go back to the original dataframe. The rest you can follow it in the I. DATA AND MODEL.
If you could elaborate on the code to make it more efficient and execute faster, I would really appreciate it.
NaN
s, is it OK to make0.0
instead ofNaN
? – Stadlerget_diff
and tested on the sample data above. The modified version about 40% faster. I think it is still not fast enough on your real dataset – Stadler