Pandas / Python - Very slow performance using stack() groupby() and apply()

I am trying to create a new column in a dataframe based on pairs of information and their previous values. Although the code I run is correct and gives the results I need, it is very slow on a large dataframe, so I suspect I am not using the full power of Python for this task. Is there a more efficient and faster way of doing this?

To give you some context, let me explain what I am looking for:

I have a dataframe of competition results: for each 'date' you can see which 'type' competed and its score, called 'xx'.

My code obtains the difference in score 'xx' between each pair of 'type' values on each 'date', and then, for each type, sums the differences from all previous competitions against the types it faces on the current date ('win_comp_past_difs'). For example, on 2018-06-01 type A gets -10: it previously lost to B by -4 (= 1 - 5) on 2018-01-01 and by -5 (= 2 - 7) on 2018-03-01, and to C by -1 (= 2 - 3) on 2018-03-01, so -4 - 5 - 1 = -10.

Below you can see the data and the model with its output.

## I. DATA AND MODEL ##

I.1. Data

import pandas as pd
import numpy as np

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18',
                 'Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18',
                 'Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18',
                 'Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18']),
       np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C',
                 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},
        {'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},
        {'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},
        {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names = ['date', 'type']
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index(['date', 'type'])
df['xx'] = df.xx.astype('float')

Which looks like this:

                  xx
date       type
2018-01-01 A     1.0
           B     5.0
2018-02-01 B     3.0
2018-03-01 A     2.0
           B     7.0
           C     3.0
           D     1.0
           E     6.0
2018-05-01 B     3.0
2018-06-01 A     5.0
           B     2.0
           C     3.0
2018-07-01 A     1.0
2018-08-01 B     9.0
           C     3.0
2018-09-01 A     2.0
           B     7.0
2018-10-01 C     3.0
           A     6.0
           B     8.0
2018-11-01 A     2.0
2018-12-01 B     7.0
           C     9.0

I.2. Model (very slow in a large dataframe)

# get pairwise differences within each date; useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    # outer difference: a len(x) x len(x) matrix of xx[i] - xx[j]
    tmp = pd.DataFrame(x.values[:, None] - x.values[None, :],
                       columns=teams.values, index=teams.values).stack()
    # drop the self-pairs (the diagonal of the matrix)
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

new_df = df.groupby('date').xx.apply(get_diff).to_frame()

# group by pairs of players
groups = new_df.groupby(level=[1, 2])

# running sum of past differences, shifted so the current game is excluded
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values
# (.sum(level=...) requires pandas < 2.0; on newer versions use
# .groupby(level=[0, 1]).sum() instead)
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0, 1])

Below you can see what the output of the model looks like:

                  xx  win_comp_past_difs
date       type
2018-01-01 A     1.0                 0.0
           B     5.0                 0.0
2018-02-01 B     3.0                 NaN
2018-03-01 A     2.0                -4.0
           B     7.0                 4.0
           C     3.0                 0.0
           D     1.0                 0.0
           E     6.0                 0.0
2018-05-01 B     3.0                 NaN
2018-06-01 A     5.0               -10.0
           B     2.0                13.0
           C     3.0                -3.0
2018-07-01 A     1.0                 NaN
2018-08-01 B     9.0                 3.0
           C     3.0                -3.0
2018-09-01 A     2.0                -6.0
           B     7.0                 6.0
2018-10-01 C     3.0               -10.0
           A     6.0               -10.0
           B     8.0                20.0
2018-11-01 A     2.0                 NaN
2018-12-01 B     7.0                14.0
           C     9.0               -14.0

In case it is difficult to understand what the user-defined function (get_diff) does, let me explain it below.

## II. EXPLANATION OF THE USER-DEFINED FUNCTION ##

To see how it works, I will select a specific group of the groupby.

II.1 Choosing a specific group

gb = df.groupby('date')
gb2 = gb.get_group(list(gb.groups)[2])

Which looks like this:

                    xx
  date       type
  2018-03-01 A     2.0
             B     7.0
             C     3.0
             D     1.0
             E     6.0

II.2 Creating a list of competitors (teams)

teams = gb2.index.get_level_values(1)

II.3 Creating a dataframe of the difference of 'xx' between 'type'

df_comp = pd.DataFrame(gb2.xx.values[:, None] - gb2.xx.values[None, :],
                       columns=teams.values, index=teams.values)

Which looks like this:

     A    B    C    D    E
A  0.0 -5.0 -1.0  1.0 -4.0
B  5.0  0.0  4.0  6.0  1.0
C  1.0 -4.0  0.0  2.0 -3.0
D -1.0 -6.0 -2.0  0.0 -5.0
E  4.0 -1.0  3.0  5.0  0.0

From this point I use the stack() function as an intermediate step to get back to the shape of the original dataframe. You can follow the rest of the steps in section I. DATA AND MODEL.
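
To make that step concrete, here is the stacking and the self-pair filtering applied to this group (this simply mirrors the body of get_diff above):

df_comp_s = df_comp.stack()  # long format: a MultiIndex of (type, type) pairs
# drop the self-pairs, i.e. the diagonal of the matrix above
df_comp_s = df_comp_s[df_comp_s.index.get_level_values(0)
                      != df_comp_s.index.get_level_values(1)]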

If you could elaborate on the code to make it more efficient and execute faster, I would really appreciate it.

Segal answered 12/2, 2020 at 18:38 Comment(19)
How large of a DataFrame?Sechrist
I have around 28,000 dates and 17,000 typesSegal
for the NaNs, is it OK to make 0.0 instead of NaN ?Stadler
Yes, it is okay to use 0.0 instead of NaN.Segal
What my code does is to obtain the difference of score 'xx' between 'type' for each 'date' and then get the sum of difference of the results of the previous competitions that all the types competing with each other had in the past ('win_comp_past_difs'). Can you explain this a bit more?Grannie
I explain with an example: on 2018-06-01, type A has a value of -10 given that it previously lost to type B on 2018-01-01 by a difference of -4 (= 1 - 5) and on 2018-03-01 by a difference of -5 (= 2 - 7), and to type C on 2018-03-01 by -1 (= 2 - 3). Thus adding -4 - 5 - 1 = -10.Segal
how long does your solution take to run on your real dataset ?Stadler
Usually about 1-2 hoursSegal
I modified the get_diff function and tested it on the sample data above. The modified version is about 40% faster. I think it is still not fast enough for your real dataset, though.Stadler
40% faster is quite good. Can you show the code in an answer, please? Thanks Andy L!Segal
I posted it. I hope it performs well on your real datasetStadler
Do you mean 17000 different types?Sechrist
yes, 17 thousand different typesSegal
In the example above of type A, why does not diff btw A and D (1) and E (-4) enter into the calculation? Both D and E are also present on 2018-03-01...using this would give: -4 - 5 - 1 +1 - 4 = -13Sechrist
Because D and E are not competing at that date...only A, B and C.Segal
It would be helpful if you could add some more information about the data set to your question. How are the 17k different types distributed? How many participants typically at each date? Maybe a slightly bigger dataset.Sechrist
Usually 10-12 teams per game.Segal
And over the course of 28k dates how many different teams does a team typically meet?Sechrist
The number of times a team is usually active is about 400, playing around 80 per year.Segal

I only modified get_diff. The main points are moving stack outside of get_diff and taking advantage of the fact that stack drops NaN by default, which avoids the filtering inside get_diff.

The new get_diff_s uses np.fill_diagonal to set all diagonal values to NaN and returns a dataframe instead of a filtered series.

def get_diff_s(x):
    teams = x.index.get_level_values(1)
    arr = x.values[:, None] - x.values[None, :]
    np.fill_diagonal(arr, np.nan)  # self-pairs become NaN and vanish on stack
    return pd.DataFrame(arr, columns=teams.values, index=teams.values)

df['win_comp_past_difs'] = (df.groupby('date').xx.apply(get_diff_s)
                              .groupby(level=1).cumsum().stack()
                              .groupby(level=[1,2]).shift().sum(level=[0, 1]))

Out[1348]:
                  xx  win_comp_past_difs
date       type
2018-01-01 A     1.0                 0.0
           B     5.0                 0.0
2018-02-01 B     3.0                 NaN
2018-03-01 A     2.0                -4.0
           B     7.0                 4.0
           C     3.0                 0.0
           D     1.0                 0.0
           E     6.0                 0.0
2018-05-01 B     3.0                 NaN
2018-06-01 A     5.0               -10.0
           B     2.0                13.0
           C     3.0                -3.0
2018-07-01 A     1.0                 NaN
2018-08-01 B     9.0                 3.0
           C     3.0                -3.0
2018-09-01 A     2.0                -6.0
           B     7.0                 6.0
2018-10-01 C     3.0               -10.0
           A     6.0               -10.0
           B     8.0                20.0
2018-11-01 A     2.0                 NaN
2018-12-01 B     7.0                14.0
           C     9.0               -14.0
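
Why does filling the diagonal with NaN replace the filtering? Because stack drops NaN by default, the self-pairs simply disappear. A minimal illustration (not part of the solution itself):

m = pd.DataFrame([[np.nan, -4.0],
                  [4.0, np.nan]], index=['A', 'B'], columns=['A', 'B'])
print(m.stack())  # only the off-diagonal (A, B) and (B, A) entries remain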

Timing:

Original solution (I chained all your commands into a one-liner):

In [1352]: %timeit df.groupby('date').xx.apply(get_diff).groupby(level=[1,2]).a
      ...: pply(lambda x: x.cumsum().shift()).sum(level=[0,1])
82.9 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Revised solution:

In [1353]: %timeit df.groupby('date').xx.apply(get_diff_s).groupby(level=1).cum
      ...: sum().stack().groupby(level=[1,2]).shift().sum(level=[0,1])
47.1 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, on the sample data, it's about 40% faster. However, I don't know how it performs on your real dataset.

Stadler answered 13/2, 2020 at 1:8 Comment(1)
Very interesting how you take advantage of stack's feature that it drops NaN to avoid the filtering inside get_diff. I think this is a good advance. Thank you Andy L.Segal

There is huge overhead in your many layers of indexing.

The best way to tackle this, in my opinion, is to parallelize the processing of each groupby group across different processes. There are several threads on that here on SO which might be helpful.

As an alternative, you may reduce the indexing overhead by managing the indexes yourself:

f, s, t, d = [], [], [], []

for _, sub in df.groupby('date').xx:
    date = sub.index.get_level_values(0)
    i    = sub.index.get_level_values(1)
    tmp  = (sub.values[:, None] - sub.values).ravel()  # flattened outer difference

    f.extend(np.repeat(i, len(i)))     # first team of each pair
    s.extend(np.tile(i, len(i)))       # second team of each pair
    t.extend(tmp)                      # score differences
    d.extend(np.repeat(date, len(i)))  # date of each pair

Then filter and do your cumsum+sum stuff.

inter = pd.DataFrame({'i0': d, 'i1': f, 'i2': s, 'i3': t}).query('i1 != i2')
v = inter.groupby(['i1', 'i2']).i3.apply(lambda s: s.cumsum().shift())
df['rf'] = inter.assign(v=v).set_index(['i0', 'i1']).v.sum(level=[0, 1])

The second block runs quickly even for huge dataframes. The heavy processing is in the groupby, which is why a map-reduce/multiprocessing approach could help a lot.

In this case, the manual index handling is around 5x faster:

1 loop, best of 3: 3.5 s per loop
1 loop, best of 3: 738 ms per loop

The idea here is to give you some directions on where to improve. The per-date operations are independent, so it should be feasible to execute each group in a different process. You can also consider numba.
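
A minimal sketch of that map-reduce idea, assuming get_diff from the question is defined at module level so it can be pickled:

from concurrent.futures import ProcessPoolExecutor

def parallel_get_diff(df):
    # split into per-date groups, map get_diff over them in worker
    # processes, then stitch the per-group results back together
    names, groups = zip(*df.groupby('date').xx)
    with ProcessPoolExecutor() as ex:
        parts = list(ex.map(get_diff, groups))
    return pd.concat(parts, keys=names)

On Windows/macOS this needs the usual if __name__ == '__main__' guard, and for many small groups the pickling overhead can easily outweigh the gains.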

Allegorist answered 13/2, 2020 at 16:24 Comment(1)
Very interesting how manual index handling helps. Thank you rafaelc!Segal

I formulate the problem as I understand it and would like to suggest a slightly different approach that uses only built-ins. Two variations, where the second one uses half the memory and runs in about half the time:

timeit -r10 event_score6(games, scores)                        
21.3 µs ± 165 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)

timeit -r10 event_score(events, games, scores)                 
42.8 µs ± 210 ns per loop (mean ± std. dev. of 10 runs, 10000 loops each)
#
# Assume game data comes from a csv-file that contains reasonably clean data.
#
# We have a list of games each with a list of participating teams and the
# scores for each team in the game.
#
# For each of the pairs in the current game first calculate the sum of the
# differences in score from the previous competitions (win_comp_past_difs);
# include only the pairs in the current game.  Second update each pair in the
# current game with the difference in scores.
#
# Using a defaultdict keep track of the scores for each pair in each game and
# update this score as each game is played.
#
import csv
from collections import defaultdict
from itertools import groupby
from itertools import permutations
from itertools import combinations
from math import nan as NaN


def read_data(data_file):
    """Read and group games and scores by event date

    Sort the participants in each game. Returns header, events, games,
    scores.
    """
    header = ""
    events = []
    games = []
    scores = []
    with open(data_file, newline='') as fd:
        sample = fd.read(1024)
        dialect = csv.Sniffer().sniff(sample)
        fd.seek(0)
        reader = csv.reader(fd, dialect)
        if csv.Sniffer().has_header(sample):
            header = next(reader)
        for event_date, row in groupby(reader, key=lambda r: r[0]):
            _, gg, ss = tuple(zip(*row))
            events.append(event_date.strip())
            gms = (tuple(g.strip() for g in gg))
            scr = (tuple(float(s) for s in ss))
            g, s = zip(*sorted(zip(gms, scr)))
            games.append(g)
            scores.append(s)
    return header, events, games, scores


def event_score(events, games, scores, wd=defaultdict(float)):
    """Score each event and calculare win_comp_past_difs iteratively

    Return the acuumulated state from all events and the
    win_comp_past_difs grouped by event.
    """
    wins = []
    for evnt, game, xx in zip(events, games, scores):
        evnt_wins = []
        if len(game) == 1:
            win_comp_past_difs = NaN
            evnt_wins.append(win_comp_past_difs)
            wins.append(evnt_wins)
            continue

        # Pairs and difference generator for current game.
        pairs = list(permutations(game, 2))
        dgen = (value[0] - value[1] for value in permutations(xx, 2))

        # Sum of differences from previous games including only pair of teams
        # in the current game.
        for team, result in zip(game, xx):
            win_comp_past_difs = sum(wd[key]
                                     for key in pairs if key[0] == team)
            evnt_wins.append(win_comp_past_difs)
        wins.append(evnt_wins)

        # Update pair differences for current game.
        for pair, diff in zip(pairs, dgen):
            wd[pair] += diff
    return wd, wins


def event_score6(games, scores, wd=defaultdict(float)):
    """Score each game and calculare win_comp_past_difs iteratively

    Assume sorted order in each game. Return the acuumulated state from
    all events and the win_comp_past_difs grouped by event.
    """
    wins = []
    for game, xx in zip(games, scores):
        if len(game) == 1:
            wins.append((NaN,))
            continue

        # Pairs for current game.
        pairs = tuple(combinations(game, 2))

        # Sum of differences from previous games including
        # only pair of teams in the current game.
        win_comp_past_difs = defaultdict(float)
        for pair in pairs:
            tmp = wd[pair]
            win_comp_past_difs[pair[0]] += tmp
            win_comp_past_difs[pair[1]] -= tmp
        wins.append(tuple(win_comp_past_difs.values()))

        # Update pair differences for current game.
        for pair, value in zip(pairs, combinations(xx, 2)):
            wd[pair] += value[0] - value[1]
    return wd, wins
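
# Note on the second variation: it halves the bookkeeping by storing each
# unordered pair only once and exploiting antisymmetry; in event_score,
# wd[('A', 'B')] is always the negative of wd[('B', 'A')], hence
# "Elements 20" vs "Elements 10" in the output below. Illustration:
#
#   game = ('A', 'B', 'C')
#   list(permutations(game, 2))  # 6 ordered pairs   -> event_score
#   list(combinations(game, 2))  # 3 unordered pairs -> event_score6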


h, events, games, scores = read_data('data2.csv')

wd, wins = event_score(events, games, scores)
wd6, wins6 = event_score6(games, scores)

print(h)
print("Elements ", len(wd))
for evnt, gm, sc, wns in zip(events, games, scores, wins):
    for team, result, win_comp_past_difs in zip(gm, sc, wns):
        print(f"{evnt} {team}: {result}\t{win_comp_past_difs: 5.1f}")

print(h)
print("Elements ", len(wd6))
for evnt, gm, sc, wns in zip(events, games, scores, wins6):
    for team, result, win_comp_past_difs in zip(gm, sc, wns):
        print(f"{evnt} {team}: {result}\t{win_comp_past_difs: 5.1f}")

A run of the code gives:

['Event', 'Team', 'Score']
Elements  20
Jan-18 A: 1.0     0.0
Jan-18 B: 5.0     0.0
Feb-18 B: 3.0     nan
Mar-18 A: 2.0    -4.0
Mar-18 B: 7.0     4.0
Mar-18 C: 3.0     0.0
Mar-18 D: 1.0     0.0
Mar-18 E: 6.0     0.0
May-18 B: 3.0     nan
Jun-18 A: 5.0   -10.0
Jun-18 B: 2.0    13.0
Jun-18 C: 3.0    -3.0
Jul-18 A: 1.0     nan
Aug-18 B: 9.0     3.0
Aug-18 C: 3.0    -3.0
Sep-18 A: 2.0    -6.0
Sep-18 B: 7.0     6.0
Oct-18 A: 6.0   -10.0
Oct-18 B: 8.0    20.0
Oct-18 C: 3.0   -10.0
Nov-18 A: 2.0     nan
Dec-18 B: 7.0    14.0
Dec-18 C: 9.0   -14.0
['Event', 'Team', 'Score']
Elements  10
Jan-18 A: 1.0     0.0
Jan-18 B: 5.0     0.0
Feb-18 B: 3.0     nan
Mar-18 A: 2.0    -4.0
Mar-18 B: 7.0     4.0
Mar-18 C: 3.0     0.0
Mar-18 D: 1.0     0.0
Mar-18 E: 6.0     0.0
May-18 B: 3.0     nan
Jun-18 A: 5.0   -10.0
Jun-18 B: 2.0    13.0
Jun-18 C: 3.0    -3.0
Jul-18 A: 1.0     nan
Aug-18 B: 9.0     3.0
Aug-18 C: 3.0    -3.0
Sep-18 A: 2.0    -6.0
Sep-18 B: 7.0     6.0
Oct-18 A: 6.0   -10.0
Oct-18 B: 8.0    20.0
Oct-18 C: 3.0   -10.0
Nov-18 A: 2.0     nan
Dec-18 B: 7.0    14.0
Dec-18 C: 9.0   -14.0

Using the file data2.csv

Event, Team, Score
Jan-18, A, 1
Jan-18, B, 5
Feb-18, B, 3
Mar-18, A, 2
Mar-18, B, 7
Mar-18, C, 3
Mar-18, D, 1
Mar-18, E, 6
May-18, B, 3
Jun-18, A, 5
Jun-18, B, 2
Jun-18, C, 3
Jul-18, A, 1
Aug-18, B, 9
Aug-18, C, 3
Sep-18, A, 2
Sep-18, B, 7
Oct-18, C, 3
Oct-18, A, 6
Oct-18, B, 8
Nov-18, A, 2
Dec-18, B, 7
Dec-18, C, 9
Sechrist answered 15/2, 2020 at 22:32 Comment(5)
I like this approach. Do you know an efficient way to create the list of tuples games and scores?Segal
Extended the proposed solution with an example that uses a csv-file as input.Sechrist
I like the way you formulate the problem, but I checked the speed and it is still too slow. Thank you FredrikHedman.Segal
Well, so far I have not even tried to optimize... By what factor compared to the other solutions?Sechrist
About the same speed as the other two answers.Segal

Here is the complete code, which should run as-is, using the concurrent.futures library to parallelize the computation of the get_diff function:

import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18',
                 'Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18',
                 'Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18',
                 'Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18']),
       np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C',
                 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},
        {'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},
        {'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},
        {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names = ['date', 'type']
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index(['date', 'type'])
df['xx'] = df.xx.astype('float')

def get_diff(x):
    teams = x.index.get_level_values(1)
    tmp = pd.DataFrame(x.values[:, None] - x.values[None, :],
                       columns=teams.values, index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

with ProcessPoolExecutor() as executor:
    # note: submitting each group and immediately calling .result() waits for
    # that group to finish, so the groups are still processed one at a time
    new_df = df.groupby('date').xx.apply(lambda x: executor.submit(get_diff, x).result()).to_frame()

# group by players
groups = new_df.groupby(level=[1,2])

def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0,1])
Segal answered 12/1, 2023 at 16:38 Comment(1)
Why is this downvoted? Any explanations? It's incredibly annoying that people are allowed to downvote on SO without any comments.Thigh

We can also use the multiprocessing library, which allows the work to be spread across multiple processes.

import pandas as pd
import numpy as np
from multiprocessing import Pool

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18',
                 'Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18',
                 'Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18',
                 'Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18']),
       np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C',
                 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},
        {'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},
        {'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},
        {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names = ['date', 'type']
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index(['date', 'type'])
df['xx'] = df.xx.astype('float')

def get_diff(x):
    teams = x.index.get_level_values(1)
    tmp = pd.DataFrame(x.values[:, None] - x.values[None, :],
                       columns=teams.values, index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

with Pool() as executor:
    # note: Pool.apply blocks until the result is ready, so the groups are
    # still processed one at a time here
    new_df = df.groupby('date').xx.apply(lambda x: executor.apply(get_diff, (x,))).to_frame()

# group by players
groups = new_df.groupby(level=[1,2])

def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0,1])
Segal answered 13/1, 2023 at 1:8 Comment(0)
