Parallelize pandas apply
New to pandas, I already want to parallelize a row-wise apply operation. So far I found Parallelize apply after pandas groupby; however, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the date in my current row I want to find the number of days to the nearest holiday before and after it.

This is the function I call via apply:

import numpy as np

def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda h: abs(h - pivot))
    difference = abs(nearestHoliday - pivot)
    return difference / np.timedelta64(1, 'D')
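For context, here is a self-contained sketch of how such a function is wired up with apply (the holiday list and dates below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical holidays and dates for illustration
holidays = pd.to_datetime(['2016-01-01', '2016-12-26'])
dates = pd.Series(pd.to_datetime(['2016-01-03', '2016-12-12']))

def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda h: abs(h - pivot))
    return abs(nearestHoliday - pivot) / np.timedelta64(1, 'D')

# one Python-level call per row -- this is the part that is slow at scale
result = dates.apply(lambda d: get_nearest_holiday(holidays, d))
```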

How can I speed it up?

edit

I experimented a bit with Python's multiprocessing pools, but the code wasn't pretty, and I never got the computed results back.

Frailty answered 2/9, 2016 at 5:37 Comment(17)
"python pools" - threads or processes?Unshod
I was using multiprocessing.Pool(processes= #ofCPU)Frailty
Multiprocessing is not guaranteed to speed up your code, but since the code wasn't working correctly, it's hard to know what it was actually running. You might want to make your question about that (FWIW, this approach looks like your best bet to me).Unshod
Would cythonizing not be a good first step before you resort to parallelizing apply?Mammilla
As far as I understand the problem it is embarrassingly parallel e.g. each row is independent, so parallel execution should be better suited.Frailty
@geoHeil I'll be happy to look at it a bit later.Unshod
Have you considered using dask and a dask.dataframe instead? It would give you an easy way to parallelize your calculation "for free"Bicarb
I will have to look into that. Is it generally as good as pandas but only parallelized?Frailty
@AmiTavory I added a minimal code example on github: github.com/geoHeil/pythonQuestions/blob/master/…Frailty
@geoHeil Sorry about that. I'll really try to have a look a bit later on.Unshod
@geoHeil So just a question about the setting - are the holidays all (or most) at fixed dates each year, or do they vary?Unshod
They are from timeanddate.com/holidays/germany I filter for national holidays. These are usually fixed e.g. Christmas on the 24th of DecemberFrailty
@AmiTavory I just updated the minimum example: github.com/geoHeil/pythonQuestions/blob/master/… even for my approach 3 only the existing column is returned. I do not see the computed result.Frailty
@geoHeil I've attempted an answer with an approach that doesn't rely on parallel stuff... Wasn't 100% sure how adamant you were you wanted that or whether you were just trying anything to speed things up...Suppositious
@NinjaPuppy will need to try your solution first but I am looking for a quicker solution. As the problem should parallelize fine I thought this would be the way to go. Unfortunately I am new to python and struggle to set up parallel processing correctly. As you can see the computation is performed in parallel but the results are not returned. If you have an idea what is wrong I would be glad.Frailty
@geoHeil I generally find that by the time I've got parallel stuff working properly I could have just run the thing a few hundred times and been done already :p Anyway... hopefully the suggestion should be fast enough... don't fancy looking into parallel stuff for a Sunday morning :)Suppositious
That sounds greatFrailty

I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's just start with some dates...

import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

from pandas.tseries.holiday import USFederalHolidayCalendar

holiday_calendar = USFederalHolidayCalendar()
holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
               '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
               '2016-11-24', '2016-12-26',
               ...
               '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
               '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
               '2030-11-28', '2030-12-25'],
              dtype='datetime64[ns]', length=150, freq=None)

Now we find the indices of the nearest holiday for the original dates using searchsorted:

indices = holidays.searchsorted(dates)
# array([1, 6, 9, 3])
next_nearest = holidays[indices]
# DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)

Then take the difference between the two:

next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
# array([15, 31, 14, 88])

You'll need to be careful with the indices so you don't run off either end of the holiday index, and for the previous holiday do the same calculation with indices - 1, but this should give you (I hope) a relatively good base.
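One way to guard the edges is to clip the indices (a sketch that assumes clamping to the first/last holiday is an acceptable policy; the data reproduces the 2016 example above):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])
holidays = pd.to_datetime(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
                           '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
                           '2016-11-24', '2016-12-26'])

# index of the first holiday >= each date (side='left' by default,
# so a date that falls exactly on a holiday points at that holiday)
idx = holidays.searchsorted(dates)

# clamp so neither lookup runs off the ends of the holiday index
next_idx = np.clip(idx, 0, len(holidays) - 1)
prev_idx = np.clip(idx - 1, 0, len(holidays) - 1)

days_to_next = (holidays[next_idx] - dates).days
days_since_prev = (dates - holidays[prev_idx]).days
```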

Suppositious answered 4/9, 2016 at 8:55 Comment(2)
I updated the minimum example with your code (please see at the bottom). Trying to use my DatetimeIndices for the holidays I receive an index out of bounds.Frailty
Comments are not for extended discussion; this conversation has been moved to chat.Suppositious

For the parallel approach this is the answer based on Parallelize apply after pandas groupby:

import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def get_nearest_dateParallel(df):
    df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

print('parallel version: ')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)
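The snippet above assumes a get_nearest_date helper and a holidays frame that are not shown; a minimal sketch of what those missing pieces might look like (the holiday data here is illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical inputs matching the names used above
holidays = pd.DataFrame({'day': pd.to_datetime(['2016-01-01', '2016-09-05', '2016-12-26'])})

def get_nearest_date(candidates, pivot):
    """Days between pivot and the closest date in candidates (NaN if candidates is empty)."""
    if len(candidates) == 0:
        return np.nan
    nearest = min(candidates, key=lambda d: abs(d - pivot))
    return abs(nearest - pivot) / np.timedelta64(1, 'D')

pivot = pd.Timestamp('2016-01-03')
before = get_nearest_date(holidays.day[holidays.day < pivot], pivot)  # 2.0
after = get_nearest_date(holidays.day[holidays.day > pivot], pivot)   # 246.0
```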

I prefer @NinjaPuppy's approach, though, because it avoids the O(n * number_of_holidays) work.

Frailty answered 4/9, 2016 at 11:59 Comment(0)

I think the pandarallel package makes it much easier to do this now. I have not looked into it deeply, but it should do the trick.

Bud answered 30/7, 2021 at 17:0 Comment(0)

You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    """Your awesome function"""
    return np.sqrt(np.sum(x ** 2))    

df = pd.DataFrame(np.random.random((1000, 1000)))

%%time
res = df.apply(foo, raw=True)

Wall time: 5.3 s

# p_apply - is parallel analogue of apply method
%%time
res = df.p_apply(foo, raw=True, executor='processes')

Wall time: 1.2 s
Caller answered 23/11, 2022 at 6:54 Comment(0)