Parallelize pandas apply
New to pandas, I already want to parallelize a row-wise apply operation. So far I found Parallelize apply after pandas groupby; however, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the date in my current row I want to find the number of days to the nearest holiday before and after it.

This is the function I call via apply:

import numpy as np

def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda h: abs(h - pivot))
    difference = abs(nearestHoliday - pivot)
    return difference / np.timedelta64(1, 'D')
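For context, here is a self-contained sketch of how such a function is wired up with apply (the holiday list and dates below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical holidays and dates for illustration
holidays = pd.to_datetime(['2016-01-01', '2016-12-26'])
dates = pd.Series(pd.to_datetime(['2016-01-03', '2016-12-12']))

def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda h: abs(h - pivot))
    return abs(nearestHoliday - pivot) / np.timedelta64(1, 'D')

# one Python-level call per row -- this is the part that is slow at scale
result = dates.apply(lambda d: get_nearest_holiday(holidays, d))
```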

How can I speed it up?

edit

I experimented a bit with Python's multiprocessing pools, but the code wasn't pretty, and I never got the computed results back.

Frailty answered 2/9, 2016 at 5:37 Comment(17)
"python pools" - threads or processes?Unshod
I was using multiprocessing.Pool(processes= #ofCPU)Frailty
Multiprocessing is not guaranteed to speed up your code, but since the code wasn't working correctly, it's hard to know what it was actually running. You might want to make your question about that (FWIW, this approach looks like your best bet to me).Unshod
Would cythonizing not be a good first step before you resort to parallelizing apply?Mammilla
As far as I understand the problem it is embarrassingly parallel e.g. each row is independent, so parallel execution should be better suited.Frailty
@geoHeil I'll be happy to look at it a bit later.Unshod
Have you considered using dask and a dask.dataframe instead? It would give you an easy way to parallelize your calculation "for free"Bicarb
I will have to look into that. Is it generally as good as pandas but only parallelized?Frailty
@AmiTavory I added a minimal code example on github: github.com/geoHeil/pythonQuestions/blob/master/…Frailty
@geoHeil Sorry about that. I'll really try to have a look a bit later on.Unshod
@geoHeil So just a question about the setting - are the holidays all (or most) at fixed dates each year, or do they vary?Unshod
They are from timeanddate.com/holidays/germany I filter for national holidays. These are usually fixed e.g. Christmas on the 24th of DecemberFrailty
@AmiTavory I just updated the minimum example: github.com/geoHeil/pythonQuestions/blob/master/… even for my approach 3 only the existing column is returned. I do not see the computed result.Frailty
@geoHeil I've attempted an answer with an approach that doesn't rely on parallel stuff... Wasn't 100% sure how adamant you were you wanted that or whether you were just trying anything to speed things up...Suppositious
@NinjaPuppy will need to try your solution first but I am looking for a quicker solution. As the problem should parallelize fine I thought this would be the way to go. Unfortunately I am new to python and struggle to set up parallel processing correctly. As you can see the computation is performed in parallel but the results are not returned. If you have an idea what is wrong I would be glad.Frailty
@geoHeil I generally find that by the time I've got parallel stuff working properly I could have just run the thing a few hundred times and been done already :p Anyway... hopefully the suggestion should be fast enough... don't fancy looking into parallel stuff for a Sunday morning :)Suppositious
That sounds greatFrailty

I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's just start with some dates...

import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

from pandas.tseries.holiday import USFederalHolidayCalendar

holiday_calendar = USFederalHolidayCalendar()
holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
               '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
               '2016-11-24', '2016-12-26',
               ...
               '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
               '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
               '2030-11-28', '2030-12-25'],
              dtype='datetime64[ns]', length=150, freq=None)

Now we find the indices of the nearest holiday for the original dates using searchsorted:

indices = holidays.searchsorted(dates)
# array([1, 6, 9, 3])
next_nearest = holidays[indices]
# DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)

Then take the difference between the two:

next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
# array([15, 31, 14, 88])

You'll need to be careful with the indices so you don't run off either end of the holiday index, and for the previous holiday do the same calculation with indices - 1, but this should give you (I hope) a relatively good base.
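One way to guard the edges is to clip the indices (a sketch that assumes clamping to the first/last holiday is an acceptable policy; the data reproduces the 2016 example above):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])
holidays = pd.to_datetime(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
                           '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
                           '2016-11-24', '2016-12-26'])

# index of the first holiday >= each date (side='left' by default,
# so a date that falls exactly on a holiday points at that holiday)
idx = holidays.searchsorted(dates)

# clamp so neither lookup runs off the ends of the holiday index
next_idx = np.clip(idx, 0, len(holidays) - 1)
prev_idx = np.clip(idx - 1, 0, len(holidays) - 1)

days_to_next = (holidays[next_idx] - dates).days
days_since_prev = (dates - holidays[prev_idx]).days
```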

Suppositious answered 4/9, 2016 at 8:55 Comment(2)
I updated the minimum example with your code (please see at the bottom). Trying to use my DatetimeIndices for the holidays I receive an index out of bounds.Frailty
Comments are not for extended discussion; this conversation has been moved to chat.Suppositious

For the parallel approach this is the answer based on Parallelize apply after pandas groupby:

import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def get_nearest_dateParallel(df):
    df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

print('parallel version: ')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)
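The snippet above assumes a get_nearest_date helper and a holidays frame that are not shown; a minimal sketch of what those missing pieces might look like (the holiday data here is illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical inputs matching the names used above
holidays = pd.DataFrame({'day': pd.to_datetime(['2016-01-01', '2016-09-05', '2016-12-26'])})

def get_nearest_date(candidates, pivot):
    """Days between pivot and the closest date in candidates (NaN if candidates is empty)."""
    if len(candidates) == 0:
        return np.nan
    nearest = min(candidates, key=lambda d: abs(d - pivot))
    return abs(nearest - pivot) / np.timedelta64(1, 'D')

pivot = pd.Timestamp('2016-01-03')
before = get_nearest_date(holidays.day[holidays.day < pivot], pivot)  # 2.0
after = get_nearest_date(holidays.day[holidays.day > pivot], pivot)   # 246.0
```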

I prefer @NinjaPuppy's approach, though, because it avoids the O(n * number_of_holidays) work.

Frailty answered 4/9, 2016 at 11:59 Comment(0)

I think the pandarallel package makes it much easier to do this now. I have not looked into it deeply, but it should do the trick.

Bud answered 30/7, 2021 at 17:0 Comment(0)

You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    """Your awesome function"""
    return np.sqrt(np.sum(x ** 2))    

df = pd.DataFrame(np.random.random((1000, 1000)))

%%time
res = df.apply(foo, raw=True)

Wall time: 5.3 s

# p_apply - is parallel analogue of apply method
%%time
res = df.p_apply(foo, raw=True, executor='processes')

Wall time: 1.2 s
Caller answered 23/11, 2022 at 6:54 Comment(0)