
Parallelize pandas apply


I'm new to pandas, but I already want to parallelize a row-wise apply operation. So far I have found "Parallelize apply after pandas groupby". However, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the date in my current row I want to find the number of days from that date to the nearest holiday before or after it.

This is the function I call via apply:

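import numpy as np

def get_nearest_holiday(x, pivot):
    # x: iterable of holiday timestamps; pivot: the date of the current row
    nearest_holiday = min(x, key=lambda holiday: abs(holiday - pivot))
    difference = abs(nearest_holiday - pivot)
    return difference / np.timedelta64(1, 'D')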

How can I speed it up?

edit

I experimented a bit with Python's pools, but the code was not nice and I did not get my computed results.


Solution

  • I think trying to run this in parallel is probably overcomplicating things. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

    Let's just start with some dates...

    import pandas as pd
    
    dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])
    

    We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

    from pandas.tseries.holiday import USFederalHolidayCalendar
    
    holiday_calendar = USFederalHolidayCalendar()
    holidays = holiday_calendar.holidays('2016-01-01')
    

    This gives us:

    DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
                   '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
                   '2016-11-24', '2016-12-26',
                   ...
                   '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
                   '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
                   '2030-11-28', '2030-12-25'],
                  dtype='datetime64[ns]', length=150, freq=None)
    

    Now we find the indices of the next holiday on or after each of the original dates using searchsorted:

    indices = holidays.searchsorted(dates)
    # array([1, 6, 9, 3])
    next_nearest = holidays[indices]
    # DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)
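
    As a minimal sketch of how searchsorted behaves at the edges, the snippet below uses a hypothetical three-entry holiday list (not the calendar above) and made-up dates. It shows what the side argument does when a date is itself a holiday, and which index comes back for a date after the last holiday.

```python
import pandas as pd

# Hypothetical three-entry holiday list, used only for illustration.
holidays = pd.DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15'])
dates = pd.to_datetime(['2016-01-18', '2016-01-19', '2016-03-01'])

# side='left' (the default): a date that is itself a holiday maps to that holiday.
left = holidays.searchsorted(dates, side='left')

# side='right': a date that is itself a holiday maps to the following holiday.
right = holidays.searchsorted(dates, side='right')

# A date after the last holiday gets index len(holidays), which is out of range,
# so holidays[left] would raise an IndexError for the last date here.
out_of_range = left == len(holidays)
```

    With the default side, a row that falls exactly on a holiday ends up with a difference of zero days, which is probably what you want for a "nearest holiday" feature.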
    

    Then take the difference between the two:

    next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
    # array([15, 31, 14, 88])
    

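    You'll need to be careful with the indices so they don't run off either end of the holiday index: searchsorted returns len(holidays) for a date after the last holiday, and indices - 1 is -1 (which wraps around to the last holiday) for a date before the first one. For the previous holiday, do the same calculation with indices - 1. Still, this should (I hope) act as a relatively good base.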
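
    Putting the pieces together, here is one possible sketch that computes the distance to the next holiday, to the previous holiday, and to the nearer of the two, while guarding both ends of the holiday index. The function name days_to_nearest_holiday and the column names are made up for this example, and it assumes holidays is a sorted, non-empty DatetimeIndex. The holiday list is the set of 2016 dates from the output shown earlier, typed in by hand.

```python
import numpy as np
import pandas as pd


def days_to_nearest_holiday(dates, holidays):
    """Days to the next holiday, since the previous one, and to the nearer of both.

    Hypothetical helper; assumes `holidays` is a sorted, non-empty DatetimeIndex.
    """
    dates = pd.DatetimeIndex(dates)
    idx = holidays.searchsorted(dates)  # position of the first holiday >= date

    # Clip so the lookups stay in range; invalid positions are masked below.
    last = len(holidays) - 1
    next_idx = np.clip(idx, 0, last)
    prev_idx = np.clip(idx - 1, 0, last)

    one_day = np.timedelta64(1, 'D')
    days_until = (holidays.values[next_idx] - dates.values) / one_day
    days_since = (dates.values - holidays.values[prev_idx]) / one_day

    # There is no next holiday after the last one and no previous one before the first.
    days_until = np.where(idx == len(holidays), np.nan, days_until)
    days_since = np.where(idx == 0, np.nan, days_since)

    return pd.DataFrame(
        {
            'days_until_next': days_until,
            'days_since_prev': days_since,
            # np.fmin ignores a NaN when the other value is present.
            'days_to_nearest': np.fmin(days_until, days_since),
        },
        index=dates,
    )


# The 2016 holidays from the output above, entered explicitly.
holidays = pd.DatetimeIndex([
    '2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30', '2016-07-04',
    '2016-09-05', '2016-10-10', '2016-11-11', '2016-11-24', '2016-12-26',
])
dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])
result = days_to_nearest_holiday(dates, holidays)
```

    Everything here works on whole arrays, so there is no row-wise apply left to parallelize. For a DataFrame with a date column, you could pass that column in and assign the resulting columns back by position.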