I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll
defined below:
import datetime
import pandas as pd
demoA = pd.DataFrame({'date': [datetime.date(2010, 1, 1), datetime.date(2010, 1, 2), datetime.date(2010, 1, 3)],
                      'ticker': ['A', 'A', 'A'],
                      'x1': [10, 20, 30],
                      'close': [120, 133, 129]}).set_index('date', drop=True)
demoB = pd.DataFrame({'date': [datetime.date(2010, 1, 1), datetime.date(2010, 1, 2), datetime.date(2010, 1, 3)],
                      'ticker': ['B', 'B', 'B'],
                      'x1': [18, 11, 45],
                      'close': [50, 49, 51]}).set_index('date', drop=True)
demoAll = pd.concat([demoA, demoB])
print(demoAll)
The result is:
            ticker  x1  close
date
2010-01-01       A  10    120
2010-01-02       A  20    133
2010-01-03       A  30    129
2010-01-01       B  18     50
2010-01-02       B  11     49
2010-01-03       B  45     51
I also have a dictionary mapping of tickers to model objects
ticker2model = {'A':model_A, 'B':model_B,...}
where each model has a self.predict(df)
method that takes in an entire DataFrame and returns a Series of the same length.
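To make the interface concrete, here is a hypothetical minimal model satisfying it: any object with a predict(df) method that accepts a DataFrame and returns a Series of the same length will do. (MeanModel is purely illustrative, not one of the real model_A / model_B objects.)

```python
import pandas as pd

class MeanModel:
    """Toy model: predicts the row-wise mean of the feature columns."""
    def predict(self, df):
        # returns a Series with one prediction per input row
        return df[['x1', 'close']].mean(axis=1)

df = pd.DataFrame({'x1': [10, 20], 'close': [120, 133]})
preds = MeanModel().predict(df)
print(len(preds) == len(df))  # True
```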
I would now like to create a new column, demoAll['predictions'], that contains these predictions. What is the cleanest/most efficient way of doing this? A few things to note:
demoAll was the concatenation of ticker-specific dataframes that were each indexed just by date. Thus the indices of demoAll are not unique. (However, the combination of date/ticker IS unique.)
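A small sketch of that note: the date index alone carries duplicates, but moving ticker into the index as well makes each label unique (toy two-row frame for illustration only):

```python
import datetime
import pandas as pd

demoAll = pd.DataFrame({
    'date': [datetime.date(2010, 1, 1), datetime.date(2010, 1, 1)],
    'ticker': ['A', 'B'],
    'x1': [10, 18],
    'close': [120, 50],
}).set_index('date', drop=True)

# the date index alone is duplicated...
print(demoAll.index.is_unique)  # False

# ...but a (date, ticker) MultiIndex is unique
unique_idx = demoAll.set_index('ticker', append=True)
print(unique_idx.index.is_unique)  # True
```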
My thinking has been to do something like the example below, but I am running into issues with indexing, data-type coercions, and slow run times. The real dataset is quite large (in both rows and columns).
demoAll['predictions'] = demoAll.groupby('ticker').apply(
    lambda x: ticker2model[x.name].predict(x)
)
I may have misunderstood what you are passing to the model in order to predict, but if I have understood correctly I would do the following: loop over the unique tickers, filter demoAll by ticker, call each model's predict on the non-ticker columns of the filtered rows, and assign the results back to demoAll as demoAll['predictions'].
An example using your code:
# get non-'ticker' columns
non_ticker_cols = [col for col in demoAll.columns if col != 'ticker']

# get the unique set of tickers
tickers = demoAll.ticker.unique()

# create and pre-populate the predictions column
demoAll['predictions'] = 0

for ticker in tickers:
    # get a boolean Series to filter the DataFrame by
    filter_by_ticker = demoAll.ticker == ticker
    # filter, predict and assign
    demoAll.loc[filter_by_ticker, 'predictions'] = ticker2model[ticker].predict(
        demoAll.loc[filter_by_ticker, non_ticker_cols]
    )
The output would look like:
            ticker  x1  close  predictions
date
2010-01-01       A  10    120         10.0
2010-01-02       A  20    133         10.0
2010-01-03       A  30    129         10.0
2010-01-01       B  18     50        100.0
2010-01-02       B  11     49        100.0
2010-01-03       B  45     51        100.0
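For completeness, the groupby idea from the question can also be made to work by building the predictions per group and concatenating them, which sidesteps the index-alignment problems of assigning the raw groupby/apply result directly. A sketch with stub constant models (hypothetical stand-ins for the real ticker2model entries) and a default integer index:

```python
import pandas as pd

class ConstantModel:
    """Stub model: always predicts the same value, one per input row."""
    def __init__(self, value):
        self.value = value

    def predict(self, df):
        return [self.value] * len(df)

demoAll = pd.DataFrame({
    'ticker': ['A', 'A', 'B', 'B'],
    'x1': [10, 20, 18, 11],
    'close': [120, 133, 50, 49],
})
ticker2model = {'A': ConstantModel(10.0), 'B': ConstantModel(100.0)}

# predict per group, keep each group's index, then concatenate and assign
parts = [
    pd.Series(ticker2model[name].predict(group), index=group.index)
    for name, group in demoAll.groupby('ticker')
]
demoAll['predictions'] = pd.concat(parts)
print(demoAll['predictions'].tolist())  # [10.0, 10.0, 100.0, 100.0]
```

Note that this assignment relies on index alignment, so with the duplicated date index from the question it can misbehave; assigning through .loc with a boolean mask, as in the loop above, is the safer choice there.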
Comparison to using apply
We could use apply per row, but as you mentioned it would be slow. I will compare the two approaches to give an idea of the speed-up.
Setup
I will use DummyRegressor from sklearn so that I can call a predict method, and create the dictionary you mention in your question.
import numpy as np
from sklearn.dummy import DummyRegressor

model_a = DummyRegressor(strategy='mean')
model_b = DummyRegressor(strategy='median')
model_a.fit([[10, 14]], y=np.array([10]))
model_b.fit([[200, 200]], [100])
ticker2model = {'A': model_a, 'B': model_b}
Defining both as functions
def predict_by_ticker_filter(df, model_dict):
    # get non-'ticker' columns
    non_ticker_cols = [col for col in df.columns if col != 'ticker']
    # get the unique set of tickers
    tickers = df.ticker.unique()
    # create and pre-populate the predictions column
    df['predictions'] = 0
    for ticker in tickers:
        # get a boolean Series to filter the DataFrame by
        filter_by_ticker = df.ticker == ticker
        # filter, predict and assign
        df.loc[filter_by_ticker, 'predictions'] = model_dict[ticker].predict(
            df.loc[filter_by_ticker, non_ticker_cols]
        )
    return df

def model_apply_by_row(df_row, model_dict):
    # converts the row to a list so the predict method can run on a single sample
    return model_dict[df_row['ticker']].predict([df_row[['x1', 'close']].tolist()])[0]
Performance
I use %timeit on each function call to give the following results.
On your example demoAll:
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.78 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.24 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (606, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
320 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.1 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (6006, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.15 s ± 371 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
9.1 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)