I have a pandas DataFrame with a time series (in this case, the price of a used car model) and am looking for a vectorized function that maps each cell to the time it takes the price to grow n percent from that cell (returning nan if the price never reaches n% more).
In theory this should be possible in a vectorized way, since the output for each row is independent of what comes before it.
Here is a sample of the data and the expected output:
import numpy as np
import pandas as pd
import datetime

df = pd.DataFrame(
    [
        1490.47,
        1492.98,
        1494.69,
        1497.43,
        1499.02,
        1503.29,
        1501.60,
        1502.80,
        1502.30,
        1509.38,
        1512.01,
        1508.98,
        1512.63,
    ],
    columns=['price'],
)
df.index.names = ['time']
n = 1 / 100
So in this case I want to run with n=1/100, i.e. compute for each cell the time it takes the price to increase by 1%. For the first cell, 1% growth would be at 1490.47*1.01 = 1505.3747; the first cell greater than this value is 1509.38, which is 9 cells after the first cell, so the output for that cell would be 9, and so on for the rest.
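The arithmetic for the first cell can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `steps_to_growth` is made up for illustration, and it uses `>=` for the comparison (an assumption; for this sample `>` gives the same result).

```python
prices = [
    1490.47, 1492.98, 1494.69, 1497.43, 1499.02, 1503.29, 1501.60,
    1502.80, 1502.30, 1509.38, 1512.01, 1508.98, 1512.63,
]
n = 1 / 100


def steps_to_growth(prices, start, n):
    """Hypothetical helper: number of rows after `start` until the price
    first reaches prices[start] * (1 + n), or None if it never does."""
    target = prices[start] * (1 + n)
    # Scan forward from the starting row; offset counts rows from `start`.
    for offset, value in enumerate(prices[start:]):
        if value >= target:
            return offset
    return None


print(steps_to_growth(prices, 0, n))   # 9 (1509.38 is the first value above 1505.3747)
print(steps_to_growth(prices, 12, n))  # None (last row, nothing after it)
```

This is a plain loop for one row, so it only serves as a reference for what a vectorized or compiled version should return.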
The expected output would then be:
df.some_functions(n=1/100)
print(df)
      price
time
0         9
1         8
2         8
3         9
4       nan
5       nan
6       nan
7       nan
8       nan
9       nan
10      nan
11      nan
12      nan
Row 3 gives 9 because 1497.43*1.01 = 1512.4043 is first exceeded by 1512.63, 9 cells later. The last 9 rows are nan because the price never grows by 1% relative to those cells in the remainder of the DataFrame.
Another option is to use numba (you can even easily parallelize the task):
import numba

@numba.njit(parallel=True)
def search(price, n, out):
    # n is the growth factor (1.01 for +1%); out must be pre-filled with NaN
    for idx in numba.prange(len(price)):
        p = price[idx]
        search_for = p * n
        # scan forward from the current row; idx2 is the offset from idx
        for idx2, v in enumerate(price[idx:]):
            if v >= search_for:
                out[idx] = idx2
                break

df["out"] = np.nan
search(df["price"].values, 1.01, df["out"].values)
print(df)
Prints:
        price  out
time
0     1490.47  9.0
1     1492.98  8.0
2     1494.69  8.0
3     1497.43  9.0
4     1499.02  NaN
5     1503.29  NaN
6     1501.60  NaN
7     1502.80  NaN
8     1502.30  NaN
9     1509.38  NaN
10    1512.01  NaN
11    1508.98  NaN
12    1512.63  NaN
On my AMD 5700X this took about 0.8 seconds for a DataFrame with 1_000_000 values generated by np.random.uniform(1000, 2000, 1_000_000).
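If you want to cross-check the numba result on a small input without numba, something like the following NumPy broadcasting sketch should reproduce the printed output above. The function name `search_numpy` is made up for illustration, and because it builds a full n-by-n boolean matrix it is only practical for small arrays, not for the 1_000_000-row case.

```python
import numpy as np


def search_numpy(price, factor):
    """Hypothetical cross-check: offset to the first later value that is
    >= price * factor, or NaN if there is none. Uses O(n^2) memory."""
    price = np.asarray(price, dtype=float)
    # hit[i, j] is True when price[j] >= price[i] * factor
    hit = price[None, :] >= (price * factor)[:, None]
    # Keep only columns at or after the row itself (j >= i)
    hit = np.triu(hit)
    first = hit.argmax(axis=1)   # first True per row (0 when there is none)
    found = hit.any(axis=1)      # whether the row has any True at all
    offsets = first - np.arange(len(price))
    return np.where(found, offsets, np.nan)


prices = [
    1490.47, 1492.98, 1494.69, 1497.43, 1499.02, 1503.29, 1501.60,
    1502.80, 1502.30, 1509.38, 1512.01, 1508.98, 1512.63,
]
out = search_numpy(prices, 1.01)
print(out[:4])  # [9. 8. 8. 9.], the remaining entries are nan
```

The factor is passed as 1.01, the same way as in the `search` call above, rather than as n=1/100.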