Tags: python, python-3.x, pandas, vectorization

Pandas vectorized function to find time to grow n% from starting cell


I have a pandas DataFrame with a time series (in this case, the price of a used car model) and am looking for a vectorized function that maps each cell to the time it takes the price to grow n percent from that cell (returning NaN if it never grows by n%).

It should in theory be possible to execute in a vectorized way, as the output of each row is independent of the rows before it.

Here is a sample of the data and the expected output:

import numpy as np
import pandas as pd
import datetime


df = pd.DataFrame(
    [
        1490.47,
        1492.98,
        1494.69,
        1497.43,
        1499.02,
        1503.29,
        1501.60,
        1502.80,
        1502.30,
        1509.38,
        1512.01,
        1508.98,
        1512.63,
    ],
    columns=['price'],
)

df.index.names = ['time']

n=1/100

So in this case I want to run with n=1/100, i.e. compute for each cell the time it takes the price to increase by 1%. For the first cell, 1% growth would be reached at 1490.47 * 1.01 = 1505.3747; the first cell greater than this value is 1509.38, which is 9 cells after the first cell, so the output for that cell would be 9, and so on for the rest.

Expected output would then be:

df.some_functions(n=1/100)

print(df)

        price
time         
0     9
1     8
2     8
3     nan
4     nan
5     nan
6     nan
7     nan
8     nan
9     nan
10    nan
11    nan
12    nan

The latter 10 are NaN because the price never grows by more than 1% from those cells within the remaining DataFrame.
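For reference, the rule described above can be written as a plain (non-vectorized) double loop. This is only a sketch for checking expected values against, not the vectorized solution being asked for; the function name `time_to_grow` is made up for illustration:

```python
import numpy as np


def time_to_grow(prices, n):
    """For each cell, the number of steps until the price first exceeds
    that cell's price by more than n (NaN if it never does)."""
    out = np.full(len(prices), np.nan)
    for i, p in enumerate(prices):
        target = p * (1 + n)
        # scan forward for the first strictly greater price
        for j in range(i + 1, len(prices)):
            if prices[j] > target:
                out[i] = j - i
                break
    return out
```

It uses a strict `>` to match the question's "greater than" wording.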


Solution

  • Another option is to use numba (you can even easily parallelize the task):

    import numba
    
    
    @numba.njit(parallel=True)
    def search(price, factor, out):
        # each row is independent, so prange can run the rows in parallel
        for idx in numba.prange(len(price)):
            target = price[idx] * factor
            for idx2, v in enumerate(price[idx:]):
                if v >= target:
                    out[idx] = idx2
                    break
    
    
    # fill a separate array, then assign it to the DataFrame; writing
    # through df["out"].values is not guaranteed to update the DataFrame
    out = np.full(len(df), np.nan)
    search(df["price"].to_numpy(), 1.01, out)
    df["out"] = out
    
    print(df)
    

    Prints:

            price  out
    time              
    0     1490.47  9.0
    1     1492.98  8.0
    2     1494.69  8.0
    3     1497.43  9.0
    4     1499.02  NaN
    5     1503.29  NaN
    6     1501.60  NaN
    7     1502.80  NaN
    8     1502.30  NaN
    9     1509.38  NaN
    10    1512.01  NaN
    11    1508.98  NaN
    12    1512.63  NaN
    

    On my AMD 5700x it took ~0.8 seconds to process a DataFrame of 1,000,000 values generated with np.random.uniform(1000, 2000, 1_000_000).
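For comparison, a fully vectorized pure-NumPy sketch is also possible via a pairwise comparison matrix. This is an alternative approach, not part of the answer above, and it needs O(n²) memory, so it only suits moderate series lengths; it uses a strict `>` as in the question's wording:

```python
import numpy as np

prices = np.array([1490.47, 1492.98, 1494.69, 1497.43, 1499.02, 1503.29,
                   1501.60, 1502.80, 1502.30, 1509.38, 1512.01, 1508.98, 1512.63])
n = 1 / 100

# grew[i, j] is True when the price at step j exceeds the price at step i by n
grew = prices[None, :] > prices[:, None] * (1 + n)
# only future cells count: blank out the diagonal and everything before it
grew[np.tril_indices(len(prices))] = False

first = grew.argmax(axis=1)  # column of the first True per row (0 if none)
steps = np.where(grew.any(axis=1), first - np.arange(len(prices)), np.nan)
```

`argmax` on a boolean row returns the first True column, and the `any` mask turns all-False rows into NaN.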