pythonpandasdataframeaccelerometer

Add rows to pandas dataframe in loop


:) Hi everyone!

I wanted to do something very specific. I have some accelerometer data with A LOT of gaps due to hardware malfunctions, and I wanted to artificially fill in those gaps so I can do a Fourier Transform and study the signal.

This is a sample of the data that I'm working with:

import pandas as pd
df = pd.DataFrame({'Time': [0, 100085, 200170, 300255, 400338], 
                   'Value_X': [1.30, 1.32, 1.26, 1.33, 1.28],
                   'Value_Y': [-0.47, -0.49, -0.48, -0.44, -0.48],
                   'Value_Z': [9.66, 9.74, 9.71, 9.71, 9.72]})

and I made a custom_round function to round the column Time up to multiples of 100000, like this:

import math
def custom_round(x, base):
    return int(base * math.ceil(float(x)/base))

df['Time_rounded'] = df['Time'].apply(lambda x: custom_round(x=x, base=100000))

So, the data looks like this:

     Time    Value_X  Value_Y  Value_Z  Time_rounded
0       0       1.30    -0.47     9.66             0
1  100085       1.32    -0.49     9.74        200000
2  200170       1.26    -0.48     9.71        300000
3  300255       1.33    -0.44     9.71        400000
4  400338       1.28    -0.48     9.72        500000

There are some multiples of 100000 that are not on the dataframe, as we can see in the sample, and I wanted to do a task that I assumed was easy to do, but, apparently, it's not. As this can be a very hard task, I divided it into medium task and hard task, because I would be pretty happy if I could do the medium task, but I would prefer to do the hard task if possible.

I created a multiples_list, like this:

max_scaled_down = int(df['Time_rounded'].max() / 100000) # dividing max by 100000

multiples_list = list(range(max_scaled_down)) # [0, 1, 2, 3, 4, 5, ...]
multiples_list = [number*100000 for number in multiples_list] # [0, 100000, 200000, 300000, ...]

MEDIUM TASK:

To do this:

     Time    Value_X  Value_Y  Value_Z  Time_rounded
0       0       1.30    -0.47     9.66             0
1       0       1.30    -0.47     9.66        100000 <-- to insert rows like this
2  100085       1.32    -0.49     9.74        200000
3  200170       1.26    -0.48     9.71        300000
4  300255       1.33    -0.44     9.71        400000
5  400338       1.28    -0.48     9.72        500000

Whenever there is a number on the multiples_list that is not in the column Time_rounded, I wanted to insert a row, that looks exactly like the previous row, except for the Time_rounded value.

AND I wanted that to happen even if there were multiple consecutive "gaps", there would be multiple consecutive duplicate rows, like:

     Time    Value_X  Value_Y  Value_Z  Time_rounded
0       0       1.30    -0.47     9.66             0
1       0       1.30    -0.47     9.66        100000 <-- to insert rows like this
2       0       1.30    -0.47     9.66        200000 <-- to insert rows like this
3  100085       1.32    -0.49     9.74        300000
4  200170       1.26    -0.48     9.71        400000
5  300255       1.33    -0.44     9.71        500000
6  400338       1.28    -0.48     9.72        600000

HARD TASK:

I'm going to give 2 examples of what I ideally would want:

     Time    Value_X  Value_Y  Value_Z  Time_rounded
0       0       1.30    -0.47     9.66             0
1       0       1.31    -0.48     9.70        100000 <-- to insert rows like this
2  100085       1.32    -0.49     9.74        200000
       Time    Value_X  Value_Y  Value_Z  Time_rounded
0         0       1.00    -0.50     7.00        100000 <-- original row
1  25021.25       1.25    -2.75     7.50        200000 <-- to insert rows like this
2  50042.50       1.50    -5.00     8.00        300000 <-- to insert rows like this
3  75063.75       1.75    -7.25     8.50        400000 <-- to insert rows like this
4    100085       2.00    -9.50     9.00        500000 <-- original row

Whenever there is a number on the multiples_list that is not in the column Time_rounded, I wanted to insert rows in the gaps, where the values are proportional to the original rows.

What I've tried:

I tried to add every element to a separate list, so I could later build another code to add them to the list. This was extremely time-consuming (it took about 32 hours to run, since the dataframe has about a million rows) and not very efficient. It also didn't work how I wanted.

time_rounded = []
value_x = []
value_y = []
value_z = []

for i in range(len(multiples_list)):
    
    if multiples_list[i] not in df['Time_rounded'].values: # if the rounded time is not in the Time_rounded column

        print(round(multiples_list[i]/multiples_list[-1], ndigits=4)) #to check progress in terminal
        
        for j in multiples_list[i:]: #to check on values after
            
            if j in df['Time_rounded'].values: #to check on the next value of the list that exists on the Time_rounded column
                
                next_value = j
                break
        
        previous_time = multiples_list[i-1]  
        next_time = next_value
        current_time = multiples_list[i]
        
        previous_x = df['Value_X'][df['Time_rounded']==previous_time].values[0]
        previous_y = df['Value_Y'][df['Time_rounded']==previous_time].values[0]
        previous_z = df['Value_Z'][df['Time_rounded']==previous_time].values[0]
                
        next_x = df['Value_X'][df['Time_rounded']==next_time].values[0]
        next_y = df['Value_Y'][df['Time_rounded']==next_time].values[0]
        next_z = df['Value_Z'][df['Time_rounded']==next_time].values[0]
        
        avg_x = (((next_time-previous_time)/(current_time-previous_time)) * (next_x - previous_x)) + previous_x
        avg_y = (((next_time-previous_time)/(current_time-previous_time)) * (next_y - previous_y)) + previous_y
        avg_z = (((next_time-previous_time)/(current_time-previous_time)) * (next_z - previous_z)) + previous_z      
        
        value_x.append(avg_x) 
        value_y.append(avg_y)
        value_z.append(avg_z)
        time_rounded.append(multiples_list[current_time]) 

This code has an error. Let's say that the value 500000 doesn't exist in Time_rounded, and 600000 doesn't exist either. After my code takes care of 500000 and does the same operations on number 600000, it is using 500000 as previous_time, building an error on this line, because this row is empty:

previous_x = df['Value_X'][df['Time_rounded']==previous_time].values[0]

the error is this:

IndexError: index 0 is out of bounds for axis 0 with size 0

Before there was this error, I ran this code without any of the .values[0], and it ran for 32 hours, and most of the results on the list were NaN. When I added the .values[0] I got this error.

Sorry for all the information, but I wanted to build a reproducible question so I could get better help. Thank you in advance to everyone that helps in any way. :)


Solution

  • I'm going to start with a slightly modified copy of your data, to better illustrate how each solution handles a gap of multiples rows.

    Here's the starting data:

    import pandas as pd
    import math
    df = pd.DataFrame({'Time': [0, 200085, 300170, 400255, 500338], 
                       'Value_X': [1.30, 1.32, 1.26, 1.33, 1.28],
                       'Value_Y': [-0.47, -0.49, -0.48, -0.44, -0.48],
                       'Value_Z': [9.66, 9.74, 9.71, 9.71, 9.72]})
    
    def custom_round(x, base):
        return int(base * math.ceil(float(x)/base))
    
    df['Time_rounded'] = df['Time'].apply(lambda x: custom_round(x=x, base=100000))
    

    Here's the dataframe that this code creates:

         Time  Value_X  Value_Y  Value_Z  Time_rounded
    0       0     1.30    -0.47     9.66             0
    1  200085     1.32    -0.49     9.74        300000
    2  300170     1.26    -0.48     9.71        400000
    3  400255     1.33    -0.44     9.71        500000
    4  500338     1.28    -0.48     9.72        600000
    

    MEDIUM TASK:

    Sounds like you want a merge followed by a forward-fill. Here's an example of how to do that.

    step = 100000
    max_time = df['Time_rounded'].max()
    all_timesteps = pd.DataFrame({'Time_rounded': range(0, max_time + 1, step)})
    all_timesteps = all_timesteps.merge(df, how='left')
    ffill_cols = ['Value_X', 'Value_Y', 'Value_Z']
    all_timesteps.loc[:, ffill_cols] = all_timesteps.loc[:, ffill_cols].ffill()
    all_timesteps['Time'] = all_timesteps['Time'].fillna(all_timesteps['Time_rounded'])
    print(all_timesteps)
    

    Output:

       Time_rounded      Time  Value_X  Value_Y  Value_Z
    0             0       0.0     1.30    -0.47     9.66
    1        100000  100000.0     1.30    -0.47     9.66
    2        200000  200000.0     1.30    -0.47     9.66
    3        300000  200085.0     1.32    -0.49     9.74
    4        400000  300170.0     1.26    -0.48     9.71
    5        500000  400255.0     1.33    -0.44     9.71
    6        600000  500338.0     1.28    -0.48     9.72
    

    Explanation:

    HARD TASK:

    Sounds like you want the same thing as before, but with linear interpolation instead of forward fill. Fortunately, Pandas has a method, .interpolate(), which makes this easy.

    step = 100000
    max_time = df['Time_rounded'].max()
    all_timesteps = pd.DataFrame({'Time_rounded': range(0, max_time + 1, step)})
    all_timesteps = all_timesteps.merge(df, how='left')
    interp_cols = ['Value_X', 'Value_Y', 'Value_Z', 'Time']
    all_timesteps.loc[:, interp_cols] = all_timesteps.loc[:, interp_cols].interpolate()
    print(all_timesteps)
    

    Output:

       Time_rounded      Time   Value_X   Value_Y   Value_Z
    0             0       0.0  1.300000 -0.470000  9.660000
    1        100000   66695.0  1.306667 -0.476667  9.686667
    2        200000  133390.0  1.313333 -0.483333  9.713333
    3        300000  200085.0  1.320000 -0.490000  9.740000
    4        400000  300170.0  1.260000 -0.480000  9.710000
    5        500000  400255.0  1.330000 -0.440000  9.710000
    6        600000  500338.0  1.280000 -0.480000  9.720000