:) Hi everyone!
I wanted to do something very specific. I have some accelerometer data with A LOT of gaps due to hardware malfunctions, and I wanted to artificially fill in those gaps so I can do a Fourier Transform and study the signal.
This is a sample of the data that I'm working with:
import pandas as pd
df = pd.DataFrame({'Time': [0, 100085, 200170, 300255, 400338],
'Value_X': [1.30, 1.32, 1.26, 1.33, 1.28],
'Value_Y': [-0.47, -0.49, -0.48, -0.44, -0.48],
'Value_Z': [9.66, 9.74, 9.71, 9.71, 9.72]})
and I made a custom_round
function to round the column Time
up to multiples of 100000, like this:
import math
def custom_round(x, base):
return int(base * math.ceil(float(x)/base))
df['Time_rounded'] = df['Time'].apply(lambda x: custom_round(x=x, base=100000))
So, the data looks like this:
Time Value_X Value_Y Value_Z Time_rounded
0 0 1.30 -0.47 9.66 0
1 100085 1.32 -0.49 9.74 200000
2 200170 1.26 -0.48 9.71 300000
3 300255 1.33 -0.44 9.71 400000
4 400338 1.28 -0.48 9.72 500000
There are some multiples of 100000 that are not on the dataframe, as we can see in the sample, and I wanted to do a task that I assumed was easy to do, but, apparently, it's not. As this can be a very hard task, I divided it into medium task and hard task, because I would be pretty happy if I could do the medium task, but I would prefer to do the hard task if possible.
I created a multiples_list
, like this:
max_scaled_down = int(df['Time_rounded'].max() / 100000) # dividing max by 100000
multiples_list = list(range(max_scaled_down)) # [0, 1, 2, 3, 4, 5, ...]
multiples_list = [number*100000 for number in multiples_list] # [0, 100000, 200000, 300000, ...]
MEDIUM TASK:
To do this:
Time Value_X Value_Y Value_Z Time_rounded
0 0 1.30 -0.47 9.66 0
1 0 1.30 -0.47 9.66 100000 <-- to insert rows like this
2 100085 1.32 -0.49 9.74 200000
3 200170 1.26 -0.48 9.71 300000
4 300255 1.33 -0.44 9.71 400000
5 400338 1.28 -0.48 9.72 500000
Whenever there is a number on the multiples_list
that is not in the column Time_rounded
, I wanted to insert a row, that looks exactly like the previous row, except for the Time_rounded
value.
AND I wanted that to happen even if there were multiple consecutive "gaps", there would be multiple consecutive duplicate rows, like:
Time Value_X Value_Y Value_Z Time_rounded
0 0 1.30 -0.47 9.66 0
1 0 1.30 -0.47 9.66 100000 <-- to insert rows like this
2 0 1.30 -0.47 9.66 200000 <-- to insert rows like this
3 100085 1.32 -0.49 9.74 300000
4 200170 1.26 -0.48 9.71 400000
5 300255 1.33 -0.44 9.71 500000
6 400338 1.28 -0.48 9.72 600000
HARD TASK:
I'm going to give 2 examples of what I ideally would want:
Time Value_X Value_Y Value_Z Time_rounded
0 0 1.30 -0.47 9.66 0
1 0 1.31 -0.48 9.70 100000 <-- to insert rows like this
2 100085 1.32 -0.49 9.74 200000
Time Value_X Value_Y Value_Z Time_rounded
0 0 1.00 -0.50 7.00 100000 <-- original row
1 25021.25 1.25 -2.75 7.50 200000 <-- to insert rows like this
2 50042.50 1.50 -5.00 8.00 300000 <-- to insert rows like this
3 75063.75 1.75 -7.25 8.50 400000 <-- to insert rows like this
4 100085 2.00 -9.50 9.00 500000 <-- original row
Whenever there is a number on the multiples_list
that is not in the column Time_rounded
, I wanted to insert rows in the gaps, where the values are proportional to the original rows.
What I've tried:
I tried to add every element to a separate list, so I could later build another code to add them to the list. This was extremely time-consuming (it took about 32 hours to run, since the dataframe has about a million rows) and not very efficient. It also didn't work how I wanted.
time_rounded = []
value_x = []
value_y = []
value_z = []
for i in range(len(multiples_list)):
if multiples_list[i] not in df['Time_rounded'].values: # if the rounded time is not in the Time_rounded column
print(round(multiples_list[i]/multiples_list[-1], ndigits=4)) #to check progress in terminal
for j in multiples_list[i:]: #to check on values after
if j in df['Time_rounded'].values: #to check on the next value of the list that exists on the Time_rounded column
next_value = j
break
previous_time = multiples_list[i-1]
next_time = next_value
current_time = multiples_list[i]
previous_x = df['Value_X'][df['Time_rounded']==previous_time].values[0]
previous_y = df['Value_Y'][df['Time_rounded']==previous_time].values[0]
previous_z = df['Value_Z'][df['Time_rounded']==previous_time].values[0]
next_x = df['Value_X'][df['Time_rounded']==next_time].values[0]
next_y = df['Value_Y'][df['Time_rounded']==next_time].values[0]
next_z = df['Value_Z'][df['Time_rounded']==next_time].values[0]
avg_x = (((next_time-previous_time)/(current_time-previous_time)) * (next_x - previous_x)) + previous_x
avg_y = (((next_time-previous_time)/(current_time-previous_time)) * (next_y - previous_y)) + previous_y
avg_z = (((next_time-previous_time)/(current_time-previous_time)) * (next_z - previous_z)) + previous_z
value_x.append(avg_x)
value_y.append(avg_y)
value_z.append(avg_z)
time_rounded.append(multiples_list[current_time])
This code has an error. Let's say that the value 500000 doesn't exist in Time_rounded
, and 600000 doesn't exist either. After my code takes care of 500000 and does the same operations on number 600000, it is using 500000 as previous_time
, building an error on this line, because this row is empty:
previous_x = df['Value_X'][df['Time_rounded']==previous_time].values[0]
the error is this:
IndexError: index 0 is out of bounds for axis 0 with size 0
Before there was this error, I ran this code without any of the .values[0]
, and it ran for 32 hours, and most of the results on the list were NaN
. When I added the .values[0]
I got this error.
Sorry for all the information, but I wanted to build a reproducible question so I could get better help. Thank you in advance to everyone that helps in any way. :)
I'm going to start with a slightly modified copy of your data, to better illustrate how each solution handles a gap of multiples rows.
Here's the starting data:
import pandas as pd
import math
df = pd.DataFrame({'Time': [0, 200085, 300170, 400255, 500338],
'Value_X': [1.30, 1.32, 1.26, 1.33, 1.28],
'Value_Y': [-0.47, -0.49, -0.48, -0.44, -0.48],
'Value_Z': [9.66, 9.74, 9.71, 9.71, 9.72]})
def custom_round(x, base):
return int(base * math.ceil(float(x)/base))
df['Time_rounded'] = df['Time'].apply(lambda x: custom_round(x=x, base=100000))
Here's the dataframe that this code creates:
Time Value_X Value_Y Value_Z Time_rounded
0 0 1.30 -0.47 9.66 0
1 200085 1.32 -0.49 9.74 300000
2 300170 1.26 -0.48 9.71 400000
3 400255 1.33 -0.44 9.71 500000
4 500338 1.28 -0.48 9.72 600000
MEDIUM TASK:
Sounds like you want a merge followed by a forward-fill. Here's an example of how to do that.
step = 100000
max_time = df['Time_rounded'].max()
all_timesteps = pd.DataFrame({'Time_rounded': range(0, max_time + 1, step)})
all_timesteps = all_timesteps.merge(df, how='left')
ffill_cols = ['Value_X', 'Value_Y', 'Value_Z']
all_timesteps.loc[:, ffill_cols] = all_timesteps.loc[:, ffill_cols].ffill()
all_timesteps['Time'] = all_timesteps['Time'].fillna(all_timesteps['Time_rounded'])
print(all_timesteps)
Output:
Time_rounded Time Value_X Value_Y Value_Z
0 0 0.0 1.30 -0.47 9.66
1 100000 100000.0 1.30 -0.47 9.66
2 200000 200000.0 1.30 -0.47 9.66
3 300000 200085.0 1.32 -0.49 9.74
4 400000 300170.0 1.26 -0.48 9.71
5 500000 400255.0 1.33 -0.44 9.71
6 600000 500338.0 1.28 -0.48 9.72
Explanation:
range()
. Use the max plus 1 because range is not inclusive.HARD TASK:
Sounds like you want the same thing as before, but with linear interpolation instead of forward fill. Fortunately, Pandas has a method, .interpolate()
, which makes this easy.
step = 100000
max_time = df['Time_rounded'].max()
all_timesteps = pd.DataFrame({'Time_rounded': range(0, max_time + 1, step)})
all_timesteps = all_timesteps.merge(df, how='left')
interp_cols = ['Value_X', 'Value_Y', 'Value_Z', 'Time']
all_timesteps.loc[:, interp_cols] = all_timesteps.loc[:, interp_cols].interpolate()
print(all_timesteps)
Output:
Time_rounded Time Value_X Value_Y Value_Z
0 0 0.0 1.300000 -0.470000 9.660000
1 100000 66695.0 1.306667 -0.476667 9.686667
2 200000 133390.0 1.313333 -0.483333 9.713333
3 300000 200085.0 1.320000 -0.490000 9.740000
4 400000 300170.0 1.260000 -0.480000 9.710000
5 500000 400255.0 1.330000 -0.440000 9.710000
6 600000 500338.0 1.280000 -0.480000 9.720000