I would like to filter a DataFrame by several conditions, but stop after the first block of matching rows. Probably this is nothing new; I just did not find the right answer in other posts.
The example Dataframe:
import pandas as pd
df_GPS = pd.DataFrame([['2024-06-21 06:22:38', 22958, 605.968389, 1, 2, 1],
['2024-06-21 06:22:39', 22959, 606.009398, 3, 0, 1],
['2024-06-21 06:22:40', 22960, 605.630573, 1, 2, 0],
['2024-06-21 06:22:41', 22961, 605.476367, 3, 3, 0],
['2024-06-21 06:22:42', 22962, 605.322161, 2, 1, 1],
['2024-06-21 06:22:43', 22963, 605.268389, 4, 1, 0],
['2024-06-21 06:22:44', 22964, 605.559398, 1, 3, 1],
['2024-06-21 06:22:45', 22965, 606.630573, 2, 9 , 0],
['2024-06-21 06:22:46', 22966, 607.476367, 15, 13, 3],
['2024-06-21 06:22:47', 22967, 609.322161, 23, 19, 12],
['2024-06-21 06:22:48', 22968, 607.155939, 20, 21, 16],
['2024-06-21 06:22:49', 22969, 606.763057, 18, 14, 8],
['2024-06-21 06:22:50', 22970, 605.333781, 1, 1, 1],
['2024-06-21 06:22:50', 22971, 604.333781, 15, 1, 1]
], columns=['time', '__UTCs__','Altitude', 's01[m]', 's5.5[m]', 's10[m]'])
df_GPS
time __UTCs__ Altitude s01[m] s5.5[m] s10[m]
0 2024-06-21 06:22:38 22958 605.968389 1 2 1
1 2024-06-21 06:22:39 22959 606.009398 3 0 1
2 2024-06-21 06:22:40 22960 605.630573 1 2 0
3 2024-06-21 06:22:41 22961 605.476367 3 3 0
4 2024-06-21 06:22:42 22962 605.322161 2 1 1
5 2024-06-21 06:22:43 22963 605.268389 4 1 0
6 2024-06-21 06:22:44 22964 605.559398 1 3 1
7 2024-06-21 06:22:45 22965 606.630573 2 9 0
8 2024-06-21 06:22:46 22966 607.476367 15 13 3
9 2024-06-21 06:22:47 22967 609.322161 23 19 12
10 2024-06-21 06:22:48 22968 607.155939 20 21 16
11 2024-06-21 06:22:49 22969 606.763057 18 14 8
12 2024-06-21 06:22:50 22970 605.333781 1 1 1
13 2024-06-21 06:22:50 22971 604.333781 15 1 1
The result I am aiming at looks like:
time __UTCs__ Altitude s01[m] s5.5[m] s10[m]
1 2024-06-21 06:22:40 22960 605.630573 1 2 0
2 2024-06-21 06:22:41 22961 605.476367 3 3 0
3 2024-06-21 06:22:42 22962 605.322161 2 1 1
4 2024-06-21 06:22:43 22963 605.268389 4 1 0
5 2024-06-21 06:22:44 22964 605.559398 1 3 1
6 2024-06-21 06:22:45 22965 606.630573 2 9 0
7 2024-06-21 06:22:46 22966 607.476367 15 13 3
I tried query
(what I thought would be the most elegant way):
df_sub = df_GPS.query('__UTCs__ >= 22960 & s01[m] < 16')
which raises an UndefinedVariableError: name 's01' is not defined
maybe because of the underscores or the brackets in the column names? How do I tell query that these are columns of df_GPS?
On the other hand,
df_sub = df_GPS[((df_GPS['__UTCs__'] >= 22960) & (df_GPS['s01[m]'] < 16))].copy()
results in:
time __UTCs__ Altitude s01[m] s5.5[m] s10[m]
2 2024-06-21 06:22:40 22960 605.630573 1 2 0
3 2024-06-21 06:22:41 22961 605.476367 3 3 0
4 2024-06-21 06:22:42 22962 605.322161 2 1 1
5 2024-06-21 06:22:43 22963 605.268389 4 1 0
6 2024-06-21 06:22:44 22964 605.559398 1 3 1
7 2024-06-21 06:22:45 22965 606.630573 2 9 0
8 2024-06-21 06:22:46 22966 607.476367 15 13 3
12 2024-06-21 06:22:50 22970 605.333781 1 1 1
13 2024-06-21 06:22:50 22971 604.333781 15 1 1
works in principle, but it keeps every row meeting the criteria, including rows 12 and 13 after the gap. I want the selection to stop after the first contiguous block of rows meeting all criteria. Is there a way without undertaking a groupby on ['s01[m]']?
The last way I tried is loc. I hoped this would also reset the index, but it results in essentially the same rows:
df_sub = df_GPS.loc[(df_GPS['__UTCs__'] >= 0) & (df_GPS['s01[m]'] <= 16)]
time __UTCs__ Altitude s01[m] s5.5[m] s10[m]
0 2024-06-21 06:22:38 22958 605.968389 1 2 1
1 2024-06-21 06:22:39 22959 606.009398 3 0 1
2 2024-06-21 06:22:40 22960 605.630573 1 2 0
3 2024-06-21 06:22:41 22961 605.476367 3 3 0
4 2024-06-21 06:22:42 22962 605.322161 2 1 1
5 2024-06-21 06:22:43 22963 605.268389 4 1 0
6 2024-06-21 06:22:44 22964 605.559398 1 3 1
7 2024-06-21 06:22:45 22965 606.630573 2 9 0
8 2024-06-21 06:22:46 22966 607.476367 15 13 3
12 2024-06-21 06:22:50 22970 605.333781 1 1 1
13 2024-06-21 06:22:50 22971 604.333781 15 1 1
How can I make the selection stop after the first matching block? With a while-loop?
You can use cummin
to compute your second condition: on a boolean Series, cummin stays True up to the first False and is False from then on, so it cuts off everything after the first block:
df_GPS[df_GPS['__UTCs__'].ge(22960) & df_GPS['s01[m]'].lt(16).cummin()]
Output:
time __UTCs__ Altitude s01[m] s5.5[m] s10[m]
2 2024-06-21 06:22:40 22960 605.630573 1 2 0
3 2024-06-21 06:22:41 22961 605.476367 3 3 0
4 2024-06-21 06:22:42 22962 605.322161 2 1 1
5 2024-06-21 06:22:43 22963 605.268389 4 1 0
6 2024-06-21 06:22:44 22964 605.559398 1 3 1
7 2024-06-21 06:22:45 22965 606.630573 2 9 0
8 2024-06-21 06:22:46 22966 607.476367 15 13 3
Intermediates:
__UTCs__ s01[m] __UTCs__ >= 22960 s01[m] < 16 (s01[m] < 16).cummin() &
0 22958 1 False True True False
1 22959 3 False True True False
2 22960 1 True True True True
3 22961 3 True True True True
4 22962 2 True True True True
5 22963 4 True True True True
6 22964 1 True True True True
7 22965 2 True True True True
8 22966 15 True True True True
9 22967 23 True False False False
10 22968 20 True False False False
11 22969 18 True False False False
12 22970 1 True True False False
13 22971 15 True True False False
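As an aside, the UndefinedVariableError in the question comes from the brackets in the column name (the underscores are fine): query requires backtick-quoting for column names that are not valid Python identifiers. With that, the boolean-mask filter can be written in query form; note this alone still keeps the later rows (12-13), so the cummin trick above is still needed. A minimal sketch on a reduced frame:

```python
import pandas as pd

df_GPS = pd.DataFrame(
    {'__UTCs__': [22958, 22959, 22960, 22970],
     's01[m]':   [1, 3, 1, 15]}
)

# Names containing special characters such as '[' or ']' must be
# wrapped in backticks inside a query expression:
df_sub = df_GPS.query('__UTCs__ >= 22960 and `s01[m]` < 16')
print(df_sub)  # keeps the rows at index 2 and 3
```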
A potentially more robust approach if you have many conditions and want the first stretch of all True:
m = df_GPS['__UTCs__'].ge(22960) & df_GPS['s01[m]'].lt(16)
m2 = m.ne(m.shift(fill_value=m.iloc[0])).cumsum().eq(1) & m
out = df_GPS[m2]
Intermediates:
__UTCs__ s01[m] m shift ne cumsum eq(1) & m
0 22958 1 False False False 0 False False
1 22959 3 False False False 0 False False
2 22960 1 True False True 1 True True
3 22961 3 True True False 1 True True
4 22962 2 True True False 1 True True
5 22963 4 True True False 1 True True
6 22964 1 True True False 1 True True
7 22965 2 True True False 1 True True
8 22966 15 True True False 1 True True
9 22967 23 False True True 2 False False
10 22968 20 False False False 2 False False
11 22969 18 False False False 2 False False
12 22970 1 True False True 3 False False
13 22971 15 True True False 3 False False
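If this pattern comes up often, the first-True-stretch logic can be wrapped in a small helper (first_true_run is a made-up name, not a pandas function). It labels each run of equal values with a group number via shift/cumsum, then keeps only the first group that is True:

```python
import pandas as pd

def first_true_run(mask: pd.Series) -> pd.Series:
    """Keep only the first contiguous run of True values in a boolean Series."""
    # A new group starts wherever the value differs from the previous row.
    groups = mask.ne(mask.shift()).cumsum()
    if not mask.any():
        return mask  # no True values at all: nothing to keep
    first_true_group = groups[mask].iloc[0]
    return mask & groups.eq(first_true_group)

# Reduced example data following the question's layout
df_GPS = pd.DataFrame({'__UTCs__': [22958, 22960, 22961, 22967, 22970],
                       's01[m]':   [1, 1, 3, 23, 1]})

m = df_GPS['__UTCs__'].ge(22960) & df_GPS['s01[m]'].lt(16)
out = df_GPS[first_true_run(m)]  # only the first block (22960, 22961)
```

Unlike the cummin version, this also works when the combined mask starts with a False-True-False prefix before the stretch you want, since it keys off run boundaries rather than the first False.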