I've got sequence(s) of points with coordinates and other attributes that I'd like to convert to lines (shapely LineStrings).
The pandas DataFrame looks like this:
    Path  locIdx       Arr       Dep   PostLength     Long      Lat                 geometry
0  32613       1       NaT  05:00:00  219.0326572  -1.3473  53.9396  POINT (-1.3473 53.9396)
1  32613       2  05:02:00  05:02:00   181.020583  -1.3433  53.9338  POINT (-1.3433 53.9338)
2  32613       3  05:03:00  05:03:00  440.4625762  -1.3435  53.9322  POINT (-1.3435 53.9322)
3  32613       4  05:05:00  05:05:00  551.3486222  -1.3454  53.9285  POINT (-1.3454 53.9285)
4  32613       5  05:06:00  05:06:00   575.912064   -1.347  53.9272   POINT (-1.347 53.9272)
5  32613       6  05:07:00       NaT          nan  -1.3519  53.9299  POINT (-1.3519 53.9299)
The conversion will obviously produce one line fewer than the number of points in the sequence, but I'd like to keep the point attributes (like PostLength) and also calculate some additional ones (like timeDiff = Arr - Dep, where Arr is taken from the next item).
The ideas I've had include duplicating each record (row) and grouping the copies with something like what is described here:
geo_df = geo_df.groupby(['Path', 'locIdx'])['geometry'].apply(lambda x: LineString(x.tolist()))
But this solution doesn't seem ideal (especially when calculating the difference of two attributes from different rows), and I have a feeling that something better can be done. Maybe I should just iterate over the DataFrame?
You have a couple of options. This solution uses itertools.pairwise (available since Python 3.10) and the .diff() method.
To recreate your data:
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

data = {
    'Path': [32613, 32613, 32613, 32613, 32613, 32613],
    'locIdx': [1, 2, 3, 4, 5, 6],
    'Arr': ['05:00:00', '05:02:00', '05:03:00', '05:05:00', '05:06:00', '05:07:00'],
    'PostLength': [219.0326572, 181.020583, 440.4625762, 551.3486222, 575.912064, None],
    'Long': [-1.3473, -1.3433, -1.3435, -1.3454, -1.347, -1.3519],
    'Lat': [53.9396, 53.9338, 53.9322, 53.9285, 53.9272, 53.9299],
}
df = pd.DataFrame(data)
# Build one Point per row from the (Long, Lat) pairs
geometry = [Point(xy) for xy in zip(df['Long'], df['Lat'])]
geo_df = gpd.GeoDataFrame(df, geometry=geometry)
And use this to get the desired result:
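import itertools
import shapely  # shapely.LineString at the top level needs shapely 2.0+

# n points yield n - 1 segments; timeDiff here is the difference between consecutive Arr values
gdf_diff = gpd.GeoDataFrame({"timeDiff": pd.to_datetime(geo_df["Arr"]).diff()[1:].reset_index(drop=True),
                             "PostLength": geo_df["PostLength"].iloc[:-1]},
                            geometry=[shapely.LineString(points) for points in itertools.pairwise(geo_df.geometry)])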
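If the frame holds several Path values, or if timeDiff should be the next point's Arr minus the current point's Dep as the question describes, a per-group shift(-1) is one possible variant. This is only a sketch: it assumes the times can be parsed with pd.to_timedelta, that locIdx gives the order within each Path, and that the data matches the sample rows in the question. The names segments, lines, nextLong and nextLat are introduced here for illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString

# Sample rows from the question, including the Dep column.
df = pd.DataFrame({
    "Path": [32613] * 6,
    "locIdx": [1, 2, 3, 4, 5, 6],
    "Arr": [None, "05:02:00", "05:03:00", "05:05:00", "05:06:00", "05:07:00"],
    "Dep": ["05:00:00", "05:02:00", "05:03:00", "05:05:00", "05:06:00", None],
    "PostLength": [219.0326572, 181.020583, 440.4625762, 551.3486222, 575.912064, None],
    "Long": [-1.3473, -1.3433, -1.3435, -1.3454, -1.347, -1.3519],
    "Lat": [53.9396, 53.9338, 53.9322, 53.9285, 53.9272, 53.9299],
})
# Assumption: times are durations since midnight, so timedelta parsing is enough.
df["Arr"] = pd.to_timedelta(df["Arr"])
df["Dep"] = pd.to_timedelta(df["Dep"])

# Assumption: locIdx defines the order of points within each Path.
df = df.sort_values(["Path", "locIdx"])
grouped = df.groupby("Path")

# Look ahead to the next point of the same path only.
segments = df.assign(
    timeDiff=grouped["Arr"].shift(-1) - df["Dep"],
    nextLong=grouped["Long"].shift(-1),
    nextLat=grouped["Lat"].shift(-1),
)
# The last point of each path has no successor, so it starts no segment.
segments = segments.dropna(subset=["nextLong"]).reset_index(drop=True)

geometry = [
    LineString([(x0, y0), (x1, y1)])
    for x0, y0, x1, y1 in zip(
        segments["Long"], segments["Lat"], segments["nextLong"], segments["nextLat"]
    )
]
lines = gpd.GeoDataFrame(
    segments[["Path", "locIdx", "PostLength", "timeDiff"]], geometry=geometry
)
print(lines)
```

Each resulting row keeps the attributes of the segment's starting point, and because the shift happens inside each Path group, no segment ever joins the end of one path to the start of another.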