group-by geopandas shapely multilinestring

Convert sequence of points (pandas df) to lines


I have one or more sequences of points, with coordinates and other attributes, that I'd like to convert to lines (shapely LineStrings). The pandas DataFrame looks like this:

    Path   locIdx  Arr       Dep       PostLength   Long     Lat      geometry
0   32613  1       NaT       05:00:00  219.0326572  -1.3473  53.9396  POINT (-1.3473 53.9396)
1   32613  2       05:02:00  05:02:00  181.020583   -1.3433  53.9338  POINT (-1.3433 53.9338)
2   32613  3       05:03:00  05:03:00  440.4625762  -1.3435  53.9322  POINT (-1.3435 53.9322)
3   32613  4       05:05:00  05:05:00  551.3486222  -1.3454  53.9285  POINT (-1.3454 53.9285)
4   32613  5       05:06:00  05:06:00  575.912064   -1.347   53.9272  POINT (-1.347 53.9272)
5   32613  6       05:07:00  NaT       nan          -1.3519  53.9299  POINT (-1.3519 53.9299)

The conversion would obviously produce one fewer line than there are points in the sequence, but I'd like to keep the point attributes (like PostLength) and also calculate some additional ones (like timeDiff = Arr - Dep, where Arr is taken from the next row).

The ideas I've had include duplicating each record (row) and grouping the rows with something like this:

geo_df = geo_df.groupby(['Path', 'locIdx'])['geometry'].apply(lambda x: LineString(x.tolist()))

But this solution doesn't seem ideal (especially when calculating the difference of two attributes from different rows), and I have a feeling that something better can be done. Maybe I should just iterate over the DataFrame?


Solution

  • There are a couple of options. This solution uses itertools.pairwise (available from Python 3.10) and the pandas .diff() method.

    To recreate your data:

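    import itertools

    import geopandas as gpd
    import pandas as pd
    import shapely
    from shapely.geometry import Point

    data = {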
        'Path': [32613, 32613, 32613, 32613, 32613, 32613],
        'locIdx': [1, 2, 3, 4, 5, 6],
        'Arr': ['05:00:00', '05:02:00', '05:03:00', '05:05:00', '05:06:00', '05:07:00'],
        'PostLength': [219.0326572, 181.020583, 440.4625762, 551.3486222, 575.912064, None],
        'Long': [-1.3473, -1.3433, -1.3435, -1.3454, -1.347, -1.3519],
        'Lat': [53.9396, 53.9338, 53.9322, 53.9285, 53.9272, 53.9299],
    }
    df = pd.DataFrame(data)
    
    geometry = [Point(xy) for xy in zip(df['Long'], df['Lat'])]
    geo_df = gpd.GeoDataFrame(df, geometry=geometry)
    

    And use this to get the desired result:

    # Each segment joins a point to its successor, so there is one fewer row than points.
    gdf_diff = gpd.GeoDataFrame(
        {"timeDiff": pd.to_datetime(geo_df.Arr).diff()[1:].reset_index(drop=True),
         "PostLength": geo_df["PostLength"].iloc[:-1]},
        geometry=[shapely.LineString(points) for points in itertools.pairwise(geo_df.geometry)])
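    The snippet above treats the whole frame as a single path and measures time between consecutive Arr values. If the frame holds several Path values, or if timeDiff should be "next Arr minus this Dep" as in the question, a per-group shift(-1) is one possible alternative. The following is only a sketch under some assumptions: the Dep values are copied from the first three rows of the question, the second path (99999) and its values are invented for illustration, and the column names fromIdx, fromLong, fromLat, toLong and toLat are hypothetical. It uses pandas only; each output row holds the two endpoints of a segment.

```python
import pandas as pd

# First three rows of the question's path, plus a made-up second path (99999).
df = pd.DataFrame({
    "Path": [32613, 32613, 32613, 99999, 99999],
    "locIdx": [1, 2, 3, 1, 2],
    "Arr": [None, "05:02:00", "05:03:00", None, "06:10:00"],
    "Dep": ["05:00:00", "05:02:00", "05:03:00", "06:00:00", None],
    "PostLength": [219.0326572, 181.020583, 440.4625762, 100.0, None],
    "Long": [-1.3473, -1.3433, -1.3435, -1.3600, -1.3610],
    "Lat": [53.9396, 53.9338, 53.9322, 53.9400, 53.9410],
})

# Make sure points are in travel order within each path.
df = df.sort_values(["Path", "locIdx"]).reset_index(drop=True)
df["Arr"] = pd.to_timedelta(df["Arr"])
df["Dep"] = pd.to_timedelta(df["Dep"])

# For every row, look up the next row of the same Path (NaN/NaT at the path end).
nxt = df.groupby("Path")[["Arr", "Long", "Lat"]].shift(-1)

segments = pd.DataFrame({
    "Path": df["Path"],
    "fromIdx": df["locIdx"],
    "PostLength": df["PostLength"],        # attribute of the segment's start point
    "timeDiff": nxt["Arr"] - df["Dep"],    # next arrival minus this departure
    "fromLong": df["Long"],
    "fromLat": df["Lat"],
    "toLong": nxt["Long"],
    "toLat": nxt["Lat"],
})

# The last point of each path has no successor, so it starts no segment.
segments = segments[nxt["Long"].notna()].reset_index(drop=True)
print(segments)
```

    Each remaining row can then be turned into a geometry with LineString([(fromLong, fromLat), (toLong, toLat)]) and passed to gpd.GeoDataFrame as in the answer above. Because the shift happens inside groupby, no segment is ever drawn from the end of one path to the start of the next.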