python-3.xpandaspandas-groupbyvaex

Pandas how to bin and groupby without categorical range of values


I have a large number of latitude and longitude values that I would like to bin together in order to display them on a heatmap (ipyleaflet only seems to allow 2000 or so points in the heatmap and this would also be much more efficient when using big data).

I am actually using vaex, but a answer in terms of pandas is also fine.

The pandas pd.cut function seems to be helpful in terms of binning, however it produces a categorical column (category dtype) which looks like a list of all the values in the bin. Is there some way of changing this to just be an incremental number identifying each bin (thanks jezreal for that part of the answer)? I just need a bin number in order to then groupby the bin number and average (mean) on the latitude and longitude columns. I also need a count for the intensity of the heatmap entry.

For example:

dft = pd.DataFrame({
    'latitude': [1.5, 0.5, 1.2, 0.9, 3],
    'longitude': [3, 0.2, 2, 0.2, 1.1]
    })

dft['bin'] = pd.cut(dft['latitude'], bins=3, labels=False).astype(str) + "_" + pd.cut(dft['longitude'], bins=3, labels=False).astype(str)

dft.groupby('bin').agg(['mean', 'count']).unstack()

Almost gives me the answer, but I think I want this output instead:

bin latitude_mean longitude_mean count
0_0 0.7           0.2            2
0_1 1.2           2.0            1
1_2 1.5           3.0            1
2_0 3.0           1.1            1 

It would be helpful if the count can be normalized between 1 and 1000.

How can I use pandas pd.cut or something else to groupby the bin in the rows, average on the latitude and longitude and (heatmap intensity) count in the columns?


Solution

  • The pandas pd.cut function seems to be helpful in terms of binning, however it produces a categorical column (category dtype) which looks like a list of all the values in the bin. Is there some way of changing this to just be an incremental number identifying each bin

    Yes, use label=False parameter in cut:

    labels array or False, default None
    Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins.

    Last use GroupBy.agg for aggregation and last normalize count column:

    df = dft.groupby('bin').agg(latitude_mean=('latitude','mean'),
                                longitude_mean=('longitude','mean'),
                                count=('latitude','count'))
    
    #https://stackoverflow.com/a/50028155/2901002
    a, b = 1, 1000
    x, y = df['count'].min(),df['count'].max()
    df['count'] = (df['count'] - x) / (y - x) * (b - a) + a
    
    print (df)
    
         latitude_mean  longitude_mean   count
    bin                                       
    0_0            0.7             0.2  1000.0
    0_1            1.2             2.0     1.0
    1_2            1.5             3.0     1.0
    2_0            3.0             1.1     1.0