Tags: dask-dataframe, h3

Apply h3.string_to_h3 on a dask.dataframe through map_partitions


I would like to ask how to use dd.map_partitions with the h3.string_to_h3 function. My dataframe looks like this:

                     h3        lat        lon              x             y  elevation
    2   8ca80c8e91015ff -23.068134 -52.042272  393235.906794  7.448557e+06
    3   8ca80c8ecadd1ff -23.095896 -52.031107  394401.401086  7.445492e+06
    4   8ca80cbb455b1ff -23.052007 -52.055948  391822.030340  7.450333e+06
    5   8ca80cbb6a06dff -23.045227 -52.049591  392468.007662  7.451088e+06
    6   8ca80c85876e9ff -23.077720 -52.085169  388849.315388  7.447464e+06

If this were pandas, I could simply use the apply function to get the hexagon index: df['h3'].apply(h3.string_to_h3). But what if I have a large dataset and want to use dd.map_partitions?

I have tried df['h3'].apply(h3.string_to_h3), df['h3'].map_partitions(h3.string_to_h3, meta={'hexagons': 'int64'}), and df['h3'].map_partitions(h3.string_to_h3, axis=1, meta={'hexagons': 'int64'}). None of them works.

Could someone here tell me how to resolve this issue?

Thanks


Solution

  • I think map_partitions does what it says on the tin: it applies a mapping function that receives each partition (a plain pandas DataFrame) as input, and you can manipulate that partition however you like inside the function. This is also why passing h3.string_to_h3 to map_partitions directly fails: string_to_h3 expects a single string, not a whole partition or Series.

    I haven't tested the code below, but I believe this should work:

    import numpy as np
    import h3

    df['h3'] = df.map_partitions(
      # each partition arrives as a plain pandas DataFrame
      lambda partition: partition['h3'].apply(h3.string_to_h3),
      meta=('h3', np.uint64),
    )
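
    For completeness, here is an untested end-to-end sketch of the same idea. It assumes h3-py v3.x (where string_to_h3 is available; v4 renamed it to str_to_int), and the small dataframe and the npartitions value are made up from the sample rows above purely for illustration:

    import pandas as pd
    import numpy as np
    import dask.dataframe as dd
    import h3

    # Small pandas frame mirroring the sample data, converted to a
    # two-partition dask dataframe for demonstration
    pdf = pd.DataFrame({
        'h3': ['8ca80c8e91015ff', '8ca80c8ecadd1ff', '8ca80cbb455b1ff'],
        'lat': [-23.068134, -23.095896, -23.052007],
        'lon': [-52.042272, -52.031107, -52.055948],
    })
    df = dd.from_pandas(pdf, npartitions=2)

    # map_partitions hands each partition to the lambda as a pandas
    # DataFrame; .apply converts each h3 string to its uint64 index
    df['h3'] = df.map_partitions(
        lambda partition: partition['h3'].apply(h3.string_to_h3),
        meta=('h3', np.uint64),
    )
    print(df.compute())

    A closely related variant (also untested) is to call map_partitions on the column itself; each partition then arrives as a pandas Series rather than a DataFrame, which is likely what the attempts in the question were missing:

    df['h3'] = df['h3'].map_partitions(
        lambda series: series.apply(h3.string_to_h3),
        meta=('h3', np.uint64),
    )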