I have a table looks below.
lon | lat | output | |
---|---|---|---|
4050 | -47.812224 | -19.043365 | 1890.283215 |
5149 | -47.812224 | -19.043365 | 1890.283215 |
7316 | -47.812224 | -19.043365 | 1890.283215 |
8406 | -47.812224 | -19.043365 | 1890.283215 |
511 | -47.812014 | -19.007094 | 1813.785728 |
1555 | -47.812014 | -19.007094 | 1813.785728 |
3764 | -47.812014 | -19.007094 | 1821.363582 |
4846 | -47.812014 | -19.007094 | 1813.785728 |
29 | -47.811177 | -19.008053 | 1763.091936 |
1114 | -47.811177 | -19.008053 | 1763.091936 |
3262 | -47.811177 | -19.008053 | 1763.091936 |
4357 | -47.811177 | -19.008053 | 1763.091936 |
1436 | -47.774424 | -19.008700 | 2172.781911 |
2557 | -47.774424 | -19.008700 | 2174.394848 |
4725 | -47.774424 | -19.008700 | 2172.781911 |
5840 | -47.774424 | -19.008700 | 2172.781911 |
5211 | -47.774166 | -19.043847 | 2897.092502 |
6313 | -47.774166 | -19.043847 | 2897.092502 |
8460 | -47.774166 | -19.043847 | 2897.092502 |
9543 | -47.774166 | -19.043847 | 2897.092502 |
1679 | -47.773958 | -19.007574 | 2179.670924 |
2770 | -47.773958 | -19.007574 | 2179.670924 |
4998 | -47.773958 | -19.007574 | 2179.670924 |
6088 | -47.773958 | -19.007574 | 2179.670924 |
1937 | -47.773121 | -19.008533 | 2236.769862 |
3004 | -47.773121 | -19.008533 | 2236.769862 |
5231 | -47.773121 | -19.008533 | 2236.769862 |
6332 | -47.773121 | -19.008533 | 2236.769862 |
I would like to drop the duplicates by using groupby on lon and lat but keep the most repeated value on output
for example
lon | lat | output |
---|---|---|
-47.812224 | -19.043365 | 1890.283215 |
-47.812014 | -19.007094 | 1813.785728 |
-47.811177 | -19.008053 | 1763.091936 |
-47.774424 | -19.008700 | 2172.781911 |
-47.774166 | -19.043847 | 2897.092502 |
-47.773958 | -19.007574 | 2179.670924 |
-47.773121 | -19.008533 | 2236.769862 |
Could anyone tell me how to do this?
We can use the .groupby aggregate methods as an alternative to Andrej's method of using .apply and computing for each row.
While it does solve our problem .apply methods tend to become slower for a large dataset due to lack of vectorization.
Also, reset_index works much faster when used along with 'inplace=True'.
%%timeit
df.groupby(['lat', 'lon']).agg(pd.Series.mode).reset_index(inplace=True)
Hope that helps!!