Tags: python, pandas, dataframe, dask, dask-dataframe

How to reassign column values from a groupby.aggregate back to the original dataframe in dask?


I have a dataset like this, where each row holds one player's data:

>>> df.head()
   game_size                                                          match_id  party_size  player_assists  player_kills  player_name  team_id  team_placement
0         37  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO           2               0             1     SnuffIes        4              18
1         37  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO           2               0             1       Ozon3r        4              18
2         37  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO           2               0             0       bovize        5              33
3         37  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO           2               0             0      sbahn87        5              33
4         37  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO           2               0             2    GeminiZZZ       14              11

Source: Full Dataset (126 MB compressed, 1.18 GB decompressed)

I need to create a new column called weights, where each row is a number between 0 and 1, calculated as the player's total kills (player_kills) divided by the total number of kills for that player's team.
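To make the target concrete, here is a minimal plain-pandas sketch with a few invented rows (the column names match the dataset, the numbers are made up):

import pandas as pd

# Toy frame with the same columns as the real data (numbers invented).
toy = pd.DataFrame({
    "match_id": ["m1", "m1", "m1", "m1"],
    "team_id": [4, 4, 5, 5],
    "player_kills": [1, 3, 2, 0],
})

# weights = a player's kills divided by their team's total kills.
# (A team whose total is 0 would produce NaN here, since 0/0 is undefined.)
team_total = toy.groupby(["match_id", "team_id"])["player_kills"].transform("sum")
toy["weights"] = toy["player_kills"] / team_total
print(toy[["team_id", "player_kills", "weights"]])
#    team_id  player_kills  weights
# 0        4             1     0.25
# 1        4             3     0.75
# 2        5             2     1.00
# 3        5             0     0.00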

My Attempt

My initial thought was to create a new column called total_kills from a groupby aggregation sum. Then it's easy to create the weights column, where each row is simply player_kills divided by total_kills. This is the code so far to calculate the groupby sum:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.read_csv("pubg.csv")
print(df.head().to_markdown())  # head() already triggers a compute in dask

# Total kills per team in each match
total_kills = df.groupby(
    ['match_id', 'team_id']
).aggregate({"player_kills": 'sum'}).reset_index()
print(total_kills.head().to_markdown())
                                                            match_id  team_id  player_kills
0  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO        4             2
1  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO        5             0
2  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO       14             2
3  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO       15             0
4  2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO       17             1

So far, so good. But trying to assign the aggregated player_kills values back to df as a total_kills column with this line of code doesn't work:

df['total_kills'] = total_kills['player_kills']

It produces this error:

Traceback (most recent call last):
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py", line 11, in <module>
    df['total_kills'] = total_kills['player_kills']
    ~~^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 4952, in __setitem__
    df = self.assign(**{key: value})
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 5401, in assign
    data = elemwise(methods.assign, data, *pairs, meta=df2)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 6505, in elemwise
    args = _maybe_align_partitions(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 176, in _maybe_align_partitions
    dfs2 = iter(align_partitions(*dfs)[0])
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 130, in align_partitions
    raise ValueError(
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

How do I solve this problem?


Solution

  • I think the two dataframes don't have the same shape, so there is an index alignment issue. You could try this:

    import pandas as pd

    total_kills = df.groupby(['match_id', 'team_id']).agg(
        player_total_kills=("player_kills", "sum")
    ).reset_index()
    df_final = pd.merge(left=df, right=total_kills, on=["match_id", "team_id"])
    

    By the way, I didn't notice it was a dask question, but the logic is the same: merge/join the dataframes on match_id and team_id after the aggregation.

    I hope it works.
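
    For reference, here is a sketch of the same idea in dask itself, assuming the pubg.csv file from the question. A column-based merge aligns rows by value rather than by partition index, so no set_index is needed:

    import dask.dataframe as dd

    df = dd.read_csv("pubg.csv")

    # Per-team kill totals, renamed so the column doesn't collide with
    # player_kills when merged back onto the player rows.
    total_kills = (
        df.groupby(["match_id", "team_id"])
        .aggregate({"player_kills": "sum"})
        .rename(columns={"player_kills": "total_kills"})
        .reset_index()
    )

    # Merging on columns works in dask even when divisions are unknown.
    df = df.merge(total_kills, on=["match_id", "team_id"], how="left")

    # The weights column the question asks for (teams with zero total
    # kills produce NaN here, since 0/0 has no defined weight).
    df["weights"] = df["player_kills"] / df["total_kills"]
    print(df.head())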