I have a dataset like this where each row is player data:
>>> df.head()
| | game_size | match_id | party_size | player_assists | player_kills | player_name | team_id | team_placement |
|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | SnuffIes | 4 | 18 |
| 1 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | Ozon3r | 4 | 18 |
| 2 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | bovize | 5 | 33 |
| 3 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | sbahn87 | 5 | 33 |
| 4 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 2 | GeminiZZZ | 14 | 11 |
Source: Full Dataset - Compressed 126MB, Decompressed 1.18GB
I need to create a new column called `weights`, where each row is a number between 0 and 1, calculated as the player's kills (`player_kills`) divided by the total number of kills for that player's team.

My initial thought was to create a new column called `total_kills` from a groupby aggregation sum. Then it's easy to create the `weights` column, where each row is simply `player_kills` divided by `total_kills`. This is the code so far to calculate the groupby sum:
```python
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.read_csv("pubg.csv")
print(df.compute().head().to_markdown())

total_kills = df.groupby(
    ['match_id', 'team_id']
).aggregate({"player_kills": 'sum'}).reset_index()
print(total_kills.compute().head().to_markdown())
```
| | match_id | team_id | player_kills |
|---|---|---|---|
| 0 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 4 | 2 |
| 1 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 5 | 0 |
| 2 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 14 | 2 |
| 3 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 15 | 0 |
| 4 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 17 | 1 |
So far, so good. But trying to assign the aggregated `player_kills` column back to the original dataframe as `total_kills` with this line of code doesn't work:

```python
df['total_kills'] = total_kills['player_kills']
```

It produces this error:
```
Traceback (most recent call last):
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py", line 11, in <module>
    df['total_kills'] = total_kills['player_kills']
    ~~^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 4952, in __setitem__
    df = self.assign(**{key: value})
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 5401, in assign
    data = elemwise(methods.assign, data, *pairs, meta=df2)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\core.py", line 6505, in elemwise
    args = _maybe_align_partitions(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 176, in _maybe_align_partitions
    dfs2 = iter(align_partitions(*dfs)[0])
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\dataframe\multi.py", line 130, in align_partitions
    raise ValueError(
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
```
How do I solve this problem?
I think the two dataframes' shapes are not the same, so there is an index alignment issue. You could try this:

```python
total_kills = df.groupby(['match_id', 'team_id']).agg(
    player_total_kills=("player_kills", "sum")
).reset_index()
df_final = pd.merge(left=df, right=total_kills, on=["match_id", "team_id"])
```

Btw, I didn't notice at first that it's a Dask question, but the logic is the same (with Dask dataframes, call `df.merge(total_kills, on=["match_id", "team_id"])` instead of `pd.merge`). You need to merge/join the dataframes on `match_id` and `team_id` after aggregation, rather than assigning the aggregated column directly.

I hope it works.