I have three NumPy arrays: class_label, student_name, student_id and grade. I am aggregating the grade per class and name as follows:
import numpy as np
import numpy_groupies as npg
class_label = [0, 2, 1, 1, 2, 3, 0, 1]
student_name = [0, 0, 1, 2, 1, 1, 0, 1]
student_id = [0, 1, 2, 3, 4, 5, 6, 7]
grade = [10, 5, 7, 4, 9, 8, 9, 3]
n_classes = 4
n_names = 3
group_idx = np.vstack((class_label, student_name))
npg.aggregate(group_idx, grade, size=(4,3))
which returns:
array([[19, 0, 0],
[ 0, 10, 4],
[ 5, 9, 0],
[ 0, 8, 0]])
I am trying now to get the ids involved in this grouping (adding very minimum overhead to the code if possible)
I want to build a 2D array of shape (n_classes, n_names) in which each cell is a list of all student_id values that share the same (class_label, student_name) pair. For example:
I expect the result to look like this (a 4×3 object array of lists):
array([[ [0, 6], [], [] ],
[ [], [2, 7], [3 ],
[ [1], [4], [] ],
[ [], [5], [] ]], dtype=object)
in the final 2D output array, at row 0 (class 0) and column 0 (name 0), I expect the list [0, 6]. In other words, the cell at position [0,0] must be a list containing the IDs for all records whose class_label == 0 and student_name == 0.
You should use pandas here:
import pandas as pd
df = pd.DataFrame({'class': class_label,
'name': student_name,
'id': student_id,
'grade': grade,
})
which creates the following DataFrame:
class name id grade
0 0 0 0 10
1 2 0 1 5
2 1 1 2 7
3 1 2 3 4
4 2 1 4 9
5 3 1 5 8
6 0 0 6 9
7 1 1 7 3
Then use a pivot_table
with custom aggfunc
:
out = df.pivot_table(index='class', columns='name',
aggfunc={'grade': 'sum', 'id': list})
Output:
grade id
name 0 1 2 0 1 2
class
0 19.0 NaN NaN [0, 6] NaN NaN
1 NaN 10.0 4.0 NaN [2, 7] [3]
2 5.0 9.0 NaN [1] [4] NaN
3 NaN 8.0 NaN NaN [5] NaN
Or independently:
grades = df.pivot_table(index='class', columns='name', values='grade',
aggfunc='sum', fill_value=0)
indices = df.pivot_table(index='class', columns='name', values='id',
aggfunc=list)
Note the npg.aggregate_np
could also work:
npg.aggregate_np(group_idx, student_id, size=(4,3), func=list, dtype=object)
array([[array([0, 6]), 0, 0],
[0, array([2, 7]), array([3])],
[array([1]), array([4]), 0],
[0, array([5]), 0]], dtype=object)