Efficiently group IDs by two keys into a 2D array of lists

I have three NumPy arrays: class_label, student_name, student_id and grade. I am aggregating the grade per class and name as follows:

import numpy as np
import numpy_groupies as npg

class_label  = [0,  2, 1, 1, 2, 3, 0, 1]
student_name = [0,  0, 1, 2, 1, 1, 0, 1]
student_id   = [0,  1, 2, 3, 4, 5, 6, 7]
grade        = [10, 5, 7, 4, 9, 8, 9, 3]

n_classes = 4
n_names   = 3

group_idx = np.vstack((class_label, student_name))
npg.aggregate(group_idx, grade, size=(4,3))

which returns:

array([[19,  0,  0],
       [ 0, 10,  4],
       [ 5,  9,  0],
       [ 0,  8,  0]])

I am trying now to get the ids involved in this grouping (adding very minimum overhead to the code if possible)

I want to build a 2D array of shape (n_classes, n_names) in which each cell is a list of all student_id values that share the same (class_label, student_name) pair. For example:

I expect the result to look like this (a 4×3 object array of lists):

array([[ [0, 6],   [],       [] ],
       [ [],       [2, 7],   [3 ],
       [ [1],      [4],      [] ],
       [ [],       [5],      [] ]], dtype=object)

in the final 2D output array, at row 0 (class 0) and column 0 (name 0), I expect the list [0, 6]. In other words, the cell at position [0,0] must be a list containing the IDs for all records whose class_label == 0 and student_name == 0.

Solution

You should use pandas here:

import pandas as pd

df = pd.DataFrame({'class': class_label,
                   'name': student_name,
                   'id': student_id,
                   'grade': grade,
                  })

which creates the following DataFrame:

   class  name  id  grade
0      0     0   0     10
1      2     0   1      5
2      1     1   2      7
3      1     2   3      4
4      2     1   4      9
5      3     1   5      8
6      0     0   6      9
7      1     1   7      3

Then use a pivot_table with custom aggfunc:

out = df.pivot_table(index='class', columns='name',
                     aggfunc={'grade': 'sum', 'id': list})

Output:

      grade                 id             
name      0     1    2       0       1    2
class                                      
0      19.0   NaN  NaN  [0, 6]     NaN  NaN
1       NaN  10.0  4.0     NaN  [2, 7]  [3]
2       5.0   9.0  NaN     [1]     [4]  NaN
3       NaN   8.0  NaN     NaN     [5]  NaN

Or independently:

grades = df.pivot_table(index='class', columns='name', values='grade',
                        aggfunc='sum', fill_value=0)
indices = df.pivot_table(index='class', columns='name', values='id',
                         aggfunc=list)

Note the npg.aggregate_np could also work:

npg.aggregate_np(group_idx, student_id, size=(4,3), func=list, dtype=object)

array([[array([0, 6]), 0, 0],
       [0, array([2, 7]), array([3])],
       [array([1]), array([4]), 0],
       [0, array([5]), 0]], dtype=object)