pythonnumpy

Efficiently group IDs by two keys into a 2D array of lists


I have three NumPy arrays: class_label, student_name, student_id and grade. I am aggregating the grade per class and name as follows:

import numpy as np
import numpy_groupies as npg

class_label  = [0,  2, 1, 1, 2, 3, 0, 1]
student_name = [0,  0, 1, 2, 1, 1, 0, 1]
student_id   = [0,  1, 2, 3, 4, 5, 6, 7]
grade        = [10, 5, 7, 4, 9, 8, 9, 3]

n_classes = 4
n_names   = 3

group_idx = np.vstack((class_label, student_name))
npg.aggregate(group_idx, grade, size=(4,3))

which returns:

array([[19,  0,  0],
       [ 0, 10,  4],
       [ 5,  9,  0],
       [ 0,  8,  0]])

I am trying now to get the ids involved in this grouping (adding very minimum overhead to the code if possible)

I want to build a 2D array of shape (n_classes, n_names) in which each cell is a list of all student_id values that share the same (class_label, student_name) pair. For example:

I expect the result to look like this (a 4×3 object array of lists):

array([[ [0, 6],   [],       [] ],
       [ [],       [2, 7],   [3 ],
       [ [1],      [4],      [] ],
       [ [],       [5],      [] ]], dtype=object)

in the final 2D output array, at row 0 (class 0) and column 0 (name 0), I expect the list [0, 6]. In other words, the cell at position [0,0] must be a list containing the IDs for all records whose class_label == 0 and student_name == 0.


Solution

  • You should use here:

    import pandas as pd
    
    df = pd.DataFrame({'class': class_label,
                       'name': student_name,
                       'id': student_id,
                       'grade': grade,
                      })
    

    which creates the following DataFrame:

       class  name  id  grade
    0      0     0   0     10
    1      2     0   1      5
    2      1     1   2      7
    3      1     2   3      4
    4      2     1   4      9
    5      3     1   5      8
    6      0     0   6      9
    7      1     1   7      3
    

    Then use a pivot_table with custom aggfunc:

    out = df.pivot_table(index='class', columns='name',
                         aggfunc={'grade': 'sum', 'id': list})
    

    Output:

          grade                 id             
    name      0     1    2       0       1    2
    class                                      
    0      19.0   NaN  NaN  [0, 6]     NaN  NaN
    1       NaN  10.0  4.0     NaN  [2, 7]  [3]
    2       5.0   9.0  NaN     [1]     [4]  NaN
    3       NaN   8.0  NaN     NaN     [5]  NaN
    

    Or independently:

    grades = df.pivot_table(index='class', columns='name', values='grade',
                            aggfunc='sum', fill_value=0)
    indices = df.pivot_table(index='class', columns='name', values='id',
                             aggfunc=list)
    

    Note the npg.aggregate_np could also work:

    npg.aggregate_np(group_idx, student_id, size=(4,3), func=list, dtype=object)
    
    array([[array([0, 6]), 0, 0],
           [0, array([2, 7]), array([3])],
           [array([1]), array([4]), 0],
           [0, array([5]), 0]], dtype=object)