python numpy mean recarray

Numpy Mean Structured Array


Suppose that I have a structured array of students (strings) and test scores (ints), where each entry is the score that a specific student received on a specific test. Each student has multiple entries in this array, naturally.

Example

import numpy
grades = numpy.array([('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
                     dtype=[('student', 'a50'), ('score', 'i')])

print grades
#[('Mary', 96) ('John', 94) ('Mary', 88) ('Edgar', 89) ('John', 84)]

How do I easily compute the average score of each student? In other words, how do I take the mean of the array in the 'score' dimension? I'd like to do

grades.mean('score')

and have Numpy return

[('Mary', 92), ('John', 89), ('Edgar', 89)]

but Numpy complains

TypeError: an integer is required

Is there a Numpy-esque way to do this easily? I think it might involve taking a view of the structured array with a different dtype. Any help would be appreciated. Thanks.

Edit

>>> grades = numpy.zeros(5, dtype=[('student', 'a50'), ('score', 'i'), ('testid', 'i'])
>>> grades[0] = ('Mary', 96, 1)
>>> grades[1] = ('John', 94, 1)
>>> grades[2] = ('Mary', 88, 2)
>>> grades[3] = ('Edgar', 89, 1)
>>> grades[4] = ('John', 84, 2)
>>> numpy.mean(grades, 'testid')
TypeError: an integer is required

Solution

  • NumPy isn't designed to be able to group rows together and apply aggregate functions to those groups. You could:

    Here's the itertools solution, but as you can see it's quite complicated and inefficient. I'd recommend one of the other two methods.

    from itertools import groupby
    from operator import itemgetter
    import numpy as np
    np.array([(k, np.array(list(g), dtype=grades.dtype).view(np.recarray)['score'].mean())
              for k, g in groupby(np.sort(grades, order='student').view(np.recarray),
                                  itemgetter('student'))], dtype=grades.dtype)
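
For comparison, here is a minimal sketch of a per-student mean using only NumPy. It is not necessarily one of the two methods recommended above. It assumes the two-field layout from the question, and it swaps the 'a50' byte-string field for a unicode 'U50' field so the names print as plain strings on Python 3 (that change is an assumption of this sketch). np.unique with return_inverse=True labels each row with a group index, and np.bincount then sums and counts the scores per group.

```python
import numpy as np

# Same data as in the question; 'U50' instead of 'a50' is an assumption of this sketch.
grades = np.array(
    [('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
    dtype=[('student', 'U50'), ('score', 'i4')],
)

# names: sorted unique students; inverse: for each row, the index of its student in names
names, inverse = np.unique(grades['student'], return_inverse=True)

# Per-student sum of scores divided by the per-student number of entries
sums = np.bincount(inverse, weights=grades['score'])
counts = np.bincount(inverse)
means = sums / counts

result = list(zip(names.tolist(), means.tolist()))
print(result)
# [('Edgar', 89.0), ('John', 89.0), ('Mary', 92.0)]
```

Note that the students come back in sorted order rather than in order of first appearance, and the means are floats rather than ints.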