pythondjangopandasdataframedata-processing

A lightweight approach to processing Django Queryset data


I am looking for a optimal way to perform simple data processing from Django Queryset. I would like to not need to install libraries with high volumes like Pandas or numpy. The number of rows in Queryset should not exceed 2000. The idea is to perform basic functions such as x1, x2, x3 below. I want to avoid a separate database query for each of them so I'm thinking to get all this data once from the database, convert it to a standard data structure and calculate the results.

Queryset Sample:

T1_id    T2_id   T1_value    T2_value
  1        2        2           0
  3        5        0           0
  4        1        1           1
  2        7        0           3

Pandas code equivalents:

data = [[1, 2, 2, 0], [3, 5, 0, 0], [4, 1, 1, 1], [2, 7, 0, 3]]

df = pd.DataFrame(data, columns=['T1_id', 'T2_id', 'T1_value', 'T2_value'])

x1 = df[df['T1_id'] == 1]['T1_value'].mean()  # Mean of T1_value from rows where T1_id == 1
x2 = df[df['T1_id'] == 1]['T2_value'].sum()  # Sum of T2_value from rows where T1_id == 1
x3 = len(df[df['T1_id'] == 1])  # Number of rows where T1_id == 1

Solution

  • You can easily calculate these all with an .aggregate(…) [Django-doc]:

    from django.db.models import Avg, Count, Sum
    
    MyModel.objects.filter(T1_id=1).aggregate(
        x1=Avg('T1_value'), x2=Sum('T2_value'), x3=Count('*')
    )

    this will return a dictionary that maps the items to the corresponding result, for example:

    {'x1': 14.25, 'x2': 1302, 'x3': 42}

    By doing this as a query, the database will calculate this, and database is closer to the data and often is optimized to make aggregates, so will likely outperform pandas quite significantly. Especially since if we would use pandas to aggregate, it would first need to move the data from the database to pandas, which will likely take longer than the database determining the values directly.