sqldjangoprefetchdjango-prefetch-related

Pre-fetching huge querysets in django


TLDR: how can I prevent very large IN sets in the SQL query generated?

When I pre-fetch a ManyToMany field pointing to model Tag from a model Object:

obj_qs = models.Object.objects.filter(created_time__gt = 2024)
obj_qs.prefetch_related('tags')

I get a pre-fetching SQL query of the following form:

SELECT (`main_object_tag`.`object_id`) AS `_prefetch_related_val_object_id`, `main_tag`.`id`, `main_tag`.`name` FROM `main_tag` INNER JOIN `main_object_tag` ON (`main_tag`.`id` = `main_object_tag`.`tag_id`) WHERE `main_object_tag`.`object_id` IN (1, 2, 3, 4, 5); args=(1, 2, 3, 4, 5)

The problem is, the IN clause could be huge this way - hundreds of thousands of objects in my case. Could I work around this somehow - using a query instead of the "IN" specification?

Extra: why?

We would serialize the objects in the following way subsequently:

return JsonResponse([
  {
    'name': obj.name,
    'tags': [tag.name for tag in obj.tags.all()]
  } for obj in obj_qs
])

Solution

  • If the amount is too large, you can work with chunks, and thus fetch the queryset in chunks of ~10'000 elements for example.

    You can use the .iterator(…) [Django-doc] for this:

    return JsonResponse(
        [
            {'name': obj.name, 'tags': [tag.name for tag in obj.tags.all()]}
            for obj in obj_qs.iterator(10000)
        ]
    )

    another idea is to limit the amount of data transferred, by only fetching the relevant data from the object and the tag, in this case the .name by using .only(…) [Django-doc]:

    from django.db.models import Prefetch
    
    obj_qs = models.Object.objects.filter(created_time__gt=2024).only('name')
    obj_qs.prefetch_related(Prefetch('tags', Tag.objects.only('pk', 'name')))