djangooptimizationforeign-keysgeneric-foreign-key

django: prefetch related objects of a GenericForeignKey


Suppose I have a model Box with a GenericForeignKey that points to either an Apple instance or a Chocolate instance. Apple and Chocolate, in turn, have ForeignKeys to Farm and Factory, respectively. I want to display a list of Boxes, for which I need to access Farm and Factory. How do I do this in as few DB queries as possible?

Minimal illustrative example:

class Farm(Model):
    ...

class Apple(Model):
    farm = ForeignKey(Farm)
    ...

class Factory(Model):
    ...

class Chocolate(Model):
    factory = ForeignKey(Factory)
    ...

class Box(Model)
    content_type = ForeignKey(ContentType)
    object_id = PositiveIntegerField()
    content_object = GenericForeignKey('content_type', 'object_id')
    ...

    def __unicode__(self):
        if self.content_type == ContentType.objects.get_for_model(Apple):
            apple = self.content_object
            return "Apple {} from Farm {}".format(apple, apple.farm)
        elif self.content_type == ContentType.objects.get_for_model(Chocolate):
            chocolate = self.content_object
            return "Chocolate {} from Factory {}".format(chocolate, chocolate.factory)

Here are a few things I tried. In all these examples, N is the number of Boxes. The query count assumes that the ContentTypes for Apple and Chocolate have already been cached, so the get_for_model() calls do not hit the DB.

1) Naive:

print [box for box in Box.objects.all()]

This does 1 (fetch Boxes) + N (fetch Apple or Chocolate for each Box) + N (fetch Farm for each Apple and Factory for each Chocolate) queries.

2) select_related doesn't help here, because Box.content_object is a GenericForeignKey.

3) As of django 1.4, prefetch_related can fetch GenericForeignKeys.

print [box for box in Box.objects.prefetch_related('content_object').all()]

This does 1 (fetch Boxes) + 2 (fetch Apples and Chocolates for all Boxes) + N (fetch Farm for each Apple and Factory for each Chocolate) queries.

4) Apparently prefetch_related isn't smart enough to follow ForeignKeys of GenericForeignKeys. If I try:

print [box for box in Box.objects.prefetch_related( 'content_object__farm', 'content_object__factory').all()]

it rightfully complains that Chocolate objects don't have a farm field, and vice versa.

5) I could do:

apple_ctype = ContentType.objects.get_for_model(Apple)
chocolate_ctype = ContentType.objects.get_for_model(Chocolate)
boxes_with_apples = Box.objects.filter(content_type=apple_ctype).prefetch_related('content_object__farm')
boxes_with_chocolates = Box.objects.filter(content_type=chocolate_ctype).prefetch_related('content_object__factory')

This does 1 (fetch Boxes) + 2 (fetch Apples and Chocolates for all Boxes) + 2 (fetch Farms for all Apples and Factories for all Chocolates) queries. The downside is that I have to merge and sort the two querysets (boxes_with_apples, boxes_with_chocolates) manually. In my real application, I'm displaying these Boxes in a paginated ModelAdmin. It's not obvious how to integrate this solution there. Maybe I could write a custom Paginator to do this caching transparently?

6) I could cobble together something based on this that also does O(1) queries. But I'd rather not mess with internals (_content_object_cache) if I can avoid it.

In summary: Printing a Box requires access to the ForeignKeys of a GenericForeignKey. How can I print N Boxes in O(1) queries? Is (5) the best I can do, or is there a simpler solution?

Bonus points: How would you refactor this DB schema to make such queries easier?


Solution

  • You can manually implement something like prefetch_selected and use Django's select_related method, that will make join in database query.

    apple_ctype = ContentType.objects.get_for_model(Apple)
    chocolate_ctype = ContentType.objects.get_for_model(Chocolate)
    boxes = Box.objects.all()
    content_objects = {}
    # apples
    content_objects[apple_ctype.id] = Apple.objects.select_related(
        'farm').in_bulk(
            [b.object_id for b in boxes if b.content_type == apple_ctype]
        )
    # chocolates
    content_objects[chocolate_ctype.id] = Chocolate.objects.select_related(
        'factory').in_bulk(
            [b.object_id for b in boxes if b.content_type == chocolate_ctype]
        )
    

    This should make only 3 queries (get_for_model queries are omitted). The in_bulk method returns a dict in the format {id: model}. So to get your content_object you need a code like:

    content_obj = content_objects[box.content_type_id][box.object_id]
    

    However I'm not sure if this code will be quicker than your O(5) solution as it requires additional iteration over boxes queryset and it also generates a query with a WHERE id IN (...) statement.

    But if you sort boxes only by fields from Box model you can fill the content_objects dict after pagination. But you need to pass content_objects to __unicode__ somehow.

    How would you refactor this DB schema to make such queries easier?

    We have a similar structure. We store content_object in Box, but instead of object_id and content_object we use ForeignKey(Box) in Apple and Chocolate. In Box we have a get_object method to return the Apple or Chocolate model. In this case we can use select_related, but in most of our use-cases we filter Boxes by content_type. So we have the same problems like your 5th option. But we started our project on Django 1.2 when there was no prefetch_selected.

    If you rename farm/factory to some common name, like creator, will prefetch_related work?

    About your option 6

    I can't say anything against filling _content_object_cache. If you don't like to deal with internals you can fill a custom property and then use

    apple = getattr(self, 'my_custop_prop', None)
    if apple is None:
        apple = self.content_object