Suppose I have a model Box
with a GenericForeignKey
that points to either an Apple
instance or a Chocolate
instance. Apple
and Chocolate
, in turn, have ForeignKeys to Farm
and Factory
, respectively. I want to display a list of Box
es, for which I need to access Farm
and Factory
. How do I do this in as few DB queries as possible?
Minimal illustrative example:
class Farm(Model):
...
class Apple(Model):
farm = ForeignKey(Farm)
...
class Factory(Model):
...
class Chocolate(Model):
factory = ForeignKey(Factory)
...
class Box(Model)
content_type = ForeignKey(ContentType)
object_id = PositiveIntegerField()
content_object = GenericForeignKey('content_type', 'object_id')
...
def __unicode__(self):
if self.content_type == ContentType.objects.get_for_model(Apple):
apple = self.content_object
return "Apple {} from Farm {}".format(apple, apple.farm)
elif self.content_type == ContentType.objects.get_for_model(Chocolate):
chocolate = self.content_object
return "Chocolate {} from Factory {}".format(chocolate, chocolate.factory)
Here are a few things I tried. In all these examples, N is the number of Boxes. The query count assumes that the ContentType
s for Apple
and Chocolate
have already been cached, so the get_for_model()
calls do not hit the DB.
1) Naive:
print [box for box in Box.objects.all()]
This does 1 (fetch Boxes) + N (fetch Apple or Chocolate for each Box) + N (fetch Farm for each Apple and Factory for each Chocolate) queries.
2) select_related
doesn't help here, because Box.content_object
is a GenericForeignKey
.
3) As of django 1.4, prefetch_related
can fetch GenericForeignKey
s.
print [box for box in Box.objects.prefetch_related('content_object').all()]
This does 1 (fetch Boxes) + 2 (fetch Apples and Chocolates for all Boxes) + N (fetch Farm for each Apple and Factory for each Chocolate) queries.
4) Apparently prefetch_related
isn't smart enough to follow ForeignKeys of GenericForeignKeys. If I try:
print [box for box in Box.objects.prefetch_related(
'content_object__farm',
'content_object__factory').all()]
it rightfully complains that Chocolate
objects don't have a farm
field, and vice versa.
5) I could do:
apple_ctype = ContentType.objects.get_for_model(Apple)
chocolate_ctype = ContentType.objects.get_for_model(Chocolate)
boxes_with_apples = Box.objects.filter(content_type=apple_ctype).prefetch_related('content_object__farm')
boxes_with_chocolates = Box.objects.filter(content_type=chocolate_ctype).prefetch_related('content_object__factory')
This does 1 (fetch Boxes) + 2 (fetch Apples and Chocolates for all Boxes) + 2 (fetch Farms for all Apples and Factories for all Chocolates) queries. The downside is that I have to merge and sort the two querysets (boxes_with_apples
, boxes_with_chocolates
) manually. In my real application, I'm displaying these Boxes in a paginated ModelAdmin. It's not obvious how to integrate this solution there. Maybe I could write a custom Paginator to do this caching transparently?
6) I could cobble together something based on this that also does O(1) queries. But I'd rather not mess with internals (_content_object_cache
) if I can avoid it.
In summary: Printing a Box requires access to the ForeignKeys of a GenericForeignKey. How can I print N Boxes in O(1) queries? Is (5) the best I can do, or is there a simpler solution?
Bonus points: How would you refactor this DB schema to make such queries easier?
You can manually implement something like prefetch_selected
and use Django's select_related
method, that will make join in database query.
apple_ctype = ContentType.objects.get_for_model(Apple)
chocolate_ctype = ContentType.objects.get_for_model(Chocolate)
boxes = Box.objects.all()
content_objects = {}
# apples
content_objects[apple_ctype.id] = Apple.objects.select_related(
'farm').in_bulk(
[b.object_id for b in boxes if b.content_type == apple_ctype]
)
# chocolates
content_objects[chocolate_ctype.id] = Chocolate.objects.select_related(
'factory').in_bulk(
[b.object_id for b in boxes if b.content_type == chocolate_ctype]
)
This should make only 3 queries (get_for_model
queries are omitted). The in_bulk
method returns a dict in the format {id: model}. So to get your content_object you need a code like:
content_obj = content_objects[box.content_type_id][box.object_id]
However I'm not sure if this code will be quicker than your O(5) solution as it requires additional iteration over boxes queryset and it also generates a query with a WHERE id IN (...)
statement.
But if you sort boxes only by fields from Box model you can fill the content_objects
dict after pagination. But you need to pass content_objects
to __unicode__
somehow.
How would you refactor this DB schema to make such queries easier?
We have a similar structure. We store content_object
in Box
, but instead of object_id
and content_object
we use ForeignKey(Box)
in Apple
and Chocolate
. In Box
we have a get_object
method to return the Apple or Chocolate model. In this case we can use select_related
, but in most of our use-cases we filter Boxes by content_type. So we have the same problems like your 5th option. But we started our project on Django 1.2 when there was no prefetch_selected.
If you rename farm/factory to some common name, like creator, will prefetch_related work?
About your option 6
I can't say anything against filling _content_object_cache
.
If you don't like to deal with internals you can fill a custom property and then use
apple = getattr(self, 'my_custop_prop', None)
if apple is None:
apple = self.content_object