djangodjango-1.11django-mysql

Django - Cannot perform another query while using a queryset iterator()


I'm using Django 1.11 with MySQL. Upgrading to 2 isn't feasible in the short term so isn't an acceptable solution to my immediate problem, but answers referring to Django 2 may help others so feel free to post them.

I need to perform a data migration on all rows in a table. There are less than 40000 rows but they are quite big - two of the columns are ~15KB of JSON which get parsed when the model is loaded. (These are the rows I need to use in the data migration so I cannot defer them)

So as not to load all the objects into memory simultaneously, I thought I'd use queryset.iterator which only parses rows 100 at time. This works fine if all I do is read the results, but if I perform another query (eg to save one of the objects) then once I reach the end of the current chunk of 100 results, the next chunk of 100 results are not fetched and the iterator finishes.

It's as if the result set that fetchmany fetches the rows from has been lost.

To illustrate the scenario using ./manage.py shell (Assume there exist 40000 MyModel with sequential ids)

iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
  print(obj.id)

The above prints the ids 1 to 40000 as expected.

iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
  print(obj.id)
  obj.save()

The above only prints the ids 1 to 100

iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
  print(obj.id)
  if obj.id == 101:
    obj.save()

The above only prints the ids 1 to 200

Replacing obj.save with anything else that makes a query to the DB (eg app.models.OtherModel.objects.first()) has the same result.

Is it simply not possible to make another query while using queryset iterator? Is there another way to achieve the same thing?

Thanks


Solution

  • As suggested by @dirkgroten, Paginator is an alternative to iterator that's potentially a better solution in terms of memory usage as it uses slicing on the queryset which adds OFFSET and LIMIT clauses to retrieve only part of the full result set.

    However, high OFFSET values incur a performance penalty on MySQL: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/

    Therefore seeking on an indexed column may be a better option:

    chunk_size = 100
    seek_id = 0
    next_seek_id = -1
    while seek_id != next_seek_id:
      seek_id = next_seek_id
      for obj in app.models.MyModel.objects.filter(id__gt=seek_id)[:chunk_size]:
        next_seek_id = obj.id
        # do your thing
    
    

    Additionally, if your data is such that performing the query isn't expensive but instantiating model instances is, iterator has the potential advantage of doing a single database query. Hopefully other answers will be able to shed light on the use of queryset.iterator with other queries.