Tags: json, django, database, bulkinsert, bulk-load

Django: Is there a way to efficiently bulk get_or_create()?


I need to import a database (given in JSON format) of papers and authors. The database is very large (194 million entries), so I am forced to use Django's bulk_create() method.

To load the authors for the first time I use the following script:

from typing import Any, Dict, List

def load_authors(paper_json_entries: List[Dict[str, Any]]):
    authors: List[Author] = []
    for paper_json in paper_json_entries:
        for author_json in paper_json['authors']:
            # The length check is needed because a few authors don't have an ID.
            if len(author_json['ids']) and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
                authors.append(Author(author_id=author_json['ids'][0], name=author_json['name']))
    # set() deduplicates the collected Author objects before inserting.
    Author.objects.bulk_create(set(authors))

However, this is much too slow. The bottleneck is this query:

and not Author.objects.filter(author_id=author_json['ids'][0]).exists():

Unfortunately, I have to make this query: one author can of course write multiple papers, and without the check there would be a key conflict.

Is there a way to implement something like the normal get_or_create() efficiently with bulk_create?


Solution

  • To avoid creating entries with existing unique keys, you can enable the ignore_conflicts parameter:

    from typing import Any, Dict, List

    def load_authors(paper_json_entries: List[Dict[str, Any]]):
        Author.objects.bulk_create(
            (
                Author(author_id=author_json['ids'][0], name=author_json['name'])
                for paper_json in paper_json_entries
                for author_json in paper_json['authors']
                # Skip the few authors that have no ID.
                if author_json['ids']
            ),
            # Rows whose unique key already exists are skipped silently
            # instead of raising an IntegrityError.
            ignore_conflicts=True
        )