I need to import a database (given in JSON format) of papers and authors. The database is very large (194 million entries), so I am forced to use Django's bulk_create() method.
To load the authors for the first time, I use the following script:
from typing import Any, Dict, List

def load_authors(paper_json_entries: List[Dict[str, Any]]):
    authors: List[Author] = []
    for paper_json in paper_json_entries:
        for author_json in paper_json['authors']:
            # The length check is needed because a few authors don't have an id
            if len(author_json['ids']) and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
                authors.append(Author(author_id=author_json['ids'][0], name=author_json['name']))
    Author.objects.bulk_create(set(authors))
However, this is much too slow. The bottleneck is this query:
and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
Unfortunately, I have to make this query because one author can of course write multiple papers, and without the check there would be a key conflict.
Is there a way to implement something like the normal get_or_create() efficiently with bulk_create()?
To avoid creating entries with existing unique keys, you can enable the ignore_conflicts parameter:
def load_authors(paper_json_entries: List[Dict[str, Any]]):
    Author.objects.bulk_create(
        (
            Author(author_id=author_json['ids'][0], name=author_json['name'])
            for paper_json in paper_json_entries
            for author_json in paper_json['authors']
            if author_json['ids']  # skip the few authors that have no id
        ),
        ignore_conflicts=True
    )
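With ignore_conflicts=True the database silently skips any row whose author_id already exists (or is duplicated within the batch), so no per-author exists() query is needed. Since the dump has 194 million entries, you will probably also want to process it in chunks and pass batch_size so each INSERT statement stays a reasonable size. Below is a minimal sketch of that pattern; the batched helper, the chunk sizes, and the assumption that the papers can be iterated lazily are illustrative, not part of your setup:

from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

def batched(entries: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    # Yield lists of at most `size` paper entries (hypothetical helper).
    it = iter(entries)
    while chunk := list(islice(it, size)):
        yield chunk

def load_all_authors(paper_json_stream: Iterable[Dict[str, Any]]) -> None:
    # Process the dump in chunks so memory stays bounded; the database
    # ignores duplicate author_ids instead of raising an IntegrityError.
    for chunk in batched(paper_json_stream, 10_000):
        Author.objects.bulk_create(
            (
                Author(author_id=author_json['ids'][0], name=author_json['name'])
                for paper_json in chunk
                for author_json in paper_json['authors']
                if author_json['ids']
            ),
            ignore_conflicts=True,
            batch_size=1000,  # cap the number of rows per INSERT statement
        )

Keep in mind that conflicting rows are dropped entirely: if an existing author_id reappears with a different name, the new name is ignored rather than updated. If you need updates as well, bulk_create with update_conflicts (Django 4.1+) or a separate update pass would be required.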