I have two models, Author and Book, which are related via m2m (one author can have many books, one book can have many authors).
We often need to query and match records for ingests using text strings across both models, e.g. "JRR Tolkien - Return of the King", when unique identifiers are not available.
I would like to test whether using SearchVectorField with GIN indexes can improve full-text search response times. But since the search query will be SearchVector(author__name, book__title), it seems that both models need a SearchVectorField added.
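For context, the un-indexed version of that query looks roughly like this (a sketch using the field names and related_name from the models below; the tsvector is built on the fly for every row, and the m2m join fans out to one row per author):

from django.contrib.postgres.search import SearchQuery, SearchVector

# On-the-fly search across both models via the m2m relation;
# nothing here uses a stored SearchVectorField or a GIN index yet.
Book.objects.annotate(
    text=SearchVector("title", "main_authors__name")
).filter(text=SearchQuery("JRR Tolkien - Return of the King"))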
This becomes more complicated when either table needs updating, since it appears Postgres triggers would have to be set up on both tables, which might make updating anything completely untenable.
What is the modern best practice in Django for adopting vectorised full-text search where m2m-related models are concerned? Should the SearchVectorField be placed on a through table? Or on each model? How should the triggers be applied?
I've been searching for guides on this specifically, but no one seems to mention m2ms when talking about SearchVectorFields. I did find this old question.
Also, if Postgres is really not the way forward in modern Django, I'd gladly take direction toward something better suited/supported/documented. In our case, we are using Postgres 11.6.
from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex


class Author(models.Model):
    name = models.CharField(max_length=100, unique=True)
    main_titles = models.ManyToManyField(
        "Book",
        through="BookMainAuthor",
        related_name="main_authors",
    )
    search = SearchVectorField(null=True)

    class Meta:
        # GIN index on the stored search vector
        indexes = [GinIndex(fields=["search"])]


class BookMainAuthor(models.Model):
    """The m2m through table for book and author (main)."""

    book = models.ForeignKey("Book", on_delete=models.CASCADE)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)

    class Meta:
        unique_together = ["book", "author"]


class Book(models.Model):
    title = models.CharField(max_length=100, unique=True)
    search = SearchVectorField(null=True)

    class Meta:
        # GIN index on the stored search vector
        indexes = [GinIndex(fields=["search"])]
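For clarity, the kind of bulk update that would populate those search columns without triggers looks like this (a sketch only; whether something like it is acceptable for ongoing updates is exactly what I'm asking):

from django.contrib.postgres.search import SearchVector

# Recompute both stored vectors server-side; no rows are pulled into Python.
Author.objects.update(search=SearchVector("name"))
Book.objects.update(search=SearchVector("title"))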
Exploring Yevgeniy-kosmak's answer below, this is a simple way to index the string permutations of the through table for Book.title and Author.name.
Performing a search using the SearchVectorField is fast, and a little more effective for titles that have multiple authors.
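That is, a plain (un-ranked) filter against the stored vector on the through table, something like:

from django.contrib.postgres.search import SearchQuery

# tsvector @@ tsquery match against the stored column, which can be
# served by a GIN index on BookMainAuthor.search (if one is added).
BookMainAuthor.objects.filter(search=SearchQuery("JRR Tolkien - Return of the King"))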
However, when trying to use SearchRank, things slow down dramatically:
BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).order_by("-rank").explain(analyze=True)
"Gather Merge (cost=394088.44..489923.26 rows=821384 width=227) (actual time=8569.729..8812.096 rows=989307 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=393088.41..394115.14 rows=410692 width=227) (actual time=8559.074..8605.681 rows=329769 loops=3)
Sort Key: (ts_rank(to_tsvector(COALESCE((search_vector)::text, ''::text)), plainto_tsquery('JRR Tolkien - Return of the King'::text), 6)) DESC
Sort Method: external merge Disk: 77144kB
Worker 0: Sort Method: external merge Disk: 76920kB
Worker 1: Sort Method: external merge Disk: 76720kB
-> Parallel Seq Scan on bookstore_bookmainauthor (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.589..8378.569 rows=329769 loops=3)
Planning Time: 0.369 ms
Execution Time: 8840.139 ms"
Without the sort, it only saves ~500ms:
BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).explain(analyze=True)
'Gather (cost=1000.00..364517.21 rows=985661 width=227) (actual time=0.605..8282.976 rows=989307 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on bookstore_bookmainauthor (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.356..8187.242 rows=329769 loops=3)
Planning Time: 0.039 ms
Execution Time: 8306.799 ms'
However, I noticed that doing the following dramatically improves the query execution time (~17x), sorting included:
- passing an F expression as the first argument of SearchRank (instead of the field name in quotes, which is what the documentation directs you to do)
- passing the config kwarg to the SearchQuery
BookMainAuthor.objects.annotate(rank=SearchRank(F("search"), SearchQuery("JRR Tolkien - Return of the King", config='english'))).order_by("-rank").explain(analyze=True)
Gather Merge (cost=304240.66..403077.76 rows=847116 width=223) (actual time=336.654..559.367 rows=989307 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=303240.63..304299.53 rows=423558 width=223) (actual time=334.970..373.282 rows=329769 loops=3)
Sort Key: (ts_rank(search_vector, '''jrr'' & ''tolkien'' & ''return'' & ''king'''::tsquery)) DESC
Sort Method: external merge Disk: 75192kB
Worker 0: Sort Method: external merge Disk: 76672kB
Worker 1: Sort Method: external merge Disk: 76976kB
-> Parallel Seq Scan on bookstore_bookmainauthor (cost=0.00..173893.48 rows=423558 width=223) (actual time=0.014..211.007 rows=329769 loops=3)
Planning Time: 0.059 ms
Execution Time: 584.402 ms
Finally got it. I assume you need to search with a query that contains both the author's name and the book's title at the same time, and you can't split it so that the "book" part of the query is looked up in the Book table and the "author" part in the Author table.
Yep, building an index over fields from separate tables is impossible in PostgreSQL. I don't see that as a weakness of PostgreSQL; it's just a very unusual case in which you really need such an index. In most cases there are other solutions that are no worse in terms of efficiency. Of course, you can always look at ElasticSearch if for some reason you are sure it's necessary.
I'd advise the following approach. You can give BookMainAuthor this structure:
class BookMainAuthor(models.Model):
    """The m2m through table for book and author (main)."""

    book = models.ForeignKey("Book", on_delete=models.CASCADE)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)
    book_full_name = models.CharField(max_length=200)
    search = SearchVectorField(null=True)

    class Meta:
        unique_together = ["book", "author"]
As I see it, it shouldn't cause any trouble to maintain the book_full_name field, which would contain both the author's and the book's names with an appropriate separator. Everything else is a textbook case.
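For illustration only, a minimal sketch of that maintenance, assuming a " - " separator and a save() override plus an update() call rather than triggers (signals or a periodic bulk rebuild would work just as well):

from django.contrib.postgres.search import SearchVector

class BookMainAuthor(models.Model):
    # ... fields exactly as above ...

    def save(self, *args, **kwargs):
        # Combine the two related names into the plain text column.
        self.book_full_name = f"{self.author.name} - {self.book.title}"
        super().save(*args, **kwargs)
        # Rebuild the stored tsvector server-side from that column.
        type(self).objects.filter(pk=self.pk).update(
            search=SearchVector("book_full_name", config="english")
        )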
In my experience, as long as the BookMainAuthor table contains no more than 10M entries, everything will be just fine on an average single server (for example, an AX161 from here).