python, django, postgresql, full-text-search, m2m

Using SearchVectorField on many-to-many related models


I have two models, Author and Book, which are related via m2m (one author can have many books, and one book can have many authors).

Often we need to query and match records for ingests using text strings across both models, e.g. "JRR Tolkien - Return of the King", when unique identifiers are not available.

I would like to test whether using SearchVectorField with GIN indexes can improve full-text search response times, but since the search query will be SearchVector(author__name, book__title), it seems that both models need a SearchVectorField added.

This becomes more complicated when each table needs updating, since it appears Postgres triggers need to be set up on both tables, which might make updating anything completely untenable.
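
For reference, the trigger-free alternative would be refreshing the vector from application code, e.g. with a post_save signal; a minimal sketch, assuming the Book model from the repro below:

from django.contrib.postgres.search import SearchVector
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Book)
def refresh_book_search(sender, instance, **kwargs):
    # update() writes the vector in SQL and avoids re-firing post_save
    Book.objects.filter(pk=instance.pk).update(
        search=SearchVector("title", config="english")
    )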

Question

What is the modern best practice in Django for adopting vectorised full-text search when m2m related models are concerned? Should the SearchVectorField be placed on the through table? Or on each model? How should triggers be applied?

I've been searching for guides on this specifically, but no one seems to mention m2ms when talking about SearchVectorFields. I did find this old question.

Also, if Postgres is really not the way forward in modern Django, I'd gladly take direction on something better suited/supported/documented. In our case, we are using Postgres 11.6.
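
For concreteness, the kind of per-table trigger I mean would be applied with RunSQL in a migration; a rough sketch for Book only (the app label and dependency are assumptions, and it uses Postgres's built-in tsvector_update_trigger, with EXECUTE PROCEDURE since we are on 11.6):

from django.db import migrations

class Migration(migrations.Migration):
    # hypothetical dependency; substitute the migration that added the field
    dependencies = [("bookstore", "0002_book_search")]

    operations = [
        migrations.RunSQL(
            sql="""
                CREATE TRIGGER book_search_update
                BEFORE INSERT OR UPDATE ON bookstore_book
                FOR EACH ROW EXECUTE PROCEDURE
                tsvector_update_trigger(search, 'pg_catalog.english', title);
            """,
            reverse_sql="DROP TRIGGER book_search_update ON bookstore_book;",
        )
    ]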

Repro

from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Author(models.Model):
    name = models.CharField(max_length=100, unique=True)
    main_titles = models.ManyToManyField(
        "Book",
        through="BookMainAuthor",
        related_name="main_authors",
    )
    search = SearchVectorField(null=True)

    class Meta:
        # GIN index so the vector is actually queried via the index
        indexes = [GinIndex(fields=["search"])]

class BookMainAuthor(models.Model):
    """The m2m through table for book and author (main)"""

    book = models.ForeignKey("Book", on_delete=models.CASCADE)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)

    class Meta:
        unique_together = ["book", "author"]

class Book(models.Model):
    title = models.CharField(max_length=100, unique=True)
    search = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["search"])]
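
To populate the vectors initially, something like this from a data migration or the shell should work (a sketch; a trigger or signal is still needed to keep them current afterwards):

from django.contrib.postgres.search import SearchVector

Author.objects.update(search=SearchVector("name", config="english"))
Book.objects.update(search=SearchVector("title", config="english"))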

Exploring indexing the M2M Through table

Following Yevgeniy-kosmak's answer below, indexing the string permutations of Book.title and Author.name on the through table is straightforward.
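
The population itself is a one-off, roughly like this (a sketch, assuming the book_full_name column from the answer below; update() cannot join to related tables, so the permutation string is built in Python first, with batching omitted for brevity):

from django.contrib.postgres.search import SearchVector

for row in BookMainAuthor.objects.select_related("author", "book"):
    BookMainAuthor.objects.filter(pk=row.pk).update(
        book_full_name=f"{row.author.name} - {row.book.title}"
    )

BookMainAuthor.objects.update(
    search=SearchVector("book_full_name", config="english")
)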

Performing a search using the SearchVectorField is fast and a little more effective for some titles that have multiple authors.

However, when trying to use SearchRank, things slow down dramatically:

BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).order_by("-rank").explain(analyze=True)

"Gather Merge  (cost=394088.44..489923.26 rows=821384 width=227) (actual time=8569.729..8812.096 rows=989307 loops=1)
Workers Planned: 2
Workers Launched: 2
  ->  Sort  (cost=393088.41..394115.14 rows=410692 width=227) (actual time=8559.074..8605.681 rows=329769 loops=3)
        Sort Key: (ts_rank(to_tsvector(COALESCE((search_vector)::text, ''::text)), plainto_tsquery('JRR Tolkien - Return of the King'::text), 6)) DESC
        Sort Method: external merge  Disk: 77144kB
 – 

Worker 0:  Sort Method: external merge  Disk: 76920kB
        Worker 1:  Sort Method: external merge  Disk: 76720kB
        ->  Parallel Seq Scan on bookstore_bookmainauthor  (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.589..8378.569 rows=329769 loops=3)
Planning Time: 0.369 ms
Execution Time: 8840.139 ms"

Dropping the sort only saves about 500 ms:

BookMainAuthor.objects.annotate(rank=SearchRank("search", SearchQuery("JRR Tolkien - Return of the King"))).explain(analyze=True)

Gather  (cost=1000.00..364517.21 rows=985661 width=227) (actual time=0.605..8282.976 rows=989307 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on bookstore_bookmainauthor  (cost=0.00..264951.11 rows=410692 width=227) (actual time=0.356..8187.242 rows=329769 loops=3)
Planning Time: 0.039 ms
Execution Time: 8306.799 ms

However, I noticed that if you do the following, it dramatically improves the query execution time (~17x), sorting included:

  1. Add an F expression as the first argument of SearchRank (instead of the name of the field in quotes, which is what the documentation directs you to do)
  2. Add a config kwarg to the SearchQuery

BookMainAuthor.objects.annotate(rank=SearchRank(F("search"), SearchQuery("JRR Tolkien - Return of the King", config="english"))).order_by("-rank").explain(analyze=True)

Gather Merge  (cost=304240.66..403077.76 rows=847116 width=223) (actual time=336.654..559.367 rows=989307 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=303240.63..304299.53 rows=423558 width=223) (actual time=334.970..373.282 rows=329769 loops=3)
        Sort Key: (ts_rank(search_vector, '''jrr'' & ''tolkien'' & ''return'' & ''king'''::tsquery)) DESC
        Sort Method: external merge  Disk: 75192kB
        Worker 0:  Sort Method: external merge  Disk: 76672kB
        Worker 1:  Sort Method: external merge  Disk: 76976kB
        ->  Parallel Seq Scan on bookstore_bookmainauthor  (cost=0.00..173893.48 rows=423558 width=223) (actual time=0.014..211.007 rows=329769 loops=3)
Planning Time: 0.059 ms
Execution Time: 584.402 ms

Comparing the Sort Key lines shows why: without the F expression and config, Postgres casts the stored vector back to text and re-parses it with to_tsvector(COALESCE(...)) on every row, whereas the faster plan ranks against the stored tsvector directly.


Solution

  • Finally got it. I suppose you need to search with a query containing both the author's name and the book's name at the same time, and you wouldn't be able to separate them so that the "book" part of the query is matched against the Book table and the "author" part against Author.

    Yep, making an index over fields from separate tables is impossible with PostgreSQL. I don't see it as a weakness of PostgreSQL; it's just a very unusual case when you really need such an index. In most cases there are other solutions, no worse in terms of efficiency. Of course, you can always look at ElasticSearch if for some reason you are sure it's necessary.

    I'll suggest the following approach. You can make BookMainAuthor with this structure:

    class BookMainAuthor(models.Model):
        """The m2m through table for book and author (main)"""
    
        book = models.ForeignKey("Book", on_delete=models.CASCADE)
        author = models.ForeignKey("Author", on_delete=models.CASCADE)
        book_full_name = models.CharField(max_length=200)
        search = SearchVectorField(null=True)
    
        class Meta:
            unique_together = ["book", "author"]
    

    It seems to me that it shouldn't cause any trouble to maintain the book_full_name field, which would contain both the author and book names with an appropriate separator. Everything else is a textbook case (see the sketch after this answer for one way to keep it in sync).

    From my experience, if the BookMainAuthor table contains no more than 10M entries, everything will be just fine on an average single server (for example, like an AX161 from here).
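
One way to keep book_full_name in sync without triggers would be overriding save() on the through model; a minimal sketch, assuming the model from the answer above (the vector itself can then be refreshed as shown earlier):

from django.db import models
from django.contrib.postgres.search import SearchVectorField

class BookMainAuthor(models.Model):
    """The m2m through table for book and author (main)."""

    book = models.ForeignKey("Book", on_delete=models.CASCADE)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)
    book_full_name = models.CharField(max_length=200)
    search = SearchVectorField(null=True)

    class Meta:
        unique_together = ["book", "author"]

    def save(self, *args, **kwargs):
        # keep the denormalised "Author - Title" string current on every save
        self.book_full_name = f"{self.author.name} - {self.book.title}"
        super().save(*args, **kwargs)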