I have to implement a search function which will be fault tolerant.
Currently, I have the following situation:
Models:
class Tag(models.Model):
    name = models.CharField(max_length=255)

class Illustration(models.Model):
    name = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag)
Query:
queryset.annotate(similarity=TrigramSimilarity('name', fulltext) + TrigramSimilarity('tags__name', fulltext))
Example data:
Illustrations:
ID | Name | Tags |
---|--------|-------------------|
1 | "Dog" | "Animal", "Brown" |
2 | "Cat" | "Animals" |
Illustration has Tags:
ID_Illustration | ID_Tag |
----------------|--------|
1 | 1 |
1 | 2 |
2 | 3 |
Tags:
ID_Tag | Name |
-------|----------|
1 | Animal |
2 | Brown |
3 | Animals |
When I run the query with "Animal", the similarity for "Dog" should be higher than for "Cat", as it is a perfect match.
Unfortunately, both tags are considered together somehow.
Currently, it looks like it's concatenating the tags in a single string and then checks for similarity:
TrigramSimilarity("Animal Brown", "Animal") => X
But I would like to adjust it so that I get the highest similarity among an Illustration instance's name and each of its tags:
Max([
    TrigramSimilarity('Name', "Animal"),
    TrigramSimilarity("Tag_1", "Animal"),
    TrigramSimilarity("Tag_2", "Animal"),
]) => X
Edit1: I'm trying to query all Illustrations where either the title or one of the tags has a similarity greater than X.
Edit2: Additional example:
fulltext = 'Animal'
TrigramSimilarity('Animal Brown', fulltext) => x
TrigramSimilarity('Animals', fulltext) => y
Where x < y
But what I want is actually
TrigramSimilarity(Max(['Animal', 'Brown']), fulltext) => x (Similarity to Animal)
TrigramSimilarity('Animals', fulltext) => y
Where x > y
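To make the two cases concrete, here is a rough pure-Python approximation of how PostgreSQL's pg_trgm scores strings: lower-case the text, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams. The helper names `trigrams`, `similarity` and `best_similarity` are made up for illustration, and the real extension may differ in details (for example with non-ASCII text), so treat the numbers as indicative only:

```python
import re

def trigrams(text):
    """Approximate pg_trgm's trigram set for a string (ASCII only)."""
    result = set()
    # Lower-case the input and treat non-alphanumerics as word breaks
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_similarity(name, tags, fulltext):
    """Highest similarity among the name and each tag taken separately."""
    return max(similarity(candidate, fulltext) for candidate in [name, *tags])

fulltext = "Animal"
x_joined = similarity("Animal Brown", fulltext)  # tags treated as one string
y = similarity("Animals", fulltext)
x_wanted = best_similarity("Dog", ["Animal", "Brown"], fulltext)
y_wanted = best_similarity("Cat", ["Animals"], fulltext)
print(x_joined < y, x_wanted > y_wanted)
```

Under this approximation the joined string "Animal Brown" scores about 0.54 against "Animal", while "Animals" scores about 0.67. Scoring each tag separately gives "Dog" a 1.0 through its "Animal" tag, which is the ordering I am after.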
You cannot break up the tags__name lookup (at least I don't know of a way).
From your examples, I can see two possible solutions (the first one does not strictly use Django):
Not everything needs to pass strictly through Django
We have Python powers, so let's use them:
Let us compose the query first:
from difflib import SequenceMatcher
from django.db.models import Q

# THRESHOLD is a module-level constant between 0 and 1 that you define yourself

def create_query(fulltext):
    illustration_names = Illustration.objects.values_list('name', flat=True)
    tag_names = Tag.objects.values_list('name', flat=True)
    query = []

    for name in illustration_names:
        score = SequenceMatcher(None, name, fulltext).ratio()
        if score == 1:
            # Perfect match for an illustration name
            return [Q(name=name)]
        if score >= THRESHOLD:
            query.append(Q(name=name))

    for name in tag_names:
        score = SequenceMatcher(None, name, fulltext).ratio()
        if score == 1:
            # Perfect match for a tag name
            return [Q(tags__name=name)]
        if score >= THRESHOLD:
            query.append(Q(tags__name=name))

    return query
Then to create your queryset:
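from functools import reduce  # Needed only in Python 3
from operator import or_
conditions = create_query(fulltext)  # reduce() raises TypeError on an empty list
queryset = Illustration.objects.filter(reduce(or_, conditions)) if conditions else Illustration.objects.none()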
Decode the above:
- We check every Illustration and Tag name against our fulltext and compose a query from every name whose similarity passes the THRESHOLD.
- SequenceMatcher compares sequences, and its ratio() method returns a ratio 0 <= ratio <= 1, where 0 indicates no match and 1 indicates a perfect match. Check this answer for another usage example: Find the similarity percent between two strings (Note: there are other string comparison modules as well; find one that suits you).
- Django's Q() objects allow the creation of complex queries (more on the linked docs).
- With operator.or_ and reduce we transform a list of Q() objects into an OR-separated query argument: Q(name=name_1) | Q(name=name_2) | ... | Q(tags__name=tag_name_1) | ...
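As a quick illustration of the scoring step on its own, the sketch below runs SequenceMatcher over the tag names from the question without touching the database. The helper `names_above_threshold` and the 0.6 cut-off are assumptions made for this example. Note also that SequenceMatcher is case-sensitive, so lower-casing both strings is a choice made here, not something create_query does:

```python
from difflib import SequenceMatcher

def names_above_threshold(names, fulltext, threshold=0.6):
    """Return (name, score) pairs whose ratio against fulltext is >= threshold."""
    matches = []
    for name in names:
        # Lower-case both sides so "animal" and "Animal" compare as equal
        score = SequenceMatcher(None, name.lower(), fulltext.lower()).ratio()
        if score >= threshold:
            matches.append((name, score))
    # Best matches first
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

tag_names = ["Animal", "Brown", "Animals"]
for name, score in names_above_threshold(tag_names, "animal"):
    print(f"{name}: {score:.2f}")
```

With these inputs "Animal" is a perfect match, "Animals" stays above the cut-off, and "Brown" is dropped.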
Note:
You need to define an acceptable THRESHOLD.
As you can imagine, this will be a bit slow, because every Illustration and Tag name is loaded into Python and compared one by one, but that is to be expected when you need to do a "fuzzy" search.
(The Django Way:)
Use a query with a high similarity threshold and order the queryset by this similarity rate:
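from django.contrib.postgres.search import TrigramSimilarity
from django.db.models.functions import Greatest

queryset.annotate(
    similarity=Greatest(
        TrigramSimilarity('name', fulltext),
        TrigramSimilarity('tags__name', fulltext),
    )
).filter(similarity__gte=threshold).order_by('-similarity')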
Decode the above:
- Greatest() accepts two or more expressions or model fields (not to be confused with the Django method aggregate) and returns the max item.
- TrigramSimilarity(word, search) returns a rate between 0 and 1. The closer the rate is to 1, the more similar the word is to search.
- .filter(similarity__gte=threshold) filters out rows whose similarity is lower than the threshold.
- 0 < threshold < 1. You can set the threshold to 0.6, which is pretty high (consider that the default is 0.3). You can play around with that to tune your performance.
- .order_by('-similarity') orders the queryset by the similarity rate in descending order.