As part of my pet project, I've been using Elasticsearch for a while to store some of my entities and give users reasonably decent search capabilities.
I wanted to migrate away from Elastic because I've been having resource usage issues and have always thought it was overkill for what I need (just basic, word-order-independent fuzzy search), so I started playing around with ArangoDB.
But for the last few days I've been really struggling to get the same results with Arango when running multi-attribute, multi-word fuzzy searches. Let me show you an example.
These are some of the documents I'm storing in my DB and in Arango (only the relevant attributes are shown):
"edition_year","long_event_name","short_event_name"
2024,"71st Macau Grand Prix - FIA Formula Regional World Cup","2024 Macau Grand Prix"
2024,"2024 Kumho FIA TCR World Tour Event of Macau","2024 TCR Macau"
2024,"2024 FIA GT World Cup","2024 Macau GT Cup"
2023,"2023 TCR Asia - Macau","2023 TCR Macau"
2023,"2023 FIA GT World Cup - Macau","2023 Macau GT Cup"
2023,"70th Macau Grand Prix","2023 Macau Grand Prix"
2022,"Melco Greater Bay Area GT Cup 2022","2022 Macau GT Cup"
2022,"69th Macau Grand Prix","2022 Macau Grand Prix"
2021,"MGM Greater Bay Area GT Cup 2021","2021 Macau GT Cup"
2021,"68th Macau Grand Prix","2021 Macau GP"
So, if a user types "Macao 2024", I would like the first three documents to come back with the best scores: "Macao" is just one character away from "Macau", which appears in both long_event_name and short_event_name in two of them, and "2024" is present in at least edition_year and short_event_name.
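To make the expected ranking concrete, here is a minimal Python sketch of the behaviour I'm after. It is not ArangoDB code and does not reproduce ArangoSearch scoring. The weights (1.0 for an exact token match, 0.5 for a match within one edit) and the helper names `levenshtein`, `tokens`, `score` and `rank` are assumptions made up for illustration.

```python
import re

# Sample rows from above: (edition_year, long_event_name, short_event_name)
DOCS = [
    (2024, "71st Macau Grand Prix - FIA Formula Regional World Cup", "2024 Macau Grand Prix"),
    (2024, "2024 Kumho FIA TCR World Tour Event of Macau", "2024 TCR Macau"),
    (2024, "2024 FIA GT World Cup", "2024 Macau GT Cup"),
    (2023, "2023 TCR Asia - Macau", "2023 TCR Macau"),
    (2023, "2023 FIA GT World Cup - Macau", "2023 Macau GT Cup"),
    (2023, "70th Macau Grand Prix", "2023 Macau Grand Prix"),
    (2022, "Melco Greater Bay Area GT Cup 2022", "2022 Macau GT Cup"),
    (2022, "69th Macau Grand Prix", "2022 Macau Grand Prix"),
    (2021, "MGM Greater Bay Area GT Cup 2021", "2021 Macau GT Cup"),
    (2021, "68th Macau Grand Prix", "2021 Macau GP"),
]


def levenshtein(a, b):
    """Edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def tokens(value):
    """Lower-case word tokens; punctuation such as '-' is dropped."""
    return re.findall(r"\w+", str(value).lower())


def score(query, doc, max_distance=1):
    """Sum, over every query term and every attribute, an illustrative weight:
    1.0 for an exact token match, 0.5 for a token within max_distance edits."""
    total = 0.0
    for term in tokens(query):
        for field in doc:
            best = min((levenshtein(term, t) for t in tokens(field)),
                       default=max_distance + 1)
            if best == 0:
                total += 1.0
            elif best <= max_distance:
                total += 0.5
    return total


def rank(query):
    """Documents ordered by descending score; word order in the query is irrelevant."""
    return sorted(DOCS, key=lambda doc: score(query, doc), reverse=True)


for doc in rank("Macao 2024"):
    print(score("Macao 2024", doc), doc[2])
```

With these made-up weights the three 2024 documents come out on top. The 2023, 2022 and 2021 documents still match, because each of those years is one edit away from "2024", but they score lower since those matches are fuzzy rather than exact.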
So far, no matter how many different analyzers and queries I've tried, the results are far from satisfactory, and I'm starting to wonder whether this is possible at all with Arango or whether I'm just going down a rabbit hole and missing the obvious solution.
Any help on how to set up my view, my analyzers and the query would be really appreciated.
In the end, I came up with this query that, even if it's not perfect, returns quite decent results. I'll have a look later to see whether it can be improved, but for now this is as good as it gets.
FOR d IN eventEditionsView
  SEARCH
    ANALYZER(BOOST(d.longEventName LIKE "%macao%", 1.5), "en_tokenizer") AND
    ANALYZER(BOOST(d.longEventName LIKE "%2024%", 1.5), "en_tokenizer") OR
    ANALYZER(BOOST(d.shortEventName LIKE "%macao%", 1.0), "en_tokenizer") AND
    ANALYZER(BOOST(d.shortEventName LIKE "%2024%", 1.0), "en_tokenizer") OR
    BOOST(PHRASE(d.longEventName, [ { LEVENSHTEIN_MATCH: ["macao", 1, true] } ], "en_tokenizer") AND
          PHRASE(d.longEventName, [ { LEVENSHTEIN_MATCH: ["2024", 1, true] } ], "en_tokenizer"), 1.5) OR
    BOOST(PHRASE(d.shortEventName, [ { LEVENSHTEIN_MATCH: ["macao", 1, true] } ], "en_tokenizer") AND
          PHRASE(d.shortEventName, [ { LEVENSHTEIN_MATCH: ["2024", 1, true] } ], "en_tokenizer"), 1.0)
  SORT TFIDF(d) DESC
  RETURN d
And this would be the (beautified) output:
[
  {
    "editionYear": 2024,
    "longEventName": "2024 Kumho FIA TCR World Tour Event of Macau",
    "shortEventName": "2024 TCR Macau",
    "score": 12.256467819213867
  },
  {
    "editionYear": 2023,
    "longEventName": "2023 TCR Asia - Macau",
    "shortEventName": "2023 TCR Macau",
    "score": 11.729081153869629
  },
  {
    "editionYear": 2023,
    "longEventName": "2023 FIA GT World Cup - Macau",
    "shortEventName": "2023 Macau GT Cup",
    "score": 11.729081153869629
  },
  {
    "editionYear": 2024,
    "longEventName": "2024 FIA GT World Cup",
    "shortEventName": "2024 Macau GT Cup",
    "score": 4.82772159576416
  },
  {
    "editionYear": 2024,
    "longEventName": "71st Macau Grand Prix - FIA Formula Regional World Cup",
    "shortEventName": "2024 Macau Grand Prix",
    "score": 4.82772159576416
  },
  {
    "editionYear": 2021,
    "longEventName": "68th Macau Grand Prix",
    "shortEventName": "2021 Macau GP",
    "score": 4.618818283081055
  }
]
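The SEARCH expression in the query follows a regular pattern: for each attribute there is one group of LIKE conditions and one group of LEVENSHTEIN_MATCH conditions, and every search word must match inside a group. A rough Python sketch of generating that query for arbitrary input could look like the following. The function name `build_search` and the bind parameter names (`term0`, `like0`, ...) are hypothetical, and I'm assuming bind parameters are accepted in these positions, which should be checked against your ArangoDB version.

```python
def build_search(terms, field_boosts, view="eventEditionsView",
                 analyzer="en_tokenizer", max_distance=1):
    """Return (query, bind_vars) following the pattern of the query above.

    terms        -- already lower-cased search words, e.g. ["macao", "2024"]
    field_boosts -- attribute name -> boost, e.g. {"longEventName": 1.5}
    Attribute, view and analyzer names are inlined, so they must come from
    trusted configuration, never from user input.
    """
    bind_vars = {}
    for i, term in enumerate(terms):
        bind_vars[f"term{i}"] = term         # used by LEVENSHTEIN_MATCH
        bind_vars[f"like{i}"] = f"%{term}%"  # used by LIKE (substring match)

    indexes = range(len(terms))
    groups = []

    # One LIKE group per attribute: every term must match as a substring.
    for field, boost in field_boosts.items():
        likes = [
            f'ANALYZER(BOOST(d.{field} LIKE @like{i}, {boost}), "{analyzer}")'
            for i in indexes
        ]
        groups.append(" AND ".join(likes))

    # One fuzzy group per attribute: every term must match within max_distance edits.
    for field, boost in field_boosts.items():
        fuzzy = [
            f'PHRASE(d.{field}, [ {{ LEVENSHTEIN_MATCH: [@term{i}, {max_distance}, true] }} ], "{analyzer}")'
            for i in indexes
        ]
        groups.append("BOOST(" + " AND ".join(fuzzy) + f", {boost})")

    search = " OR\n    ".join(groups)
    query = (
        f"FOR d IN {view}\n"
        f"  SEARCH\n"
        f"    {search}\n"
        f"  SORT TFIDF(d) DESC\n"
        f"  RETURN d"
    )
    return query, bind_vars


query, bind_vars = build_search(
    ["macao", "2024"],
    {"longEventName": 1.5, "shortEventName": 1.0},
)
print(query)
print(bind_vars)
```

Passing the words as bind parameters avoids having to escape quotes in user input. Note that "%" and "_" typed by a user would still act as wildcards in the LIKE patterns.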