arangodb

ArangoSearch on multiple fields with scoring


I think conceptually what I'm trying to do is very logical and straightforward. But I haven't been able to figure out a way to do this.

Using ArangoDB 3.12.4.

If I have a document like:

product:  apricot yoghurt
category: food
type: sugarfree

And I search for "sugarfree yoghurt"

I want to match the document above, but also match documents such as:

product:  cherry yoghurt
category: food
type: fatfree

product:  yoghurt starter
category: condiment
type: powder

The first document (apricot yoghurt) should rank highest, though, because two of the search terms match it, across multiple fields.

I'm finding it fascinating that I still haven't been able to find any docs or any answered questions on this kind of use-case. And I'm starting to dread the fact that this may just not be supported.

One option is to have an extra field with a concatenation of all the fields I want to search. But then what if I want to boost the scores for certain fields?


Solution

  • If I understand correctly, you want to match every document in which at least one of the fields (product, category, or type) contains at least one of the search tokens (sugarfree, yoghurt, and food in the example query). You can express this by searching each field for the tokens, either with doc.field IN [ token1, token2, ... ] (comparison operator) or with [ token1, token2, ... ] ANY == doc.field (array comparison operator), and combining these sub-expressions with logical OR.

    Relevant docs: Search operators, Searching Full-text with ArangoSearch

    Documents with more matching tokens across their fields should rank higher, and that is generally how the scoring functions such as BM25() behave anyway. To adjust the relevance of certain fields, wrap the corresponding sub-expression in the BOOST() function. Also see Query Time Relevance Tuning.

    LET a = "text_en"
    LET t = TOKENS("sugarfree yoghurt food", a)
    FOR doc IN v
      SEARCH ANALYZER(doc.product IN t OR BOOST(doc.category IN t, 2) OR doc.type IN t, a)
      // or: SEARCH ANALYZER(t ANY == doc.product OR BOOST(t ANY == doc.category, 2) OR t ANY == doc.type, a)
      // or: SEARCH ANALYZER(MIN_MATCH(doc.product IN t, BOOST(doc.category IN t, 2), doc.type IN t, 1), a)
      LET score = BM25(doc)
      SORT score DESC
      RETURN MERGE(doc, {score})
    

    (System attributes omitted)

    category  | product                | type      | score
    ----------+------------------------+-----------+-------------------
    food      | apricot yoghurt        | sugarfree | 3.0241031646728516
    condiment | sugarfree yoghurt food | powder    | 2.882746934890747
    food      | cherry yoghurt         | fatfree   | 1.6378090381622314
    food      | sugar beet             | vegetable | 1.0779929161071777

    AQL query to create the dataset (@@coll is a bind parameter for the target collection):

    LET products = [
      { product: "apricot yoghurt", category: "food", type: "sugarfree" },
      { product: "cherry yoghurt", category: "food", type: "fatfree"},
      { product: "sugarfree yoghurt food", category: "condiment", type: "powder"},
      { product: "sugar beet", category: "food", type: "vegetable" },
      { product: "bluray player", category: "electronics", type: "device" },
    ]
    
    FOR p IN products INSERT p INTO @@coll
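
The search query above relies on an arangosearch View named v that indexes product, category, and type with the text_en Analyzer, but the View definition itself is not shown. A minimal sketch of what it might look like in arangosh follows. The collection name "products" is an assumption (the dataset query only uses the bind parameter @@coll), and the link properties are just one plausible configuration, not necessarily the one used to produce the scores above.

```javascript
// Hypothetical names: "products" stands in for whatever collection @@coll
// was bound to; "v" is the View name used in the search query.
const collectionName = "products";
const viewName = "v";

// Link the collection to the View and index the three searched fields
// with the built-in text_en Analyzer.
const viewProperties = {
  links: {
    [collectionName]: {
      fields: {
        product: { analyzers: ["text_en"] },
        category: { analyzers: ["text_en"] },
        type: { analyzers: ["text_en"] }
      }
    }
  }
};

// The global `db` object exists in arangosh; outside of it this step is skipped.
// Assumes the collection already exists before the View links to it.
if (typeof db !== "undefined") {
  db._createView(viewName, "arangosearch", viewProperties);
}
```

Listing the fields explicitly keeps the View small; setting includeAllFields on the link would be an alternative if every attribute should be searchable.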