ravendb

Does RavenDB tokenize and filter queries?


If one defines a field in RavenDB for fulltext search, it uses an analyzer which tokenizes the field and does post processing (source). If that field is now queried, what happens to the search term in the query? Is it also tokenized and post-processed? If yes, is it tokenized and post-processed by the same analyzer which is used during index time? Can the analyzer for indexing and for querying be different?

An example:

Collection:

{"Name": "xxxabcd", "@metadata": {"@collection": "Names"}}
{"Name": "yyyabcd", "@metadata": {"@collection": "Names"}}

Index:

from names in docs.Names
select new {
    names.Name
}

Activate Indexing on the Name field to Search and use the NGram analyzer (no idea how to do this in RQL). NGram creates 2-6 character long tokens out of the name (source). So one token will be abcd which is shared by both documents.

Query:

from index "Names/ByName" 
where search(Name, "xxxabcd")

The query returns no search results. If the search term were post-processed into NGram tokens such as abcd, it would return both documents, but it does not. So what happens to the search term xxxabcd?

I cannot find any documentation on how search terms on full-text fields are handled.


Solution

  • Usually, the analyzer configured in the index definition runs both at indexing time and at query time.

    From:

    https://ravendb.net/learn/inside-ravendb-book/reader/4.0/10-static-indexes-and-other-advanced-options#full-text-search-queries

    The search() method accepts the query string you're looking for and passes it to the analyzer for the specified field. It then compares the terms the analyzer returned with the terms already in the index, and if there's a match on any of them, it's considered to be a match for the query.


    After investigation:
    NGram is an exception to the above rule.


    By default, the NGram analyzer generates tokens that are only 2-6 characters long.
    So with your example documents and the default NGram settings, the terms generated in the index from your 2 documents are:

    [Screenshot: index terms generated for the two documents]
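
    To make the term list concrete, here is a minimal sketch of what the index ends up containing. It is not RavenDB's actual analyzer: it assumes the NGram analyzer simply lowercases the value and emits every substring of 2 to 6 characters, and the names `NGramSketch` and `Terms` are hypothetical.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class NGramSketch
    {
        // Assumption: terms are all lowercased substrings of length minGram..maxGram.
        public static HashSet<string> Terms(string text, int minGram = 2, int maxGram = 6)
        {
            var terms = new HashSet<string>();
            var lower = text.ToLowerInvariant();
            for (var len = minGram; len <= maxGram; len++)
                for (var start = 0; start + len <= lower.Length; start++)
                    terms.Add(lower.Substring(start, len));
            return terms;
        }

        public static void Main()
        {
            var first = Terms("xxxabcd");
            var second = Terms("yyyabcd");

            // "abcd" is a 4-character substring of both names.
            Console.WriteLine(first.Contains("abcd") && second.Contains("abcd")); // True

            // The full 7-character name exceeds the 6-character maximum.
            Console.WriteLine(first.Contains("xxxabcd")); // False
        }
    }
    ```

    Under these assumptions, a query term only matches if it is itself one of the stored 2-6 character terms.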

    Now, when you query with:

    var result = session.Query<Index_With_NGram_Analyzer.indexEntry, Index_With_NGram_Analyzer>()
                        .Search(x => x.Name, "xxxabcd")
                        .OfType<myName>()
                        .ToList();
    

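    you will get 0 results because "xxxabcd" is not passed through the NGram analyzer at query time,
    but through the StandardAnalyzer instead. The StandardAnalyzer keeps "xxxabcd" as a single 7-character term, which is longer than any of the 2-6 character terms stored in the index.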

    Note again: this exception applies only to the NGram analyzer.
    For any other analyzer, the same analyzer is used at query time.


    The way to go about this is either:


    For the sake of having a complete answer:

    Index definition is:

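    public class Index_With_NGram_Analyzer : AbstractIndexCreationTask<myName>
    {
        // Shape of the index entries; used as the query type on the client.
        public class indexEntry
        {
            public string Name;
        }

        public Index_With_NGram_Analyzer()
        {
            Map = companies => from c in companies
                               select new indexEntry()
                               {
                                   Name = c.Name
                               };

            // Mark Name as a full-text search field so that it is analyzed.
            Index(r => r.Name, FieldIndexing.Search);
            // nameof(NGramAnalyzer) evaluates to the string "NGramAnalyzer".
            Analyzers.Add(n => n.Name, nameof(NGramAnalyzer));
        }
    }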
    
    public class myName
    {
        public string Name { get; set; }
    }
    

    Some other resources for FTS are:

    Full-Text Search
    https://ravendb.net/docs/article-page/6.0/csharp/client-api/session/querying/text-search/full-text-search

    Full-Text Search with Index
    https://ravendb.net/docs/article-page/6.0/csharp/indexes/querying/searching

    Demos
    https://demo.ravendb.net/demos/csharp/text-search/fts-with-static-index-single-field
    https://demo.ravendb.net/demos/csharp/text-search/fts-with-static-index-multiple-fields