java indexing lucene

Apache Lucene returns NaN as score when sorting by relevance


I want to order the results of my Apache Lucene search by relevance. But when I use SortField.FIELD_SCORE for sorting, the score of the resulting documents is always NaN. When I omit the sort parameter, the search works perfectly fine, and the result documents contain a valid score.

I use lucene-core 9.6.0 and lucene-analyzers-common 8.11.2, which are the most up-to-date versions in the Maven repository right now.

At first I thought I had messed up my index or query. But I'm able to reproduce the issue with the simplest implementation I can imagine:

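import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSearch {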
    public static void main(String[] args) {
        try {
            Directory directory = new ByteBuffersDirectory();
            
            try (IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(new SimpleAnalyzer()))) {
                indexWriter.addDocument(createDocument("a very simple example"));
                indexWriter.addDocument(createDocument("another example"));
                indexWriter.addDocument(createDocument("hello world"));
            }

            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            Query query = new TermQuery(new Term("value", "hello"));
            Sort sort = new Sort(SortField.FIELD_SCORE); // <<<< this causes the problem
            TopDocs topDocs = indexSearcher.search(query, 10, sort);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println(scoreDoc.doc + " : " + scoreDoc.score);
            }

            indexReader.close();
            directory.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static Document createDocument(String value) {
        Document document = new Document();
        document.add(new TextField("value", value, Field.Store.NO));
        return document;
    }
}

When I run this simple code, I get 2 : NaN. Without the sort parameter, I get 2 : 0.49662238. I have no idea what I'm missing here. Or could it be a bug in the library? Thanks for your help!

Edit: As @andrewJames stated in the comments, the ScoreDoc (actually FieldDoc) object contains a property fields which contains the score when using the sort parameter. After some testing, I found out that the actual score is identical in both cases (with/without sort parameter). So the sorting works correctly.


Solution

    Short Answer

    Sorting will work the way you expect, using your provided Sort criterion. It is equivalent to the default "relevance" sort order used by Lucene.

    You can still access the relevance score, if you want to, by casting ScoreDoc to FieldDoc.


    Longer Answer

    The sort order defined by:

    Sort sort = new Sort(SortField.FIELD_SCORE);
    

    is the same as the default sort order - which sorts by score (relevance) from highest to lowest. So, documents will be ordered in the same way in both cases.

    But when you use an explicit sort, you can no longer access the score using scoreDoc.score, as noted in the question. Instead, you only get NaN (not a number):

    2 : NaN
    

    However, you can still access the score (if you want to) by casting each ScoreDoc instance to a FieldDoc. We get FieldDocs because we have added a sort field to our search.

    FieldDoc extends ScoreDoc. It contains "information about how to sort the referenced document".

    In our case, there is only one sort field and it is the FIELD_SCORE value.

    So, to print the score, we can change this code:

    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        System.out.println(scoreDoc.doc + " : " + scoreDoc.score);
    }
    

    to this:

    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        FieldDoc fieldDoc = (FieldDoc) scoreDoc;
        System.out.println(scoreDoc.doc + " : " + fieldDoc.fields[0]);
    }
    

    Now we will get the score printed, instead of NaN:

    2 : 0.49662238
    

    Speculation: I may be wrong, but I assume the original scoreDoc.score field is NaN because it doesn't make sense to calculate it and store it here, given there is no guarantee that the applied search will use SortField.FIELD_SCORE.

    I expect users will mostly want to sort by something other than score - and maybe optionally use score as a tie-breaker.

    But if FIELD_SCORE is used, then the score will be available in that field, instead.
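
    If you would rather keep reading scoreDoc.score, Lucene can fill it in after a sorted search. The sketch below assumes the static helper TopFieldCollector.populateScores, which to my knowledge exists from Lucene 8 onward; the class name PopulatedScores and the method topScore are made up for this illustration. It rebuilds the index from the question and returns the score of the top hit with and without that extra call.

```java
import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldCollector;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical class name, used only for this illustration.
public class PopulatedScores {

    // Runs the question's query sorted by FIELD_SCORE and returns the
    // score field of the top hit. When populate is true, the scores are
    // filled in afterwards (assumes Lucene 8 or later).
    public static float topScore(boolean populate) throws IOException {
        try (Directory directory = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                    new IndexWriter(directory, new IndexWriterConfig(new SimpleAnalyzer()))) {
                String[] texts = {"a very simple example", "another example", "hello world"};
                for (String text : texts) {
                    Document document = new Document();
                    document.add(new TextField("value", text, Field.Store.NO));
                    writer.addDocument(document);
                }
            }

            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new TermQuery(new Term("value", "hello"));
                TopFieldDocs hits = searcher.search(query, 10, new Sort(SortField.FIELD_SCORE));

                if (populate) {
                    // Computes the score of each returned hit and writes it into ScoreDoc.score.
                    TopFieldCollector.populateScores(hits.scoreDocs, searcher, query);
                }
                return hits.scoreDocs[0].score;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without populateScores: " + topScore(false));
        System.out.println("with populateScores:    " + topScore(true));
    }
}
```

    The first line should print NaN, as in the question; the second should print a real score. Reading fieldDoc.fields[0] avoids the extra scoring pass, so I would only do this when other code depends on scoreDoc.score.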


    As an aside, instead of this:

    TopDocs topDocs = indexSearcher.search(query, 10, sort);
    

    you can use this:

    TopFieldDocs topFieldDocs = indexSearcher.search(query, 10, sort);
    

    This gives us access to topFieldDocs.fields, the SortField[] array describing the fields which were used for sorting the results. It includes information about the field types.
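
    To make the SortField[] array more concrete, here is a minimal sketch that sorts by a numeric field first and uses the score as a tie-breaker. The "year" field, its values, and the class name SortFieldInspection are assumptions made up for this example; they are not part of the question's index. Sorting on a field requires doc values, which is why the sketch adds a NumericDocValuesField.

```java
import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical class name, used only for this illustration.
public class SortFieldInspection {

    // Hypothetical helper: searchable text plus a sortable "year" value.
    private static Document createDocument(String value, long year) {
        Document document = new Document();
        document.add(new TextField("value", value, Field.Store.NO));
        // Doc values make the field usable in a Sort.
        document.add(new NumericDocValuesField("year", year));
        return document;
    }

    // Searches for "example", sorted by year (descending), then by score.
    public static TopFieldDocs search() throws IOException {
        try (Directory directory = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                    new IndexWriter(directory, new IndexWriterConfig(new SimpleAnalyzer()))) {
                writer.addDocument(createDocument("a very simple example", 2019L));
                writer.addDocument(createDocument("another example", 2021L));
                writer.addDocument(createDocument("hello world", 2020L));
            }

            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Sort sort = new Sort(
                        new SortField("year", SortField.Type.LONG, true),
                        SortField.FIELD_SCORE);
                return searcher.search(new TermQuery(new Term("value", "example")), 10, sort);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        TopFieldDocs topFieldDocs = search();

        // One SortField per sort criterion, in the order given to Sort.
        // FIELD_SCORE has no field name, so getField() returns null for it.
        for (SortField sortField : topFieldDocs.fields) {
            System.out.println(sortField.getField() + " -> " + sortField.getType());
        }

        // FieldDoc.fields holds one value per sort criterion, in the same order.
        for (ScoreDoc scoreDoc : topFieldDocs.scoreDocs) {
            FieldDoc fieldDoc = (FieldDoc) scoreDoc;
            System.out.println(fieldDoc.doc + " : year=" + fieldDoc.fields[0]
                    + ", score=" + fieldDoc.fields[1]);
        }
    }
}
```

    The position of a value in fieldDoc.fields matches the position of its SortField in topFieldDocs.fields, so the score is at index 1 here rather than index 0.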