
How to get a list of all tokens from Lucene 8.6.1 index?


I have looked at "how to get a list of all tokens from Solr/Lucene index?", but Lucene 8.6.1 doesn't seem to offer IndexReader.terms(). Has it been moved or replaced? Is there an easier way than the one in that answer?


Solution

    Some History

    You asked: I'm just wondering if IndexReader.terms() has moved or been replaced by an alternative.

    The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This was documented in the v4 alpha release notes.

    (Bear in mind that Lucene v4 was released way back in 2012.)

    The method in AtomicReader in v4 takes a field name.

    As the v4 release notes state:

    One big difference is that field and terms are now enumerated separately: a TermsEnum provides a BytesRef (wraps a byte[]) per term within a single field, not a Term.

    The key part there is "per term within a single field". So from that point onward there was no longer a single API call to retrieve all terms from an index.

    This approach has carried through to later releases, except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.

    Recent Releases

    That leaves us with the ability to access lists of terms, but only on a per-field basis.

    The following Java code is based on the latest release of Lucene (8.7.0), but should also hold true for the version you mention (8.6.1):

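    // Imports needed at the top of the enclosing class file:
    import java.io.IOException;
    import java.util.List;
    
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    
    private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
        // An index consists of one or more segments; each leaf covers one segment.
        List<LeafReaderContext> list = reader.leaves();
    
        for (LeafReaderContext lrc : list) {
            // terms() returns null if this segment has no terms for the field.
            Terms terms = lrc.reader().terms(fieldName);
            if (terms != null) {
                TermsEnum termsEnum = terms.iterator();
    
                // next() returns null once all terms have been visited.
                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    System.out.println(term.utf8ToString());
                }
            }
        }
    }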
    

    The example above assumes a reader opened on an existing index, as follows:

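    // Requires java.nio.file.Paths, org.apache.lucene.index.DirectoryReader
    // and org.apache.lucene.store.FSDirectory.
    private static final String INDEX_PATH = "/path/to/index/directory";
    ...
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));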
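
    One thing to watch for with a per-leaf loop: each leaf is a separate index segment with its own list of terms, so a term that occurs in more than one segment is reported once per segment. The sketch below shows one way to collapse those repeats. It is plain Java with no Lucene classes: the inner lists stand in for the terms each segment returns, and the names SegmentTermMerger and mergeTerms are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene. Each inner list plays the role
// of the terms that one segment (leaf) returns for a single field.
class SegmentTermMerger {

    // Collects the terms of every segment into one sorted, duplicate-free list.
    static List<String> mergeTerms(List<List<String>> termsPerSegment) {
        // A TreeSet drops duplicates and keeps String's natural ordering,
        // which can differ from Lucene's byte ordering for some non-ASCII terms.
        SortedSet<String> unique = new TreeSet<>();
        for (List<String> segmentTerms : termsPerSegment) {
            unique.addAll(segmentTerms);
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // "banana" appears in both simulated segments.
        List<List<String>> segments = List.of(
                List.of("apple", "banana"),
                List.of("banana", "cherry"));
        System.out.println(mergeTerms(segments)); // prints [apple, banana, cherry]
    }
}
```

    In the real loop, the equivalent change would be to add term.utf8ToString() to a shared set instead of printing it. Lucene 8 also has MultiTerms.getTerms(reader, fieldName), which as far as I know gives a merged view of a field's terms across all segments and may be worth a look, though I have not compared the two approaches.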
    

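    If you need to enumerate field names, the code in this question may provide a starting point. In Lucene 8.x, FieldInfos.getIndexedFields(reader) should also return the names of the indexed fields.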

    Final Note

    It should also be possible to access terms on a per-document basis instead of a per-field basis, as mentioned in the comments, but I have not tried this.