
How to get a list of all tokens from Lucene 8.6.1 index using PyLucene?


I got some direction from this question. I first build the index as shown below.

import os
import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator

index_path = "./index"

lucene.initVM()

analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
# Append only when the index directory already exists and is non-empty.
if os.path.isdir(index_path) and os.listdir(index_path):
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)

store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)

doc = Document()
doc.add(Field("docid", "1",  TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)

writer.close()
store.close()

I then try to get all the terms in the index for one field like below.

store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)

Attempt 1: calling next() as done in this question; it appears to be a method of BytesRefIterator, which TermsEnum implements.

for lrc in reader.leaves():
    terms = lrc.reader().terms('title')
    terms_enum = terms.iterator()
    while terms_enum.next():
        term = terms_enum.term()
        print(term.utf8ToString())

However, I can't seem to access that next() method.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while terms_enum.next():
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

AttributeError: 'TermsEnum' object has no attribute 'next'

Attempt 2: changing the while loop as suggested in the comments on this question.

while next(terms_enum):
    term = terms_enum.term()
    print(term.utf8ToString())

However, Python does not seem to recognize TermsEnum as an iterator.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while next(terms_enum):
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

TypeError: 'TermsEnum' object is not an iterator

I am aware that my question could be answered as suggested in this question. So I guess my real question is: how do I get all the terms from a TermsEnum?
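
The two errors in the attempts above can be reproduced without Lucene, which may help show what Python is complaining about. The sketch below is only a model: FakeTermsEnum, IterableView, _java_next and errors_for are hypothetical names, not part of PyLucene, and the model assumes the wrapper Python sees exposes neither a next attribute nor the __next__ hook.

```python
class FakeTermsEnum:
    """Hypothetical stand-in for the wrapper Python sees as TermsEnum.

    The Java-style advance method is deliberately not exposed under the
    names Python looks for (`next` or `__next__`).
    """

    def __init__(self, terms):
        self._terms = list(terms)
        self._pos = 0

    def _java_next(self):
        # Java convention: return the next term, or None (null) when exhausted.
        if self._pos >= len(self._terms):
            return None
        term = self._terms[self._pos]
        self._pos += 1
        return term


class IterableView:
    """Hypothetical analogue of a cast: same object, iterator protocol added."""

    def __init__(self, enum):
        self._enum = enum

    def __iter__(self):
        return self

    def __next__(self):
        term = self._enum._java_next()
        if term is None:
            raise StopIteration  # translate Java's null into Python's signal
        return term


def errors_for(obj):
    """Return the exception type names raised by the two attempts."""
    names = []
    try:
        obj.next()          # Attempt 1: attribute lookup fails
    except AttributeError as exc:
        names.append(type(exc).__name__)
    try:
        next(obj)           # Attempt 2: no __next__, so not an iterator
    except TypeError as exc:
        names.append(type(exc).__name__)
    return names


print(errors_for(FakeTermsEnum(["qwe", "rty"])))          # ['AttributeError', 'TypeError']
print(list(IterableView(FakeTermsEnum(["qwe", "rty"]))))  # ['qwe', 'rty']
```

Note that even with a real iterator, `while next(it):` ends by raising StopIteration rather than by evaluating to a false value, so a for loop is the safer shape either way.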


Solution

  • I found that the following works, based on what I found here and on test_FieldEnumeration() in the test_Pylucene.py file, which is in pylucene-8.6.1/test3/.

    for term in BytesRefIterator.cast_(terms_enum):
        print(term.utf8ToString())
    

    Happy to accept an answer that has more explanation than this.
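
As a partial explanation, and a slightly fuller version of the loop above: a likely reason the cast is needed (an assumption based on how JCC is documented to generate wrappers, not verified against its sources) is that JCC makes a wrapped class iterable only when that class itself declares a no-argument next() returning an object, with null ending the iteration. TermsEnum inherits next() from BytesRefIterator rather than declaring it, so casting to BytesRefIterator gives Python a wrapper it can loop over. The helper below is a minimal sketch; the name field_terms and its cast parameter are hypothetical, and the parameter exists only so the function can be tried without a JVM.

```python
def field_terms(reader, field, cast=None):
    """Return the sorted, de-duplicated terms of `field` across all segments.

    `cast` turns a TermsEnum into something Python can iterate over. It
    defaults to BytesRefIterator.cast_, the call used in the solution above.
    Assumes lucene.initVM() has already been called, as in the question.
    """
    if cast is None:
        # Imported lazily so this module loads even without PyLucene installed.
        import lucene  # noqa: F401  (same import order as in the question)
        from org.apache.lucene.util import BytesRefIterator
        cast = BytesRefIterator.cast_

    found = set()
    for leaf in reader.leaves():
        terms = leaf.reader().terms(field)
        if terms is None:
            # This segment has no postings for the field; skip it.
            continue
        for term in cast(terms.iterator()):
            # Convert immediately: the enum may reuse the underlying BytesRef.
            found.add(term.utf8ToString())
    return sorted(found)
```

With the reader opened as in the question, this would be called as field_terms(reader, "title"). A set is used because the same term can appear in more than one segment.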