I am trying to get all the terms and their postings (a Terms object) from a field of a Lucene document (cf. How to calculate term frequency in Lucene?). According to the documentation, there is a method for that:
public final Terms getTermVector(int docID, String field) throws IOException
Retrieve term vector for this document and field, or null if term vectors were not indexed. The returned Fields instance acts like a single-document inverted index (the docID will be 0).
The method takes a parameter int docID. What is it? For a given document, which field is its id, and how does Lucene recognize it?
Following Lucene's documentation, I used a StringField as the id, and it is not an int.
import org.apache.lucene.document.*;
Document doc = new Document();
Field idField = new StringField("id",post.Id,Field.Store.YES);
Field bodyField = new TextField("body", post.Body, Field.Store.YES);
doc.add(idField);
doc.add(bodyField);
I have five questions accordingly:
1. Is the id field used as the docID for this document? Does Lucene do that at all?
2. I used a String for id, but this method takes an int. Does that cause a problem?
3. The body field is a TextField. Is there any way to retrieve the term vector (Terms) of that field? I don't want to re-index my doc as explained here, because it is too large (35 GB).
4. [...] TextField?

To calculate term frequency we can use IndexReader.getTermVector(int docID, String field). The int docID parameter refers to the internal document ID that Lucene itself assigns, not to a field you add to the document. You can retrieve the docID of each search hit with the following code:
String index = "index/AIndex/";
String query = "the query text";
int requiredHits = 10; // maximum number of hits to fetch (example value)
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("docField", analyzer);
Query lQuery = parser.parse(query);
TopDocs results = searcher.search(lQuery, requiredHits);
ScoreDoc[] hits = results.scoreDocs;
// scoreDocs holds at most requiredHits entries, so iterate over its length
// rather than over results.totalHits, which counts every matching document.
for (int i = 0; i < hits.length; i++)
{
    int docID = hits[i].doc; // Lucene's internal document ID
    Terms termVector = reader.getTermVector(docID, "docField"); // null if term vectors were not indexed
}
Each termVector object holds the terms of one document field together with their frequencies. You can read them with the following code:
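// Class-level field: maps each term to its frequency in the document field
private HashMap<String,Long> termsFrequency = new HashMap<>();

// Inside the method that holds termVector (check it for null first):
TermsEnum itr = termVector.iterator();
long allTermFrequency = 0; // sum of all term frequencies in the field
BytesRef term;
while ((term = itr.next()) != null) {
    String termText = term.utf8ToString();
    // For a single-document term vector this is the term's frequency in that document
    long tf = itr.totalTermFreq();
    termsFrequency.put(termText, tf);
    allTermFrequency += tf;
}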
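Once termsFrequency is filled, the raw counts can be turned into relative frequencies by dividing each one by the total (the role allTermFrequency plays above). The following is only a sketch in plain Java with no Lucene dependency; the class RelativeTermFrequency and its normalize method are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class RelativeTermFrequency {

    // Divides each raw term count by the sum of all counts.
    static Map<String, Double> normalize(Map<String, Long> termsFrequency) {
        long total = 0;
        for (long tf : termsFrequency.values()) {
            total += tf;
        }
        Map<String, Double> relative = new HashMap<>();
        if (total == 0) {
            return relative; // nothing to normalize
        }
        for (Map.Entry<String, Long> e : termsFrequency.entrySet()) {
            relative.put(e.getKey(), (double) e.getValue() / total);
        }
        return relative;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("lucene", 3L);
        counts.put("index", 1L);
        // "lucene" accounts for 3 of the 4 term occurrences
        System.out.println(normalize(counts).get("lucene"));
    }
}
```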
Note: Don't forget to enable term vector storage, as I explained here (or this one), when you index your documents. If a document is indexed without storing term vectors, getTermVector will return null. All of Lucene's predefined Field types have this option disabled by default, so you need to set it yourself.
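As a rough sketch of what enabling that option can look like at indexing time: copy the settings of TextField.TYPE_STORED into a custom FieldType and turn on term vectors. The class name TermVectorDocuments and the method build are hypothetical; the field names "id" and "body" follow the question, and the Lucene core library is assumed to be on the classpath.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical helper class, shown for illustration only.
final class TermVectorDocuments {

    static Document build(String id, String body) {
        // Start from the stored TextField settings and additionally store term vectors.
        FieldType bodyType = new FieldType(TextField.TYPE_STORED);
        bodyType.setStoreTermVectors(true);
        bodyType.freeze(); // prevent further changes to the type

        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new Field("body", body, bodyType));
        return doc;
    }
}
```

Documents that were already indexed without term vectors are not affected by this; the setting only applies to documents added with the new field type.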