
Problem with proximity search in Lucene: field "content" was indexed without position data


As the title says, when I try to run a proximity query I get this error:

    Exception in thread "main" java.lang.IllegalStateException: field "content" was indexed without position data; cannot run PhraseQuery (phrase=content:"to be not"~1)
        at org.apache.lucene.search.PhraseQuery$1.getPhraseMatcher(PhraseQuery.java:497)
        at org.apache.lucene.search.PhraseWeight.scorer(PhraseWeight.java:64)
        at org.apache.lucene.search.Weight.bulkScorer(Weight.java:166)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:731)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:655)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:649)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:487)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:501)
        at ProximitySearch.main(ProximitySearch.java:81)

Here is my code:

    public static void main(String[] args) throws IOException, ParseException {

        Analyzer analyzer = new StandardAnalyzer();

        List<KeyValuePairs> listOfDocs = new LinkedList<>();

        KeyValuePairs file1 = new KeyValuePairs("file1", "to be or not to be that is the question");
        KeyValuePairs file2 = new KeyValuePairs("file2", "make a long story short");
        KeyValuePairs file3 = new KeyValuePairs("file3", "see eye to eye");

        listOfDocs.add(file1);
        listOfDocs.add(file2);
        listOfDocs.add(file3);

        Path indexPath = Files.createTempDirectory("tempIndex");
        Directory directory = FSDirectory.open(indexPath);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        for (KeyValuePairs listOfDoc : listOfDocs) {
            Document doc = new Document();
            String text = listOfDoc.getKey();
            System.out.println(text);
            String title = listOfDoc.getValue();
            doc.add(new StringField("content", text, Field.Store.YES));
            doc.add(new Field("title", title, TextField.TYPE_STORED));
            iwriter.addDocument(doc);
        }
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);

        // Parse a proximity query: the terms "to be not" with a slop of 1:
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("\"to be not\"~1");

        ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
        System.out.println(Arrays.toString(Arrays.stream(hits).toArray()));
        System.out.println("Search terms found in :: " + hits.length + " files");

        ireader.close();
        directory.close();
        IOUtils.rm(indexPath);
    }

I don't know what I am doing wrong.


Solution

    Short Answer

    You cannot run proximity queries against data indexed as a StringField. You have to use a TextField.

    You did not show us the definition for KeyValuePairs, so I have made some assumptions below about that.

    (Small point: I would also suggest that you do not need to use LinkedList - you probably only need ArrayList.)
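
    As a minimal sketch of that change, the example below indexes the sentence as a TextField named content and keeps the file name in a StringField. It is not your original program: KeyValuePairs is replaced by a plain Map, the index lives in memory (ByteBuffersDirectory, which needs a Lucene version that ships it), and the names ProximitySearchFixed and countHits are made up for illustration. It also assumes the sentence, not the file name, is what you want to search.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical class name; a sketch of the fix, not the original program.
final class ProximitySearchFixed {

    /** Indexes (file name -> text) pairs in memory and returns the hit count for the query. */
    static int countHits(Map<String, String> docs, String queryText)
            throws IOException, ParseException {
        Analyzer analyzer = new StandardAnalyzer();
        try (Directory directory = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
                for (Map.Entry<String, String> entry : docs.entrySet()) {
                    Document doc = new Document();
                    // The file name is an identifier: one untokenized token is fine here.
                    doc.add(new StringField("title", entry.getKey(), Field.Store.YES));
                    // The sentence is analyzed into terms with positions, as proximity queries need.
                    doc.add(new TextField("content", entry.getValue(), Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("content", analyzer).parse(queryText);
                return searcher.search(query, 10).scoreDocs.length;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1", "to be or not to be that is the question");
        docs.put("file2", "make a long story short");
        docs.put("file3", "see eye to eye");

        int hits = countHits(docs, "\"to be not\"~1");
        System.out.println("Search terms found in :: " + hits + " files");
    }
}
```

    With the analyzer behavior implied by your stack trace (the query keeps all three terms), this should run without the IllegalStateException and match file1 only.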


    Longer Answer for More Background

    Your problem is related to the field types you are using.

    You have a document containing 2 fields: content and title.

    An example of data in the content field is "to be or not to be that is the question".

    You are attempting to run a proximity query against the content field.

    Remember that a StringField "is indexed but not tokenized: the entire String value is indexed as a single token" (this is the wording used in the StringField Javadoc).

    With a single token, that token's position is the only position there is, so position data would be meaningless and is not captured in the index.

    That is why your query throws that error. That query requires the data to be split up into separate tokens - and each token's position needs to be captured in the index.

    Therefore you need to use a TextField for that type of data.
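
    If you want to confirm this difference from code, the predefined field types expose their settings. The following is a small sketch using FieldType.tokenized() and FieldType.indexOptions(); the class name FieldTypeCheck and the helper indexesPositions are made up for this example.

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;

// Hypothetical helper class, only for inspecting the predefined field types.
final class FieldTypeCheck {

    /** True if the field type records term positions, which phrase and proximity queries need. */
    static boolean indexesPositions(FieldType type) {
        // IndexOptions is ordered from least to most detail, so compareTo works as a threshold.
        return type.indexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    }

    private static void describe(String name, FieldType type) {
        System.out.println(name
                + ": tokenized=" + type.tokenized()
                + ", indexOptions=" + type.indexOptions()
                + ", positions=" + indexesPositions(type));
    }

    public static void main(String[] args) {
        describe("StringField", StringField.TYPE_STORED);
        describe("TextField", TextField.TYPE_STORED);
    }
}
```

    StringField should report that it is not tokenized and indexes documents only, while TextField should report that it is tokenized and indexes frequencies and positions.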

    When you use a TextField for "to be or not to be that is the question", the StandardAnalyzer causes the following data to be captured in the index:

    field content
      term be
        doc 0
          freq 2
          pos 1
          pos 5
      term is
        doc 0
          freq 1
          pos 7
      term not
        doc 0
          freq 1
          pos 3
      term or
        doc 0
          freq 1
          pos 2
      term question
        doc 0
          freq 1
          pos 9
      term that
        doc 0
          freq 1
          pos 6
      term the
        doc 0
          freq 1
          pos 8
      term to
        doc 0
          freq 2
          pos 0
          pos 4
    

    You can see that the index now contains the required position data. The proximity query needs this position data to evaluate whether the words in your query are close enough to each other to match.
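
    To make the role of positions concrete, here is a plain-Java sketch with no Lucene dependency. It records term positions using a naive lowercase and whitespace split, then applies a simplified proximity rule: each term's position is shifted back by its index in the phrase, and the phrase matches if the spread of those shifted positions is at most the slop. This is an illustration under those assumptions, not Lucene's actual matcher, and all names in it are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; Lucene's real analysis and matching are more involved.
final class PositionalIndexSketch {

    /** Lower-cases, splits on whitespace, and records every position of every term. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    /** Simplified proximity check: spread of (position - index in phrase) must be <= slop. */
    static boolean matches(Map<String, List<Integer>> postings, List<String> phrase, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        return search(postings, phrase, slop, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Tries every combination of positions, one per phrase term (fine for tiny examples).
    // Simplification: a term repeated in the phrase may reuse the same position.
    private static boolean search(Map<String, List<Integer>> postings, List<String> phrase,
                                  int slop, int i, int min, int max) {
        if (i == phrase.size()) {
            return max - min <= slop;
        }
        List<Integer> candidates = postings.getOrDefault(phrase.get(i), Collections.emptyList());
        for (int pos : candidates) {
            int shifted = pos - i; // where the phrase would start if term i sat at this position
            if (search(postings, phrase, slop, i + 1,
                    Math.min(min, shifted), Math.max(max, shifted))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings =
                positions("to be or not to be that is the question");
        System.out.println(postings);
        List<String> phrase = Arrays.asList("to", "be", "not");
        System.out.println("slop 1: " + matches(postings, phrase, 1));
        System.out.println("slop 0: " + matches(postings, phrase, 0));
    }
}
```

    For the sample sentence, "to" (position 0), "be" (position 1) and "not" (position 3) are one position away from forming the exact phrase, so the phrase matches with a slop of 1 but not with a slop of 0. Without recorded positions there is nothing to run this comparison on, which is what the exception is reporting.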

    And just for completeness, here is what you get in the index if you use StringField instead of TextField:

    doc 0
      field 0
        name content
        type string
        value to be or not to be that is the question
    

    As you can see, there is only one token and no position data.