lucene solr

How to do Lucene search with spaceless query?


Document document = new Document();
document.add(new Field("ID", "100", Field.Store.YES, Field.Index.NOT_ANALYZED));
document.add(new Field("TEMPLATE_CONTENT", "dummy Just {#var#} testing a spaceless {#var#} setup dummy",
                Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(document);

I am indexing "dummy Just {#var#} testing a spaceless {#var#} setup dummy" with Lucene, but when querying I am using one of the spaceless sentences below:

dummyJustatestingaspacelessfreakingsetupdummy

                      or 

dummyjustatestingaspacelessfreakingsetupdummy

I am not able to get a single match against the TEMPLATE_CONTENT above.

I am using the code below to search:

        query = new QueryParser(Version.LUCENE_36, "TEMPLATE_CONTENT", new StandardAnalyzer(Version.LUCENE_36))
                .parse(searchQuery);
        searcher = new IndexSearcher(index, true);
        System.out.println("......query : " + query + "\n");
        long startTime = System.currentTimeMillis();
        results = searcher.search(query, 2);
        long endTime = System.currentTimeMillis();
        System.out.println("results time taken " + (endTime - startTime) + " ms");
        for (ScoreDoc scoreDoc : results.scoreDocs) {
            System.out.println("scoreDoc : " + scoreDoc);
            Document document = searcher.doc(scoreDoc.doc);
            System.out.println("Found match: " + document.get("TEMPLATE_CONTENT") + "\n");
        }

Please help me get at least one match.


Solution

  • Could you follow this approach and see if this helps?

    To match spaceless sentences at search time, the text has to be analyzed and indexed in the same spaceless form as the query. One way to achieve this is a custom analyzer that does not split on whitespace at all: it emits the whole field value as a single token with the whitespace stripped out. Written against the Lucene 3.6 API used in your question, it could look like this:

    import java.io.IOException;
    import java.io.Reader;
    
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;
    
    public class SpacelessAnalyzer extends Analyzer {
    
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // One whitespace-free token for the whole field value, then lowercased
            return new LowerCaseFilter(Version.LUCENE_36, new SpacelessTokenizer(reader));
        }
    
        private static final class SpacelessTokenizer extends Tokenizer {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private boolean done = false;
    
            SpacelessTokenizer(Reader input) {
                super(input);
            }
    
            @Override
            public boolean incrementToken() throws IOException {
                if (done) {
                    return false;
                }
                clearAttributes();
                done = true;
    
                // Read the entire input and drop whitespace so the indexed term
                // ends up in the same spaceless form as the query
                char[] buffer = new char[256];
                int length;
                while ((length = input.read(buffer)) > 0) {
                    for (int i = 0; i < length; i++) {
                        if (!Character.isWhitespace(buffer[i])) {
                            termAtt.append(buffer[i]);
                        }
                    }
                }
                return termAtt.length() > 0;
            }
    
            @Override
            public void reset() throws IOException {
                super.reset();
                done = false;
            }
        }
    }

    Now you can use the analyzer when indexing your document:

    Analyzer analyzer = new SpacelessAnalyzer();
    document.add(new Field("TEMPLATE_CONTENT", "dummy Just {#var#} testing a spaceless {#var#} setup dummy",
                    Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
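
    The custom analyzer only takes effect if the IndexWriter itself is built with it. As a minimal sketch of that wiring, reusing the analyzer and document from the snippets above and assuming directory stands for the same Directory your existing writer already uses:

    // Build the writer with the same analyzer that will be used at query time
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(directory, config);
    writer.addDocument(document);
    writer.close();

    Any documents that were already indexed with the StandardAnalyzer have to be re-indexed, because their terms were produced by the old analysis chain.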

    and when searching:

    QueryParser queryParser = new QueryParser(Version.LUCENE_36, "TEMPLATE_CONTENT", new SpacelessAnalyzer());
    Query query = queryParser.parse(searchQuery);

    With this, you should now be able to index and search spaceless sentences, because both the indexed text and the query are reduced to a single lowercased term with no whitespace in it.
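
    If you still do not see a match, a quick way to debug is to print the exact term the analyzer produces for the stored text and for the query string; the two have to be identical for the resulting TermQuery to hit. A small, hypothetical AnalyzerDebug class for that check (the sample strings are just the ones from the question):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDebug {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new SpacelessAnalyzer();
            String[] samples = {
                    "dummy Just {#var#} testing a spaceless {#var#} setup dummy",
                    "dummyJustatestingaspacelessfreakingsetupdummy" };
            for (String text : samples) {
                TokenStream ts = analyzer.tokenStream("TEMPLATE_CONTENT", new StringReader(text));
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints the single spaceless, lowercased term this text collapses to
                    System.out.println(text + "  ->  " + term.toString());
                }
                ts.close();
            }
        }
    }

    If the two printed terms differ only where the {#var#} placeholders sit, the analyzer is doing its job and the remaining mismatch comes from the data itself rather than from the analysis.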