Make a GitHub-like file finder using Lucene

I have to build a file finder using Lucene, and I thought of using a wildcard query.

Text in a document: lucene/queryparser/docs/xml/img/plus.gif

Search string: lqdocspg

It should find:
lucene/queryparser/docs/xml/img/plus.gif (it is the only document for now, so it should report that it found 1 match.)

Here is my code:

public static void main(String[] args) throws IOException, ParseException {
    Analyzer analyzer = new StandardAnalyzer();

    Path indexPath = Files.createTempDirectory("tempIndex");
    Directory directory = FSDirectory.open(indexPath);
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "lucene/queryparser/docs/xml/img/plus.gif";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Earlier attempt via the query parser, left commented out:
    //QueryParser parser = new QueryParser("fieldname", analyzer);
    //Query query = parser.parse("l*q*d*o*c*s*p*g*");
    Query query = new WildcardQuery(new Term("fieldname", "*l"));
    ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
    System.out.println(isearcher.doc(0).get("fieldname"));
    System.out.println("Search terms found in :: " + hits.length + " files");
    assertEquals(1, hits.length);



    // Iterate through the results:
    for (ScoreDoc hit : hits) {
        Document hitDoc = isearcher.doc(hit.doc);
        assertEquals("lucene/queryparser/docs/xml/img/plus.gif", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();
    IOUtils.rm(indexPath);
}

When I pass *l, l*, lucene*, *lucene, or something different such as q*, it works and reports a match in one file. But when I pass what I actually want to find, which is *l*q*d*o*c*s*p*g*, it returns 0 matches. I don't know what I am doing wrong. An asterisk means that anything can appear between the letters, right?

Solution

The problem here is that you are using a TextField:

doc.add(new Field("fieldname", text, TextField.TYPE_STORED));

This field, when used with the standard analyzer, causes the contents of that field to be tokenized into the following separate tokens in the Lucene index (in alphabetic order):

Input value: lucene/queryparser/docs/xml/img/plus.gif

Resulting indexed tokens:

docs
img
lucene
plus.gif
queryparser
xml

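Now consider your wildcard query l*q*d*o*c*s*p*g*. A wildcard query is compared against each indexed token individually, and the pattern has to match a whole token. No single token above starts with l and then contains q, d, o, c, s, p and g in that order, so this query term does not match any of the tokens in the index.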

But a query such as l* matches one token: lucene - so that will find your one and only document.
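
To make the per-token matching concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It assumes wildcard semantics where * stands for any run of characters (including none), ? stands for exactly one character, and the pattern has to cover the whole term. The class WildcardTokenDemo and its helper methods are hypothetical names made up for this illustration, and the regex translation is only a stand-in for whatever Lucene does internally.

```java
import java.util.List;
import java.util.regex.Pattern;

class WildcardTokenDemo {

    // Translate a wildcard pattern into a regex (illustration only, not Lucene code):
    // '*' becomes ".*", '?' becomes ".", every other character is matched literally.
    public static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // Count how many terms are matched in full by the wildcard pattern.
    public static long countMatches(String wildcard, List<String> terms) {
        Pattern p = toRegex(wildcard);
        return terms.stream().filter(t -> p.matcher(t).matches()).count();
    }

    public static void main(String[] args) {
        // Tokens listed in the answer for the analyzed TextField.
        List<String> tokens = List.of("docs", "img", "lucene", "plus.gif", "queryparser", "xml");
        // The same value kept as one untokenized term.
        List<String> wholePath = List.of("lucene/queryparser/docs/xml/img/plus.gif");

        System.out.println(countMatches("l*q*d*o*c*s*p*g*", tokens));    // prints 0
        System.out.println(countMatches("l*", tokens));                  // prints 1
        System.out.println(countMatches("l*q*d*o*c*s*p*g*", wholePath)); // prints 1
    }
}
```

The long pattern fails against the token list because each token is tested on its own, while the same pattern succeeds once the path is a single term.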

There are various ways to solve this, but in your specific case, given that you want to treat lucene/queryparser/docs/xml/img/plus.gif as a single string, you can use a StringField instead of a TextField. This class does not split the input into tokens. Instead it indexes the input verbatim, without applying the StandardAnalyzer. That also means the value is not lowercased, so matching against this field is case-sensitive.

doc.add(new StringField("fieldname", text, Field.Store.YES));

This generates a single token in the index:

lucene/queryparser/docs/xml/img/plus.gif

Now you can see that your wildcard query l*q*d*o*c*s*p*g* should (and does) match that token.
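
The search string from the question (lqdocspg) still has to be turned into the pattern *l*q*d*o*c*s*p*g* before it is handed to WildcardQuery. A minimal sketch of that step follows. FuzzyPathPattern and toWildcard are hypothetical names for this illustration, and escaping *, ? and \ with a backslash assumes the escape character that WildcardQuery uses for literal wildcard characters.

```java
class FuzzyPathPattern {

    // Builds "*a*b*c*" from "abc": every typed character must appear in order,
    // with anything (or nothing) allowed before, between, and after them.
    public static String toWildcard(String typed) {
        StringBuilder sb = new StringBuilder("*");
        for (char c : typed.toCharArray()) {
            // Escape characters that would otherwise act as wildcards themselves.
            if (c == '*' || c == '?' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c).append('*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toWildcard("lqdocspg")); // prints *l*q*d*o*c*s*p*g*
    }
}
```

The returned string would then be passed along as new WildcardQuery(new Term("fieldname", pattern)). Keep in mind that patterns with a leading * can be slow on a large index, since many terms of the field may have to be examined.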

See the following:

Field - specifically, take a look at the list of subclasses. These are "predefined" fields you can use without needing to write new Field(...), like my new StringField(...) example above.
TextField - "A field that is indexed and tokenized, without term vectors. For example this would be used on a 'body' field, that contains the bulk of a document's text."
StringField - "A field that is indexed but not tokenized: the entire String value is indexed as a single token."