I have to build a file finder using Lucene, and I thought of using a wildcard query.
Text in a document: lucene/queryparser/docs/xml/img/plus.gif
Search string: lqdocspg
It should find:
lucene/queryparser/docs/xml/img/plus.gif
(It is the only document for now, so the search should report 1 match.)
Here is my code:
public static void main(String[] args) throws IOException, ParseException {
Analyzer analyzer = new StandardAnalyzer();
Path indexPath = Files.createTempDirectory("tempIndex");
Directory directory = FSDirectory.open(indexPath);
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "lucene/queryparser/docs/xml/img/plus.gif";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
//QueryParser parser = new QueryParser("fieldname", analyzer);
//Query query = parser.parse("l*q*d*o*c*s*p*g*");
Query query = new WildcardQuery(new Term("fieldname", "*l"));
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
System.out.println(isearcher.doc(0).get("fieldname"));
System.out.println("Search terms found in :: " + hits.length + " files");
assertEquals(1, hits.length);
// Iterate through the results:
for (ScoreDoc hit : hits) {
Document hitDoc = isearcher.doc(hit.doc);
assertEquals("lucene/queryparser/docs/xml/img/plus.gif", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
IOUtils.rm(indexPath);
}
When I pass *l, l*, lucene*, *lucene, or something different such as q*, it works and reports a match in one file.
But when I pass what I actually want to find, which is *l*q*d*o*c*s*p*g*, it returns 0 matches. I don't know what I am doing wrong. An asterisk means that anything can appear between the letters, right?
The problem here is that you are using a TextField:
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
When this field is used with the StandardAnalyzer, its contents are split into the following separate tokens in the Lucene index (listed in alphabetical order):
Input value: lucene/queryparser/docs/xml/img/plus.gif
Resulting indexed tokens:
docs
img
lucene
plus.gif
queryparser
xml
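A wildcard query is compared against each indexed token individually, not against the original field value. So when you consider your wildcard query l*q*d*o*c*s*p*g*, you can see that this single query term does not match any of the tokens in the index.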
But a query such as l* matches one token, lucene, so that will find your one and only document.
There are various ways to solve this, but in your specific case, given that you want to treat lucene/queryparser/docs/xml/img/plus.gif as a single string, you can use a StringField instead of a TextField. This class does not split the input into tokens. Instead, it indexes the input as-is, without applying the StandardAnalyzer.
doc.add(new StringField(FIELD_NAME, documentBody, Field.Store.YES));
This generates a single token in the index:
lucene/queryparser/docs/xml/img/plus.gif
Now you can see that your wildcard query l*q*d*o*c*s*p*g* should (and does) match that token.
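To make the difference concrete, here is a minimal plain-Java sketch that imitates the matching without using Lucene at all. It assumes the token list shown earlier for the TextField case and a single token for the StringField case. The class name WildcardTokenDemo and its helper methods are hypothetical names made up for this illustration, and the wildcard-to-regex translation only approximates Lucene's WildcardQuery (it handles * and ? but ignores escape characters).

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: imitates wildcard matching against indexed terms. Not Lucene code.
class WildcardTokenDemo {

    // Assumed terms produced by TextField + StandardAnalyzer (from the list above).
    static final List<String> TEXT_FIELD_TERMS =
            List.of("docs", "img", "lucene", "plus.gif", "queryparser", "xml");

    // StringField indexes the whole value as one term.
    static final List<String> STRING_FIELD_TERMS =
            List.of("lucene/queryparser/docs/xml/img/plus.gif");

    // Translate a wildcard pattern into a regex: * = any run of characters, ? = one character.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // A document is found only if at least one single term matches the entire pattern.
    static boolean matchesAny(String wildcard, List<String> terms) {
        Pattern p = toRegex(wildcard);
        return terms.stream().anyMatch(t -> p.matcher(t).matches());
    }

    public static void main(String[] args) {
        String query = "l*q*d*o*c*s*p*g*";
        System.out.println("l* vs TextField terms: " + matchesAny("l*", TEXT_FIELD_TERMS));
        System.out.println(query + " vs TextField terms: " + matchesAny(query, TEXT_FIELD_TERMS));
        System.out.println(query + " vs StringField term: " + matchesAny(query, STRING_FIELD_TERMS));
    }
}
```

With the tokenized terms, only lucene starts with l, and it contains no q, so the long pattern fails on every term. Against the single untokenized term, each letter of the pattern can be found in order, so it matches.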
See the following:
Field - specifically, take a look at the list of subclasses. These are "predefined" fields you can use without needing to use new Field(...), like my new StringField(...) example above.
TextField - "A field that is indexed and tokenized, without term vectors. For example this would be used on a 'body' field, that contains the bulk of a document's text."
StringField - "A field that is indexed but not tokenized: the entire String value is indexed as a single token."