lucene

Lucene StringField or KeywordAnalyzer?


I'm still fairly green in my understanding of Lucene, but I understand that a StringField is indexed but not tokenised, so the original string is stored "as-is" rather than getting broken up. Does this mean that Lucene effectively ignores whatever analyzer is being used when indexing a StringField?

It seems that KeywordAnalyzer performs the same function as StringField, i.e. it keeps the original text as a single token rather than tokenising it. When would you use one over the other? It seems odd to use (say) a TextField + KeywordAnalyzer when you could just use a StringField.


Solution

  • Does this mean that Lucene effectively ignores whatever analyzer is being used when indexing a StringField?

    Yes, that is correct.


  • KeywordAnalyzer and StringField: When would you use one over the other?

    The most common scenario I am aware of for needing KeywordAnalyzer is at search time, not necessarily at index time. In that case it's less about "using one over the other"; it's more about needing KeywordAnalyzer because of how you have chosen to use StringField.

    Consider the following scenario:

    You have a StringField named postal_code for data which naturally contains spaces - for example values such as SW1A 1AA. You want to preserve those spaces.

    After indexing, the field contains untokenized values (not SW1A and 1AA, but just SW1A 1AA). Let's assume there are also other fields which are tokenized in a more traditional way, using TextField fields and the StandardAnalyzer.

    Now you want to search your data.

    Let's assume you want to search on that postal_code field. It has to be an exact search, since the data is not tokenized. Let's ignore questions of upper/lower case for now.

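    If you parse the query postal_code:"SW1A 1AA" with a tokenizing analyzer such as StandardAnalyzer, the search will fail. In Java, at least, it actually throws a runtime error (an IllegalStateException complaining that the field was indexed without position data). I assume the same is true of Lucene.NET.

    This is because the query contains "SW1A 1AA", which is a phrase: a value enclosed in double quotes that the analyzer splits into more than one token.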

    Phrases require position data to be created in the index, so that Lucene knows it needs to find the token SW1A immediately followed by the token 1AA. However, the StringField named postal_code does not generate any such positional data (unlike regular TextField fields).
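
    To make the role of position data concrete, here is a minimal toy model in plain Java. It is not Lucene's API: the class ToyField and its methods are hypothetical names invented for this sketch, and the whitespace splitting stands in for a real analyzer. It only illustrates why a phrase lookup needs positions that an untokenized field never records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only (hypothetical names, NOT Lucene's API).
final class ToyField {
    // term -> positions where the term occurred (left empty when positions are not recorded)
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final boolean hasPositions;

    private ToyField(boolean hasPositions) {
        this.hasPositions = hasPositions;
    }

    // Models an untokenized field: the whole value is one term, no positions.
    static ToyField untokenized(String value) {
        ToyField field = new ToyField(false);
        field.postings.put(value, new ArrayList<>());
        return field;
    }

    // Models a tokenized field: split on whitespace and record each token's position.
    static ToyField tokenized(String value) {
        ToyField field = new ToyField(true);
        String[] tokens = value.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            field.postings.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        return field;
    }

    // Exact single-term lookup: works for both kinds of field.
    boolean matchesTerm(String term) {
        return postings.containsKey(term);
    }

    // Two-term phrase lookup: 'second' must occur immediately after 'first'.
    boolean matchesPhrase(String first, String second) {
        if (!hasPositions) {
            throw new IllegalStateException("field was indexed without position data");
        }
        List<Integer> firstPositions = postings.getOrDefault(first, List.of());
        List<Integer> secondPositions = postings.getOrDefault(second, List.of());
        for (int position : firstPositions) {
            if (secondPositions.contains(position + 1)) {
                return true;
            }
        }
        return false;
    }
}
```

    In this sketch, ToyField.untokenized("SW1A 1AA") can be found with matchesTerm("SW1A 1AA") but throws on matchesPhrase("SW1A", "1AA"), whereas ToyField.tokenized("SW1A 1AA") answers the phrase lookup from its recorded positions.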

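    So instead, you can use the KeywordAnalyzer at search time for this specific field. KeywordAnalyzer emits the whole input as a single token, so the quoted value becomes a single-term query rather than a phrase query, and your search will work as expected.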

    This example may seem a bit contrived, and I would not disagree. In Lucene there is often more than one way to get what you want. It's probably more likely that you would use StringField and clean up the data before it is indexed: for example, removing spaces from postal codes or hyphens from credit card numbers. In that case you are searching for a simple term, not a phrase.
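
    A minimal sketch of that kind of clean-up in plain Java follows. The class and method names (ExactMatchKeys, postalCode, cardNumber) are hypothetical, and the rules are only the two examples mentioned above. Whatever normalization you choose has to be applied both to the value you index and to the value you search for.

```java
// Hypothetical helper for this sketch; not part of Lucene.
final class ExactMatchKeys {
    private ExactMatchKeys() {
    }

    // Remove all whitespace, e.g. "SW1A 1AA" becomes "SW1A1AA".
    static String postalCode(String raw) {
        return raw.replaceAll("\\s+", "");
    }

    // Remove whitespace and hyphens from a card number.
    static String cardNumber(String raw) {
        return raw.replaceAll("[\\s-]+", "");
    }
}
```

    The normalized string would then be the value placed in the StringField and the value used in the term query.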

    Other more experienced users of Lucene may have additional (more compelling) scenarios for when to use KeywordAnalyzer.


    Just to follow up on that point about positional data: by default, TextField fields not only tokenize the data but also record the position of each token in the original text, which enables a range of different query types. Proximity searches are one example.

    The phrase "foo bar" in a search is just a specific type of proximity search.
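
    As a rough illustration of that relationship, here is a toy proximity check in plain Java. It is a simplification and not Lucene's implementation: ToyProximity and within are hypothetical names, tokens are split on whitespace, and only the in-order case is handled. The slop parameter is the number of other tokens allowed between the two terms, so a slop of 0 behaves like an exact two-word phrase.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only (hypothetical names, NOT Lucene's API or its slop algorithm).
final class ToyProximity {
    private ToyProximity() {
    }

    // True if 'second' occurs after 'first' with at most 'slop' tokens in between.
    static boolean within(String text, String first, String second, int slop) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                // j - i - 1 is the number of tokens between the two matches
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

    With this sketch, within("foo baz bar", "foo", "bar", 0) is false because one token separates the terms, while the same call with a slop of 1 is true.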