Lucene.net basics


I'm struggling with some of the basic concepts, so would be grateful for an explanation of which type of fields to use in what situation (e.g. String or Text), and any relevant parameters (e.g. Field.Store.YES|NO). It would also be helpful to understand which analysers to use and when. Here are the scenarios that I'm likely to need to support:

  1. My documents will include a username field, but I only want users to search for "full" strings within this. In other words, if there are usernames such as "Andy", "Andrew", "Andrea", etc., then a user will be required to search for a full name (e.g. "andrew"); partial text searches such as "and" should not be allowed. (This is low priority so I'm not too concerned if this isn't possible).

  2. Most of the time I will want to allow partial searches. Most fields will be fairly short, say <50 characters, if that matters. Can spaces be included in searches? E.g. if a field contains the text "The quick brown fox..." then I'd like to be able to search for "brown fox" (or even "own fo").

  3. The documents I'm storing will have numerous string fields - most will be short as above, but there could be one or two longer "notes" type fields (no more than 100 words). Rather than provide users with a way to search these individual fields, I'd instead like to provide a single "free text" textbox, which would be used to search across all of these string fields. Again this should allow partial matches. (I could just concatenate the strings then store in one searchable field, but I wondered if Lucene provides a "better" approach, especially as I'd still have to store the individual fields for later display purposes).

  4. There will be some fields that I want to store in the document for display purposes when displaying the search results, but don't need to be searched on (not just strings, but also numbers and dates, if that makes a difference).

Regarding 2 & 3, does Lucene require me to include wildcard characters in search text, e.g. adding a "*" at the start and end?

And finally, are text searches case-insensitive? If not then I could store two fields - one containing the lowercase text for search purposes, and one containing the original text (for display purposes).

I'm using Lucene.Net v4.8 if that makes a difference.


Solution

  • I need to note, up front, that you should really only be asking one question per question - and there are several separate questions contained in your question.

    Also, Stack Overflow does not really function very well (nor is it meant to) as a tutorial provider.

    Finally, there is nothing in your question which shows what you have already researched, or tried (and no code illustrating any specific problem you have encountered). You may have actually done a lot of research, but none of it is shown here.


    Having said all that, here are some notes and pointers:

    which type of fields to use in what situation (e.g. String or Text)

    Take a look at the documentation for Field, where there is a list of the core field types. For TextField you get "a field that is indexed and tokenized"; whereas for StringField, you get "a field that is indexed but not tokenized".

    So, TextField is the usual choice for most uses; but if you have a field containing, for example, an ID or a SKU - and if you only ever want to search on that field's entire value - then StringField is a better choice. It never needs to be tokenized - so don't spend effort tokenizing it.
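
    To make the tokenized versus not-tokenized distinction concrete, here is a minimal sketch in plain Python. It is not the Lucene.NET API: the two helper functions are hypothetical, and the whitespace split plus lowercasing only stands in for whatever analyzer you would really use.

```python
# Plain-Python illustration only; this is not the Lucene.NET API.
# Both helper names are hypothetical.

def index_as_text(value):
    """Mimic a tokenized field: split on whitespace and lowercase each token."""
    return {token.lower() for token in value.split()}


def index_as_string(value):
    """Mimic an untokenized field: the whole value is kept as a single term."""
    return {value}


text_terms = index_as_text("The quick brown fox")
string_terms = index_as_string("SKU-12345 Rev B")

print("brown" in text_terms)               # True: a single word is its own term
print("SKU-12345" in string_terms)         # False: part of the value is not a term
print("SKU-12345 Rev B" in string_terms)   # True: only the full value matches
```

    The point is only that a tokenized field can be matched word by word, while an untokenized field matches on its complete value.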

    I only want users to search for "full" strings... a user will be required to search for a full name

    I don't have a good approach for this requirement. Someone else may - but this is a great example of where you should ask a new, separate question (and show your attempt, expected results, and actual results).

    There are stemmers supported by Lucene which can, for example, allow you to search for fish and still find instances of fishing, fished, etc. But I don't know of any such thing which can handle proper nouns (e.g. people's names).

    Also, Lucene supports synonyms (you could treat Andy as a synonym for Andrew), but you would have to collect those up-front. That is probably unrealistic.

    Can spaces be included in searches?

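    Yes - see the "Terms" section of the Query Parser Syntax documentation. There, it describes two types of terms: a single term, which is a single word such as hello, and a phrase, which is a group of words surrounded by double quotes such as "hello dolly".

    So, potentially a user could enter hello dolly into a search field to search for the two words as separate terms, or "hello dolly" to search for the exact phrase - giving two different sets of results.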

    or even "own fo"

    Take a look at ngram analyzers which allow you to split a single token into multiple sub-tokens. For example, the token "abc" can be split into "a", "ab", "abc", "b", "bc", "c".
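
    As a rough illustration of that splitting, here is a plain-Python sketch. The function name and its min_size/max_size parameters are hypothetical and are not part of Lucene; an n-gram analyzer would do the equivalent work for you at index time.

```python
# Plain-Python illustration of n-gram splitting; not the Lucene.NET API.

def ngrams(token, min_size=1, max_size=3):
    """Return every substring of `token` whose length is min_size..max_size."""
    grams = []
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(token):
                break  # longer grams from this start would run off the end
            grams.append(token[start:end])
    return grams


print(ngrams("abc"))          # ['a', 'ab', 'abc', 'b', 'bc', 'c']
print(ngrams("brown", 3, 3))  # ['bro', 'row', 'own']
```

    With 3-character grams indexed for "brown", a search for "own" can match without any wildcard. The trade-off is a larger index, since every token produces several sub-tokens.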

    provide a single "free text" textbox

    Yes you can do this, without having to duplicate all the data in your index. You would take the "free text" search term provided by the user and you would build a single Lucene query which uses that search term across all fields.

    field_one:(foo bar) OR field_two:(foo bar) OR field_three...
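
    A minimal sketch of building such a query string from one search box follows, in plain Python. The function name and field names are hypothetical. It wraps the search text in parentheses because, in the classic query parser syntax, a field prefix applies only to the term that directly follows it. It does not escape special query characters in the user's input, which a real implementation would need to handle.

```python
# Plain-Python illustration of assembling a multi-field query string.
# The function and field names are hypothetical.

def build_multi_field_query(search_text, field_names):
    """Apply the same search text to every field and join the clauses with OR."""
    clauses = [f"{name}:({search_text})" for name in field_names]
    return " OR ".join(clauses)


query = build_multi_field_query("foo bar", ["field_one", "field_two", "notes"])
print(query)
# field_one:(foo bar) OR field_two:(foo bar) OR notes:(foo bar)
```

    The resulting string would then be handed to the query parser, so the individual fields stay in the index as they are and no concatenated copy is needed.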
    

    does Lucene require me to include wildcard characters in search text

    It does not require this - and there may often be better alternatives. (By the way, a wildcard at the start of a search term may be disallowed because of its performance implications.)

    are text searches case-insensitive?

    It depends on what analyzer you use. For example, take a look at the StandardAnalyzer, where the documentation states:

    Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

    You can look at the documentation of each of those contained classes - but you can see that the StandardAnalyzer includes a LowerCaseFilter, where every token is converted to lowercase during indexing.

    So, in this case, if you index using the StandardAnalyzer, you will also want to pass your search terms through the same StandardAnalyzer to achieve a case-insensitive search (both the indexed terms and the search terms will be lowercase).

    This StandardAnalyzer also contains a stop word list, by the way - so you can use that to remove "and", "or", "a" and other terms which are just "noise" and not useful for searching.
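
    The idea of running the same analysis at index time and at search time can be sketched in plain Python. This is not StandardAnalyzer: the analyze function and the short stop word list are hypothetical stand-ins for lowercasing and stop word removal.

```python
# Plain-Python illustration of analyzing indexed text and search text the same way.
# STOP_WORDS is a hypothetical list, not the one shipped with Lucene.

STOP_WORDS = {"a", "and", "or", "the"}


def analyze(text):
    """Split on whitespace, lowercase each token, and drop stop words."""
    tokens = [token.lower() for token in text.split()]
    return [token for token in tokens if token not in STOP_WORDS]


indexed_terms = analyze("The Quick Brown Fox")
query_terms = analyze("BROWN")

print(indexed_terms)  # ['quick', 'brown', 'fox']
print(query_terms)    # ['brown']
print(all(term in indexed_terms for term in query_terms))  # True
```

    Because both sides are lowercased by the same function, "BROWN" matches "Brown" without storing a second, lowercased copy of the field.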

    one containing the original text (for display purposes)

    The original (un-analyzed and un-tokenized) text can be stored alongside the indexed data. That is what Field.Store.YES|NO is designed to do. See the documentation for Field.Store, where it states:

    this is useful for short texts like a document's title which should be displayed with the results. The value is stored in its original form, i.e. no analyzer is used before it is stored.

    (Maybe worth noting: If the original data is already stored somewhere else (for example, in a relational database) then you may not want to also store the original data in Lucene. Instead, just store the primary key of the relational data in Lucene using Field.Store.YES. When Lucene returns a set of matches for a given search, it also returns that PK value as part of its results. Before you display those results to the user, there is an extra step to retrieve the original text from the DB. You combine the two and then display that to the user.)
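
    That extra lookup step might look something like the following plain-Python sketch. Everything here is hypothetical: the dictionary stands in for the relational database, and the list of hits stands in for whatever the search returns (a stored primary key plus a score).

```python
# Plain-Python illustration of combining search hits with rows from a database.
# `database` and `search_hits` are hypothetical stand-ins.

database = {
    101: "The quick brown fox",
    102: "A slow brown bear",
}

# Each hit carries only the stored primary key and a relevance score.
search_hits = [
    {"pk": 102, "score": 1.7},
    {"pk": 101, "score": 0.9},
]


def attach_original_text(hits, db):
    """Look up each hit's primary key and add the original text for display."""
    return [
        {"pk": hit["pk"], "score": hit["score"], "text": db[hit["pk"]]}
        for hit in hits
    ]


for row in attach_original_text(search_hits, database):
    print(row["score"], row["text"])
```

    The order of the hits is preserved, so the results can still be displayed by relevance.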