
Does Lucene support multilevel indexing?


The data I want to index is structured like this:

{
  "EmailId": "1",  // should be stored
  "EmailText": "hello world",
  "Attachments": [
    {
      "AttachmentId": "1",  // should be stored
      "FileName": "hello.txt",  // should be stored
      "AttachmentText": "this is first attachment text"
    },
    {
      "AttachmentId": "2",
      "FileName": "welcome.xlsx",
      "AttachmentText": "this is second attachment text"
    }
  ]
}

I could maintain separate indexes for the email body and the attachment text, but is there a way to do multilevel indexing like the above in a single index? I need to be able to search for a keyword in AttachmentText and get back the AttachmentId and EmailId.

I am using Lucene.Net, but a solution in Lucene Java is absolutely fine.

Thank you in advance.


Solution

  • One approach:

    You can flatten your source data:

    doc1 contains:

    EmailId = 1, AttachmentId = 1, AttachmentText = this is first attachment text.

    doc2 contains:

    EmailId = 1, AttachmentId = 2, AttachmentText = this is second attachment text

    ... and so on.

    This is certainly not the only way to flatten your data. It depends on the types of searches you want to perform, and other ways of flattening may suit those searches better.
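    The flattening step can be sketched in plain Java, without any Lucene dependency. This is only an illustration under assumptions: the class name FlattenEmail and the use of a field-name-to-value map are hypothetical stand-ins. In a real index each map would become one Lucene document, with EmailId, AttachmentId and FileName stored and AttachmentText indexed as analyzed text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: shows the shape of the flattened documents.
class FlattenEmail {
    // Builds one flat document (field name -> value) per attachment,
    // copying the parent EmailId into each one.
    // Each attachment is given as {AttachmentId, FileName, AttachmentText}.
    static List<Map<String, String>> flatten(String emailId, List<String[]> attachments) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String[] a : attachments) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("EmailId", emailId);
            doc.put("AttachmentId", a[0]);
            doc.put("FileName", a[1]);
            doc.put("AttachmentText", a[2]);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String[]> attachments = new ArrayList<>();
        attachments.add(new String[] {"1", "hello.txt", "this is first attachment text"});
        attachments.add(new String[] {"2", "welcome.xlsx", "this is second attachment text"});
        // Prints one line per flattened document.
        for (Map<String, String> doc : flatten("1", attachments)) {
            System.out.println(doc);
        }
    }
}
```

    Because every flattened document carries its own EmailId and AttachmentId, a hit on AttachmentText gives you both identifiers directly.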


    Regarding the comment:

    duplicate EmailId will be returned [w]hen querying...

    Yes - I would say you can de-duplicate the results data (the Lucene doc hits) after running your query. It really depends on what you plan to do with your search results. If you want to display them to a user, then you can convert your "flat" results back into a hierarchy for that purpose.
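    As a rough sketch of that post-processing step, the flat hits can be grouped by EmailId once the query has run. The code below is plain Java and assumes each hit has already been reduced to an {EmailId, AttachmentId} pair read from the stored fields; the class name GroupHits is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rebuilds an email -> attachments hierarchy from flat hits.
class GroupHits {
    // Groups {EmailId, AttachmentId} pairs by EmailId, preserving the order of the hits.
    static Map<String, List<String>> groupByEmail(List<String[]> hits) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] hit : hits) {
            grouped.computeIfAbsent(hit[0], k -> new ArrayList<>()).add(hit[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> hits = new ArrayList<>();
        hits.add(new String[] {"1", "1"});
        hits.add(new String[] {"1", "2"});
        hits.add(new String[] {"7", "3"});
        System.out.println(groupByEmail(hits)); // prints {1=[1, 2], 7=[3]}
    }
}
```

    Each EmailId now appears once as a key, so the duplicates only exist in the raw hit list and not in what you present.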


    One extra point worth adding:

    Some flattening approaches may cause you to have a lot of duplicate indexed data. For example, if you also want to search EmailText and you copy it into every attachment document, the same body text is indexed once per attachment. I would try to avoid that by having two different document structures:

    Document A: fields for searching attachment text (for example EmailId, AttachmentId, FileName, AttachmentText).

    Document B: fields for searching email body text (for example EmailId, EmailText).

    This way, the data in each EmailText is not indexed more than once.

    A single Lucene index can hold documents with different sets of fields. And as above, you can rebuild the hierarchical structure of your original data when presenting the results (if you need/want to do that).
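    A minimal plain-Java sketch of the two document shapes follows. It assumes Document A carries EmailId, AttachmentId and AttachmentText and Document B carries EmailId and EmailText; the class name TwoDocumentShapes and the map representation are hypothetical stand-ins for real Lucene documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: one body document plus one document per attachment.
class TwoDocumentShapes {
    // Each attachment is given as {AttachmentId, AttachmentText}.
    static List<Map<String, String>> buildDocs(String emailId, String emailText,
                                               List<String[]> attachments) {
        List<Map<String, String>> docs = new ArrayList<>();

        // Document B: the email body, added exactly once per email.
        Map<String, String> body = new LinkedHashMap<>();
        body.put("EmailId", emailId);
        body.put("EmailText", emailText);
        docs.add(body);

        // Document A: one per attachment, without a copy of EmailText.
        for (String[] a : attachments) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("EmailId", emailId);
            doc.put("AttachmentId", a[0]);
            doc.put("AttachmentText", a[1]);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String[]> attachments = new ArrayList<>();
        attachments.add(new String[] {"1", "this is first attachment text"});
        attachments.add(new String[] {"2", "this is second attachment text"});
        for (Map<String, String> doc : buildDocs("1", "hello world", attachments)) {
            System.out.println(doc);
        }
    }
}
```

    All three documents would go into the same index; the shared EmailId is what ties an attachment hit back to its email.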

    Another approach would be a more generic structure - something like:

    Document fields:

    Here, only one doc structure is needed.
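    One way such a single generic structure could look is sketched below in plain Java. The field names ContentType, ContentId and Text are hypothetical and are not taken from the field list above, and the substring match is only a naive stand-in for a real Lucene query on an analyzed field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every piece of searchable text uses the same four fields.
class GenericContentDocs {
    // contentType is an assumed discriminator, e.g. "body" or "attachment".
    static Map<String, String> doc(String emailId, String contentType,
                                   String contentId, String text) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("EmailId", emailId);
        d.put("ContentType", contentType);
        d.put("ContentId", contentId);
        d.put("Text", text);
        return d;
    }

    // Naive stand-in for a keyword query: case-insensitive substring match on Text.
    static List<Map<String, String>> search(List<Map<String, String>> docs, String keyword) {
        List<Map<String, String>> hits = new ArrayList<>();
        String needle = keyword.toLowerCase();
        for (Map<String, String> d : docs) {
            if (d.get("Text").toLowerCase().contains(needle)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("1", "body", "1", "hello world"));
        docs.add(doc("1", "attachment", "1", "this is first attachment text"));
        docs.add(doc("1", "attachment", "2", "this is second attachment text"));
        for (Map<String, String> hit : search(docs, "second")) {
            System.out.println(hit.get("EmailId") + " / " + hit.get("ContentType")
                    + " / " + hit.get("ContentId"));
        }
    }
}
```

    With a layout like this, one query covers both body and attachment text, and the type field tells you which kind of content matched.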