xhtmlindexingdocumentpayloadclucene

Index XML attributes along with plain text in CLucene


I've been able to compile the CLucene project on iOS and am currently trying to use it within my iOS application. I'm trying to index xhtml documents, and have been able to do that by extracting the text nodes out of those documents, and then index in lucene by concatenating them together, so as to all the text from one xhtml doc appears in a single Lucene document.

However, each text node of my xhtml document has custom attributes to it, so that when a search is made on the indexed text, I should be also be able to obtain the attribute associated with that text.

My xml data looks like:

<span data-value="/1/2/3">This is a sample text for this span</span>
<span data-value="/2/3/4">This is a example text for another span</span>
<span data-value="/3/4/5">Searching for this span text</span>

So when I search from the Lucene index for a word sample, then I should be able to retrieve the data-value attributed to the word Sample is associated. In the case above it will be data-value="/1/2/3".

The way I'm creating an index is by concatenating the data-value attribute and the text node field together, and then have it indexed by Lucene. This way, whenever my search results return, then it would also return the data-value attributed along with it. I can evaluate the attribute value, and at the time of searching will strip away this attribute from the display results altogether. However, this is not true for large text contained in a span text, where in the searched word(s) may be returned but the data-value attributes may not be part of the search results, which can further be stripped off while display.

However, I think this is not the optimal way of indexing XML attributes along with their text data.

I'd appreciate if someone can help me with the approach in order to index the relationship between the text and its attributes.

Update: I found that the tokens generated from the text can have payloads associated with them, so I'm thinking that if we can have the XML attribute built in as a payload for my entire string, which can be treated as a single token (if I don't analyze the text), can be useful for my purpose. I wanted to know if anyone can help me in figuring out if this is the right approach for my case. Many thanks for your help.

Thanks & regards, Asheesh


Solution

  • If you want to keep all of XHTML text as one Lucene document, then payloads are probably the way to go.

    An alternate approach is to create a document ID field (like "documentID:42" and a field denoting that this Lucene document is the whole document concatenated together (like "AllOfDocument:42"). This would let you index each text node individually and limit the attributes to just the attributes for that node, while still tying that text node to whole document. With that approach, you could put the attributes in their own field in the text node Lucene document rather than having to use payloads. Might be simpler.