Tags: sharepoint-2007, properties, indexing, web-crawler, ifilter

Programmatically generate additional properties during SharePoint crawl


Is it possible to hook into the MOSS 2007 crawl process and programmatically populate a metadata property as the content is being indexed?

The reason I need to do this at crawl time is that the content is coming from outside SharePoint (from a file share) and so I can't add the metadata directly to the documents themselves. There's a wide variety of different document types, so a custom IFilter is not an option either.


Solution

  • You could try using a custom protocol handler. This allows you to apply metadata to files regardless of their type. Pair this with a custom content source, and you can target a specific network share or set of shares.

    The material on protocol handlers (and property handlers) is usually found wherever file filter (IFilter) development is covered, but don't let that put you off. The book below covers the difference pretty well.

    The Microsoft Windows Search 3.x SDK is a decent place to start. It has a sample IFilter implementation that captures properties from an XML file.

    A book I've found helpful is "Inside the Index and Search Engines: Microsoft Office SharePoint Server 2007" by Patrick Tisseghem and Lars Fastrup. Chapter 9 discusses the implementation and deployment of a custom filter, a protocol handler, and even a content source. Its protocol handler example shows how to capture metadata, e.g. the modified date, while crawling a file system. By also defining a custom content source, you can capture file metadata regardless of file type, which addresses your point about having lots of different document types to capture properties from.

    I found this forum/blog post on IFilter development pretty good. It has several links to other resources.

    This MSDN article on writing a filter for SharePoint is frequently mentioned and explains the individual aspects better, but the book I mentioned covers a broader range of topics, including the protocol handler.

    MSDN has a good overview of the indexing process.
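
To make the idea in the answer above concrete, here is a minimal conceptual sketch of what "metadata regardless of file type" means: properties are derived from the file's location and file system attributes, never from its contents. This is not the MOSS API. A real protocol handler is a native COM component registered with the search service, and this Python sketch only mimics the logic such a handler would carry. The property names (`Department`, `SizeBytes`, and so on) and the rule that the first folder under the share root names a department are assumptions made up for illustration.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def derive_properties(path: Path, share_root: Path) -> dict:
    """Build a property bag for one file without opening or parsing it."""
    stat = path.stat()
    rel = path.relative_to(share_root)
    return {
        "FileName": path.name,
        "Extension": path.suffix.lower(),
        # Hypothetical rule: the first folder under the share root is the
        # department; files directly in the root get an empty value.
        "Department": rel.parts[0] if len(rel.parts) > 1 else "",
        "SizeBytes": stat.st_size,
        "Modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def crawl(share_root):
    """Yield (path, properties) for every file under share_root."""
    root = Path(share_root)
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            file_path = Path(dirpath) / name
            yield file_path, derive_properties(file_path, root)
```

Because nothing here inspects the file body, the same code path handles every document type on the share, which is the property that makes a protocol handler a better fit than a per-format IFilter for this question.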