regex, web-crawler, lucidworks

Configuring LucidWorks Include Paths to only crawl certain file types


I'm trying to configure a LucidWorks web data source to index only certain file types. However, when I set Include paths to .*\.html so that only .html files are crawled (as a simplified example), it ends up indexing only the top-level folder. Crawl depth is set to -1, and when I leave Include paths blank, it crawls the whole sub-tree as expected.

I've looked at their documentation on creating a web data source and on Using Regular Expressions, and can't find a reason why .*\.html would not work, since .* should match any sequence of characters, including the slashes in a sub-folder path.


Solution

  • As I was proofreading the question, I had an idea which was the correct solution. Posting it here for posterity.


    The content being crawled is a file share, so the crawler finds files by following the web server's directory listing pages. Those listing URLs end in / rather than .html, so the .*\.html pattern filtered them out and the crawler never reached the sub-folders. Adding .*/ to the Include paths, alongside .*\.html, fixed the problem.
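
    To illustrate why the directory listings were dropped, here is a minimal Python sketch of the filtering logic. It is not LucidWorks code: the is_included helper and the fileserver.example URLs are made up for the example, and it assumes each include pattern has to match the whole URL, which may not be exactly how LucidWorks applies them.

```python
import re

def is_included(url, include_patterns):
    """Return True if the URL fully matches at least one include pattern.

    Hypothetical helper: assumes whole-URL matching, which may differ
    from what the real crawler does.
    """
    return any(re.fullmatch(p, url) for p in include_patterns)

# Made-up URLs standing in for a file share served via directory listings.
urls = [
    "http://fileserver.example/docs/",
    "http://fileserver.example/docs/guide.html",
    "http://fileserver.example/docs/sub/",
    "http://fileserver.example/docs/sub/notes.html",
    "http://fileserver.example/docs/sub/image.png",
]

html_only = [r".*\.html"]               # the original Include paths
html_and_dirs = [r".*\.html", r".*/"]   # with the directory pattern added

for url in urls:
    print(url, is_included(url, html_only), is_included(url, html_and_dirs))
```

    With only .*\.html, the listing page for docs/sub/ is rejected, so a crawler that discovers files through listings never sees notes.html. With .*/ added, the listing pages pass the filter while image.png is still excluded.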