[SOLVED] Configuring LucidWorks Include Paths to only crawl certain file types

I'm trying to configure a LucidWorks web data source to index only certain file types. However, when I set Include paths to .*\.html so that only .html files are crawled (a simplified example), it indexes only the top-level folder. Crawl depth is set to -1, and when I leave Include paths blank, it crawls the whole subtree as expected.

I've looked at their documentation on creating a web data source and on Using Regular Expressions, and can't find a reason why .*\.html would not work, since .* should match any sequence of characters.

Solution

As I was proofreading the question, I had an idea, which turned out to be the correct solution. Posting it here for posterity.

The content being crawled is a file share, so the crawler discovers files through the web server's directory listings. Those listing URLs end in / rather than .html, so the Include paths filter excluded them and the crawler never reached the subfolders. Adding .*/ to the Include paths, alongside .*\.html, fixed the problem.
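
For anyone who wants to see the effect outside LucidWorks, here is a rough Python sketch of the behaviour. It is only a simulation, not LucidWorks code: the URLs and link graph are made up, the helper names (is_included, crawl) are hypothetical, and it assumes an include pattern has to match the whole URL, which is consistent with what I observed but which I have not confirmed in the documentation.

```python
import re

def is_included(url, include_patterns):
    """Return True if the URL fully matches at least one include pattern.

    Assumption: patterns are matched against the whole URL.
    """
    return any(re.fullmatch(p, url) for p in include_patterns)

def crawl(start, links, include_patterns):
    """Breadth-first walk over a made-up link graph.

    Only links that pass the include filter are followed, which is
    the behaviour this sketch assumes for the crawler.
    """
    seen, queue, fetched = {start}, [start], []
    while queue:
        url = queue.pop(0)
        fetched.append(url)
        for child in links.get(url, []):
            if child not in seen and is_included(child, include_patterns):
                seen.add(child)
                queue.append(child)
    return fetched

# Hypothetical file share: each directory listing links to its entries.
START = "http://files.example.com/share/"
LINKS = {
    START: [
        START + "index.html",
        START + "docs/",
    ],
    START + "docs/": [
        START + "docs/guide.html",
        START + "docs/notes.txt",
    ],
}

# Only .html allowed: the docs/ listing is filtered out, so guide.html
# is never discovered.
html_only = crawl(START, LINKS, [r".*\.html"])

# Directory listings allowed too: the crawler can descend into docs/.
with_dirs = crawl(START, LINKS, [r".*\.html", r".*/"])

print(html_only)
print(with_dirs)
```

In this sketch the first crawl stops at the top-level folder, while the second reaches docs/guide.html and still skips notes.txt. Note that the directory listing pages themselves also pass the filter once .*/ is added.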