I'm trying to configure the LucidWorks web data source to index only certain file types. However, when I set Include paths to .*\.html to crawl only .html files (as a simplified example), it ends up indexing only the top-level folder. Crawl depth is set to -1, and when I leave Include paths blank, it crawls the whole sub-tree as expected.
I've looked at their documentation for creating a web data source and for Using Regular Expressions, and can't find a reason why .*\.html would not work, since .* should match any sequence of characters.
While proofreading the question, I had an idea that turned out to be the correct solution. Posting it here for posterity.
The content being crawled is a file share, so the crawler relies on the web server's directory listings to discover files. Those listing URLs end in / rather than .html, so the include filter excluded them and the crawler never descended into the subfolders. Simply adding .*/ to the Include paths fixed the problem.
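
To illustrate why the directory listings were dropped, here is a minimal Python sketch of the filtering logic. It is not LucidWorks code: the URLs are made up, the `allowed` helper is hypothetical, and it assumes that a URL is kept when any include pattern matches the whole URL, which may differ from how the crawler actually applies its patterns.

```python
import re

# Hypothetical URLs for a file share served through directory listings.
urls = [
    "http://fileserver.example/share/",
    "http://fileserver.example/share/index.html",
    "http://fileserver.example/share/docs/",
    "http://fileserver.example/share/docs/guide.html",
    "http://fileserver.example/share/docs/report.pdf",
]

def allowed(url, include_patterns):
    # Assumption: a URL passes if at least one pattern matches the entire URL.
    return any(re.fullmatch(pattern, url) for pattern in include_patterns)

html_only = [r".*\.html"]
html_and_dirs = [r".*\.html", r".*/"]

# With only .*\.html, the directory listing URLs are rejected, so a crawler
# would have no page from which to discover the files in the subfolders.
kept_html_only = [u for u in urls if allowed(u, html_only)]

# Adding .*/ lets the listing URLs through while still excluding the PDF.
kept_with_dirs = [u for u in urls if allowed(u, html_and_dirs)]

for u in kept_with_dirs:
    print(u)
```

In this sketch the second pattern list keeps both directory URLs and both .html files, and still filters out the .pdf file.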