java solr manifoldcf

Best way to crawl through a file system and index it


I am working on a project where I need to crawl through more than 10 TB of data and index it. The crawl needs to be incremental, so that repeat runs take less time than a full crawl.

My question is: which tool is best suited for this with Java, ideally one that large organizations already use?

I have been trying this with Solr and ManifoldCF, but ManifoldCF has very little documentation online.


Solution

  • We ended up using SolrJ (Java) and Apache ManifoldCF. Although there was little to no documentation for ManifoldCF, we subscribed to the newsletter and put our questions to the developers, who responded quickly. However, I would not recommend this setup to anyone: we found Apache ManifoldCF outdated and poorly built, so it is better to look for alternatives. Hope this helps somebody.
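
For anyone looking at alternatives, the incremental part of the crawl can be sketched in plain Java without any crawler framework. This is only a minimal illustration, not what ManifoldCF does internally: the class name `IncrementalCrawler`, the method `changedFiles`, and the `seen` map (path to last-modified time from the previous run) are all hypothetical names, and it assumes that a file's last-modified timestamp is a good enough signal that it changed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper: finds files that are new or modified since the last run.
final class IncrementalCrawler {

    /**
     * Walks the tree under root and returns the regular files whose
     * last-modified time differs from the value recorded in seen
     * (or that are not recorded at all). The seen map is updated in place,
     * so passing the same map to the next run skips unchanged files.
     */
    static List<Path> changedFiles(Path root, Map<String, Long> seen) throws IOException {
        List<Path> changed = new ArrayList<>();
        // Files.walk is lazy, so the whole tree is never held in memory at once.
        try (Stream<Path> paths = Files.walk(root)) {
            Iterator<Path> it = paths.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                Path file = it.next();
                long modified = Files.getLastModifiedTime(file).toMillis();
                // put returns the previously recorded timestamp, or null for a new file.
                Long previous = seen.put(file.toString(), modified);
                if (previous == null || previous != modified) {
                    changed.add(file);
                }
            }
        }
        return changed;
    }
}
```

Only the paths returned by `changedFiles` would then be handed to the indexer (SolrJ in our case). For a real 10 TB crawl the `seen` map would have to be persisted between runs, for example in a database, and this sketch does not detect deleted files, so those would need separate handling.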