I am very new to both Apache Solr and Carrot2. I am trying to index a lot of input files using Solr. The end goal is to cluster the documents.
It is not clear to me whether the clustering is done by Solr or by the Carrot2 Workbench.
Can anyone guide me in this?
It can be done both ways.
In one setup, Carrot2 Workbench fetches search results from Solr (just as it would from any other search engine) and clusters them. This route is probably the easiest to start with: you just need to provide the URL of the Solr service and the names of the fields that supply the content for clustering.
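To make "fetching search results from Solr" concrete, here is a minimal sketch of the kind of HTTP query an external client would send. The core name `mycore`, the field names `title` and `content`, and the unique key `id` are assumptions for illustration; substitute whatever your schema defines.

```python
from urllib.parse import urlencode


def build_select_url(solr_base, query, title_field, content_field, rows=100):
    """Build a Solr select URL that returns only the fields used for clustering."""
    params = {
        "q": query,
        # "id" is assumed to be the unique key field of the schema.
        "fl": f"id,{title_field},{content_field}",
        "rows": rows,
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_select_url("http://localhost:8983/solr/mycore", "*:*", "title", "content")
print(url)
```

The fields listed in `fl` must be stored in the index, because the clustering works on their text.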
Alternatively, you can configure the search results clustering plugin in Solr, which performs clustering inside your Solr server and includes the resulting clusters as part of the Solr search response.
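With the plugin route, the client only has to read the clusters out of the response. The sketch below assumes the JSON response carries a top-level `clusters` list whose entries have `labels` and `docs` (document ids), which is the shape I have seen from Solr's clustering component; check it against your Solr version. The sample response is made up for illustration.

```python
def extract_clusters(response):
    """Return (label, doc_ids) pairs from a parsed Solr JSON response."""
    clusters = []
    # Assumed layout: {"clusters": [{"labels": [...], "docs": [...]}, ...]}
    for cluster in response.get("clusters", []):
        label = ", ".join(cluster.get("labels", []))
        clusters.append((label, cluster.get("docs", [])))
    return clusters


# Made-up response, for illustration only.
sample_response = {
    "response": {"numFound": 3, "docs": []},
    "clusters": [
        {"labels": ["Data Mining"], "docs": ["doc1", "doc3"]},
        {"labels": ["Other Topics"], "docs": ["doc2"]},
    ],
}

for label, doc_ids in extract_clusters(sample_response):
    print(label, doc_ids)
```

Whether clusters appear at all depends on how the request handler is set up in your solrconfig.xml, for example whether the request has to pass `clustering=true`.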
In both cases clustering is applied to the stored content of the documents (raw text), so there's not much performance benefit from having the documents clustered inside Solr, apart perhaps from reducing the serialization/deserialization overhead.
Finally, there is a somewhat outdated document clarifying the two Carrot2-Solr integration strategies.