solr solrcloud log-analysis lucidworks

Why are time-series logs in Solr stored in separate collections per time period instead of separate shards per time period?


In Lucidworks Time Based Partitioning and in Large Scale Log Analytics with Solr, multiple Solr "collections" are created, partitioned by time.

My questions are:

  1. Why not just create multiple shards based on time in such cases?
  2. With multiple collections, how would a query spanning several collections (time ranges) be done?

Solution

  • There is not much difference between multiple shards with implicit routing and multiple collections. When you issue a query, you can (optionally) specify which shards or which collections to search.

    Alternatively, you can set up an alias containing multiple collections, thus hiding the logistics from the search client. This makes it easy to create custom views over the full data set, such as an alias for each year, one for everything, and one for the last quarter. If you later decide to slice your data differently, e.g. one collection per week instead of one per month, the change will be transparent to the client application. Aliases do not work for shards, so that is one reason to prefer collections.
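
As a rough illustration of the options above, here is a minimal Python sketch that only builds the request URLs (it sends nothing). The base URL, the collection names such as `logs_2015_01`, the shard name, and the helper functions `select_url` and `create_alias_url` are assumptions made up for the example. The `collection` and `shards` query parameters and the Collections API `CREATEALIAS` action are standard SolrCloud features, but check the exact behaviour against the documentation for your Solr version.

```python
from urllib.parse import urlencode

# Assumed base URL of a SolrCloud node; adjust for your cluster.
SOLR = "http://localhost:8983/solr"


def select_url(target, query, collections=None, shards=None):
    """Build a /select URL against a collection (or alias) named `target`.

    `collections` limits the search to the listed collections,
    `shards` limits it to the listed logical shards.
    """
    params = {"q": query}
    if collections:
        params["collection"] = ",".join(collections)
    if shards:
        params["shards"] = ",".join(shards)
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return f"{SOLR}/{target}/select?{urlencode(params, safe=',:*')}"


def create_alias_url(alias, collections):
    """Build a Collections API URL that (re)defines an alias."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{SOLR}/admin/collections?{urlencode(params, safe=',')}"


# Hypothetical monthly collections.
months = ["logs_2015_01", "logs_2015_02", "logs_2015_03"]

# Option 1: name the collections explicitly in the query.
print(select_url(months[0], "*:*", collections=months))

# Option 2: define an alias once, then query it like a single collection.
print(create_alias_url("logs_2015_q1", months))
print(select_url("logs_2015_q1", "level:ERROR"))

# Shard-based variant: one collection, time-named shards (implicit routing).
print(select_url("logs", "*:*", shards=["shard_2015_01"]))
```

With the alias approach the client only ever needs to know `logs_2015_q1`; re-pointing the alias at weekly collections later would mean issuing another `CREATEALIAS` call, not changing client queries.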