solr apache-zookeeper solrcloud

Load balancing and indexing in SolrCloud


I have some questions regarding SolrCloud:

  1. If I send a request directly to a Solr node that belongs to a Solr cluster, does it delegate the query to the ZooKeeper ensemble to handle it?

  2. I want to have a single URL to send requests to SolrCloud. Is there a better way of achieving this than setting up an external load balancer that balances directly between individual Solr nodes? If 1 isn't true, this approach seems like a bad idea. On top of that, I feel like it would somewhat defeat the purpose of the ZooKeeper ensemble.

  3. There is an option to break up a collection into shards. If I do so, how exactly does SolrCloud decide which document goes to which shard? Is there a need and/or an option to configure this process?

  4. What happens if I send a collection of documents directly to one of the solr nodes? Would the data set somehow distribute itself across the shards evenly? If so, how does it happen?

Thanks a lot!


Solution

    1. ZooKeeper "just" keeps configuration data available for all nodes - i.e. the state of the cluster, etc. It does not get any queries "delegated" to it; it's just a way for Solr nodes and clients to know which collections are handled by which nodes in the cluster, and to have that information stored in a resilient and available manner (i.e. to delegate the hard part of managing a cluster to ZooKeeper). The Solr node that receives a query handles it itself and forwards sub-requests to the other nodes that hold the collection's shards.

    2. The best option is to use a cloud-aware Solr client - it will connect to any of the available ZooKeeper nodes given in its configuration, retrieve the cluster state and connect directly to one of the nodes that has the information it needs (i.e. the collection it needs to query). If you can't do that, you can either load balance with an external load balancer across all nodes in your cluster, or let the client load balance if the client you use supports round robin, etc. An external load balancer gives you other gains (such as being able to remove a node from load balancing for all clients at the same time, or having dedicated HTTP caching in front of the nodes) in exchange for a bit more administration.

    3. It will use the unique id field to decide which shard (and therefore which node) a given document should be routed to. You don't have to configure anything, but you can tell Solr to use a specific field or a specific prefix of a field, etc. as the route key. See Document Routing for specific information. This allows you to make sure that all documents that belong to a specific client/application are placed on the same node (which is important for some calculations and possible operations).

    4. Each document gets routed to the correct node, regardless of which node you send it to. Whether the distribution is even depends on your routing key, but by default, it'll be about as even as you can get it.
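
To illustrate point 2, here is a minimal sketch of what a cloud-aware client does conceptually: look at the cluster state, keep only the live nodes that host the collection, and rotate between them. This is not the code of any real Solr client. The `CLUSTER_STATE` dictionary, its layout, the node URLs and the helper names `candidate_nodes` and `round_robin` are all hypothetical; a real client would obtain the state from ZooKeeper.

```python
import itertools

# Hypothetical snapshot of the cluster state (assumption for illustration).
# A real cloud-aware client reads the equivalent information from ZooKeeper.
CLUSTER_STATE = {
    "live_nodes": {"http://solr1:8983/solr", "http://solr2:8983/solr"},
    "collections": {
        "products": [
            "http://solr1:8983/solr",
            "http://solr2:8983/solr",
            "http://solr3:8983/solr",  # hosts the collection but is not live
        ],
    },
}


def candidate_nodes(state, collection):
    """Return the live nodes that host the collection, in sorted order."""
    hosting = state["collections"].get(collection, [])
    return sorted(node for node in hosting if node in state["live_nodes"])


def round_robin(nodes):
    """Return an endless iterator that cycles through the given nodes."""
    if not nodes:
        raise ValueError("no live node hosts this collection")
    return itertools.cycle(nodes)


picker = round_robin(candidate_nodes(CLUSTER_STATE, "products"))
first_target = next(picker)   # node to use for the first request
second_target = next(picker)  # node to use for the second request
```

An external load balancer does the same rotation, just outside the client and without knowledge of which node hosts which collection.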
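
For points 3 and 4, the idea of hash-based routing can be sketched as follows. This is a simplified model under stated assumptions, not Solr's implementation: it uses SHA-256 as a stand-in hash (Solr uses its own 32-bit hash function), assumes each shard owns an equal slice of the 32-bit hash space, and models the `prefix!id` form of the route key by taking the upper 16 bits of the hash from the prefix and the lower 16 bits from the rest. The function names `h32`, `route_hash` and `shard_for` are made up for the example.

```python
import hashlib

HASH_SPACE = 2 ** 32  # size of the 32-bit hash space shared by all shards


def h32(text):
    """Stand-in 32-bit hash (first 4 bytes of SHA-256); not Solr's hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def route_hash(doc_id):
    """Hash used for routing a document id.

    For ids of the form 'prefix!rest', the upper 16 bits come from the
    prefix and the lower 16 bits from the rest, so ids sharing a prefix
    fall into one contiguous block of the hash space.
    """
    if "!" in doc_id:
        prefix, rest = doc_id.split("!", 1)
        return (h32(prefix) & 0xFFFF0000) | (h32(rest) & 0x0000FFFF)
    return h32(doc_id)


def shard_for(doc_id, num_shards):
    """Map the routing hash onto one of num_shards equal slices.

    Slice boundaries line up with the 16-bit prefix blocks when
    num_shards is a power of two, which this sketch assumes.
    """
    return route_hash(doc_id) * num_shards // HASH_SPACE


# Plain ids spread over the shards according to their hash.
plain_shards = {shard_for(f"doc{i}", 4) for i in range(1000)}

# Ids sharing the 'tenantA!' prefix all land on a single shard.
tenant_shards = {shard_for(f"tenantA!doc{i}", 4) for i in range(1000)}
```

Because the shard is derived only from the document id, it does not matter which node receives the document: every node computes the same target shard and forwards the document there.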