
Isn't Chronos a centralized scheduler?


Why is Chronos described as a distributed and fault-tolerant scheduler? As I understand it, there is only one scheduler instance running that manages the job schedules.

According to the Chronos documentation, the scheduler's main loop is internally quite simple.

The pattern is as follows:

  1. Chronos reads all job state from the state store (ZooKeeper).
  2. Jobs are registered within the scheduler and loaded into the job graph for tracking dependencies.
  3. Jobs are separated into a list of those which should be run at the current time (based on the clock of the host machine), and those which should not.
  4. Jobs in the list of jobs to run are queued, and will be launched as soon as a sufficient offer becomes available.
  5. Chronos will sleep until the next job is scheduled to run, and begin again from step 1.
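
To make the five steps concrete, here is a minimal sketch of a single pass of such a loop. It is not Chronos code: the `Job` class, the dict standing in for the ZooKeeper state store, and all function names are hypothetical, and the job graph for dependencies is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    next_run: float  # time (in seconds) at which the job is due


def load_jobs(state_store: Dict[str, float]) -> List[Job]:
    """Steps 1-2: read every job from the stand-in state store (a plain dict here)."""
    return [Job(name, next_run) for name, next_run in sorted(state_store.items())]


def partition_jobs(jobs: List[Job], now: float) -> Tuple[List[Job], List[Job]]:
    """Step 3: split jobs into those due at `now` and those not yet due."""
    due = [job for job in jobs if job.next_run <= now]
    pending = [job for job in jobs if job.next_run > now]
    return due, pending


def seconds_until_next(pending: List[Job], now: float) -> float:
    """Step 5: how long to sleep before the next pass (0.0 if nothing is pending)."""
    if not pending:
        return 0.0
    return min(job.next_run for job in pending) - now


def run_one_pass(state_store: Dict[str, float], now: float) -> Tuple[List[str], float]:
    """One iteration of the loop: returns the launch queue and the sleep time."""
    jobs = load_jobs(state_store)
    due, pending = partition_jobs(jobs, now)
    launch_queue = [job.name for job in due]  # step 4: wait here for a sufficient offer
    return launch_queue, seconds_until_next(pending, now)
```

Calling `run_one_pass({"backup": 100.0, "report": 160.0}, now=100.0)` queues `backup` and reports 60 seconds until the next pass. Note that every pass starts by reloading state from the store, which is what lets a different instance pick up the same schedule.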

Could someone with experience clarify?


Solution

  • You can run Chronos as a single node (which is the setup you describe), but Chronos is designed to run as multiple nodes, each on a different host, achieving HA via a ZooKeeper quorum. This follows the standard leader/follower model, where only the leader is active and the followers redirect traffic to the leader. Many open source frameworks, including Mesos, consider this to be HA, as seen here.

    Leader abdication or failure can occur, which is where ZooKeeper comes in: after the leader fails, Chronos holds a leader election, assuming quorum was established and maintained before the failure.

    For references on multi-node setups, see here and here.

    How leader election is specified: JobSchedulerElectionSpec.scala

    Leader redirection: RedirectFilter.scala
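
The leader/follower behaviour described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: it does not use the ZooKeeper API or Chronos's actual election code, it does not model quorum, and `ToyElection`, `handle_request`, and the host names are all hypothetical. It borrows the common "lowest sequence number wins" idea to choose a leader.

```python
from typing import Dict, Optional, Tuple


class ToyElection:
    """In-memory stand-in for an election: the longest-lived member leads."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._members: Dict[str, int] = {}  # host -> sequence number at join time

    def join(self, host: str) -> None:
        """Register a scheduler instance; later joiners get higher numbers."""
        self._members[host] = self._next_seq
        self._next_seq += 1

    def fail(self, host: str) -> None:
        """Simulate a crash or abdication: the member's registration disappears."""
        self._members.pop(host, None)

    def leader(self) -> Optional[str]:
        """The live member with the lowest sequence number, or None if empty."""
        if not self._members:
            return None
        return min(self._members, key=self._members.get)


def handle_request(election: ToyElection, host: str) -> Tuple[str, Optional[str]]:
    """Leader serves the request; any follower redirects to the current leader."""
    leader = election.leader()
    if leader is None:
        return ("unavailable", None)
    if host == leader:
        return ("serve", host)
    return ("redirect", leader)


election = ToyElection()
for node in ("chronos-1", "chronos-2", "chronos-3"):
    election.join(node)

before = handle_request(election, "chronos-2")  # follower redirects to chronos-1
election.fail("chronos-1")                      # leader goes away
after = handle_request(election, "chronos-2")   # chronos-2 now serves as leader
```

So there is still only one active scheduler at a time, as the question assumes, but which instance holds that role can change, and that is where the fault tolerance comes from.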