
Apache Mesos/Chronos task status is not updated and stays stuck in RUNNING


I am using Mesos 1.3.1 and Chronos on my local machine. For testing, I currently have 100 jobs scheduled to run every 30 minutes.

Sometimes a task gets stuck in RUNNING status forever, until I restart the Mesos agent on which the task is stuck. No agent was restarted while the task was running.

I have tried to KILL the task, but its status never changes to KILLED, even though the Chronos logs say the request was received successfully. In Chronos, the task is marked as successful and its end time is correct, but its duration keeps counting and the task is still in the RUNNING state.

The executor container also keeps running forever for the tasks that are stuck. My executor container just sleeps for 20 seconds, and I have set offer_timeout to 30 seconds and executor_registration_timeout to 2 minutes.

I have also added Mesos reconciliation every 10 minutes, but it reports the task as RUNNING every time.

I have also tried forcing the task status to FINISHED before the reconciliation, but it still does not get updated to FINISHED. It seems that the Mesos leader is not picking up the correct status for the stuck task.
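To make the symptom easier to spot, here is a minimal sketch that flags tasks which have been RUNNING for longer than expected. It assumes the JSON comes from the Mesos master's `/tasks` endpoint and that each task carries `id`, `state`, and a `statuses` list with `state` and `timestamp` fields; the function name `find_stuck_tasks`, the threshold, and the sample payload are hypothetical and only for illustration.

```python
import time


def find_stuck_tasks(tasks_payload, max_running_seconds, now=None):
    """Return IDs of tasks that have been in TASK_RUNNING for too long.

    tasks_payload is assumed to be the parsed JSON body of the master's
    /tasks endpoint (an assumption; adjust the field names if yours differ).
    """
    now = time.time() if now is None else now
    stuck = []
    for task in tasks_payload.get("tasks", []):
        if task.get("state") != "TASK_RUNNING":
            continue
        # Collect the timestamps of every TASK_RUNNING status update.
        running_since = [
            status["timestamp"]
            for status in task.get("statuses", [])
            if status.get("state") == "TASK_RUNNING" and "timestamp" in status
        ]
        if not running_since:
            continue
        # Measure from the earliest RUNNING update we know about.
        if now - min(running_since) > max_running_seconds:
            stuck.append(task["id"])
    return stuck


# Hypothetical sample payload, not real output from a cluster.
SAMPLE_PAYLOAD = {
    "tasks": [
        {"id": "ct:job-a", "state": "TASK_RUNNING",
         "statuses": [{"state": "TASK_RUNNING", "timestamp": 100.0}]},
        {"id": "ct:job-b", "state": "TASK_RUNNING",
         "statuses": [{"state": "TASK_RUNNING", "timestamp": 950.0}]},
        {"id": "ct:job-c", "state": "TASK_FINISHED",
         "statuses": [{"state": "TASK_RUNNING", "timestamp": 100.0},
                      {"state": "TASK_FINISHED", "timestamp": 120.0}]},
    ]
}

if __name__ == "__main__":
    # The executor only sleeps for 20 seconds, so 120 seconds is a generous cutoff.
    print(find_stuck_tasks(SAMPLE_PAYLOAD, max_running_seconds=120, now=1000.0))
```

In practice the payload would be fetched from the leading master and parsed with `json.loads`; the sketch only reports candidates and does not change any task state.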

I have tried running with different task resource allocations (cpu: 0.5, 0.75, 1, ...), but that does not solve the issue. I reduced the number of jobs to 70 every 30 minutes, but it still happens. The issue shows up about once per day, seemingly at random, and can affect any job.

How can I remove this stuck task from the active tasks without restarting the Mesos agent? Is there a way to prevent this issue from happening?


Solution

  • There is currently a known issue in Docker on Linux where the container's process has exited but the Docker container is still shown as running: https://github.com/docker/for-linux/issues/779

    Because of this, the executor containers stay stuck in the running state, and Mesos is unable to update the task status.

    My issue was similar to this: https://issues.apache.org/jira/browse/MESOS-9501?focusedCommentId=16806707&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16806707

    A workaround for this has been applied in Mesos versions after 1.4.3. After upgrading Mesos, the issue no longer occurs.
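To check whether a given executor container is in this state, one rough approach is to compare what Docker reports with what the kernel reports. The sketch below assumes you pass it one entry from the parsed JSON output of `docker inspect`, that the entry has `State.Running` and `State.Pid` fields, and that the check runs on the agent host itself (not inside a container) so that `/proc` reflects host PIDs. The names `process_exists` and `looks_stuck` are hypothetical.

```python
import os


def process_exists(pid):
    """Return True if /proc/<pid> exists (Linux only, host PID namespace)."""
    return pid > 0 and os.path.exists(f"/proc/{pid}")


def looks_stuck(inspect_entry, pid_exists=process_exists):
    """Return True if Docker says the container is running but its PID is gone.

    inspect_entry is assumed to be one element of the JSON array printed by
    `docker inspect <container>`.
    """
    state = inspect_entry.get("State", {})
    if not state.get("Running", False):
        # Docker already knows the container has stopped.
        return False
    # Running according to Docker, but no such process on the host.
    return not pid_exists(state.get("Pid", 0))
```

The `pid_exists` parameter is injectable only so the logic can be exercised without a real container; a container flagged here is a candidate for closer inspection, not proof of the Docker issue.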