recently I investigate topics around managing tasks by swarm. I don't review swarm source code yet, but it seems that swarm doesn't wait for the containers exit (on worker node) - it changes the task status to shutdown / rejectected (depending on reason to finish the tasks).
Is it (and how) possible to force swarm to really wait to exit of container that do the task?
I run in same time in separate console windows connected to my swarm cluster:
watch -n 1 'docker ps -a --format "{{.Names}}\t{{.State}} \t{{.Status}}" | grep <part_of_the_tasks_name>'
ANDIn next step, I remove label that is requied by the task constaints.
I observed that docker node ps
lost tasks in very short time, but task containers are still running and visible by docker ps
on worker node. (My task can take up to from seconds to many minutes to end dependend on given workload).
When I restore label required by the task constraint, in docker node ps
tasks become visible, but in docker ps
(on worker node) there are more tasks than should be - there are new running tasks and old (also running) to end.
When service is scaled down, swarm task waits for container correctly in remove
desired state.
shutdown
.I found two ways (workarounds) to detect that swarm task (and underlaying container) really finish its work:
Bye!
) I assumed that container stops.After some investigation, I choose second option that checks last line of logs in given duration:
func isStoppedCheckByLogs(
ctx context.Context,
noLogsTimeout time.Duration, // when no logs found in this duration, we assume that task is ended
endMessages []string, // collection of messages like `Bye!` or `Stopped.` - they should be last messages before container stop.
t swarm.Task,
cli *docker.Client) (bool, string) {
unix := time.Now().UTC().Unix()
lastLog, err := cli.TaskLogs(ctx, t.ID, types.ContainerLogsOptions{
ShowStdout: true,
ShowStderr: true,
Since: fmt.Sprintf("%d", unix-int64(noLogsTimeout.Seconds())),
Timestamps: false,
Follow: false,
Tail: "1",
Details: false,
})
if err != nil {
return true, fmt.Sprintf("task logs error: %s", err)
}
defer lastLog.Close()
lastData, err := io.ReadAll(lastLog)
if err != nil {
return true, fmt.Sprintf("read data error: %s", err)
}
lastLogLine := string(lastData)
if len(lastLogLine) == 0 {
return true, "last log line not found or empty"
}
for _, end := range endMessages {
if len(end) == 0 {
continue
}
if strings.Contains(lastLogLine, end) {
return true, fmt.Sprintf("last log line '%s' contains '%s'", lastLogLine, end)
}
}
return false, fmt.Sprintf("last log line: '%s'", lastLogLine)
}