docker · containers · task · swarm

Swarm task doesn't wait for container exit (when constraints aren't met or on update)


Recently I have been investigating how Swarm manages tasks. I haven't reviewed the Swarm source code yet, but it seems that Swarm doesn't wait for the container to exit (on the worker node) - it just changes the task status to shutdown / rejected (depending on the reason the task was finished).

Is it possible (and if so, how) to force Swarm to really wait for the exit of the container that runs the task?

I ran the following at the same time, in separate console windows connected to my Swarm cluster:

  1. on worker NODE_XYZ: `watch -n 1 'docker ps -a --format "{{.Names}}\t{{.State}}\t{{.Status}}" | grep <part_of_the_tasks_name>'` AND
  2. on manager node: `watch -n 1 'docker node ps NODE_XYZ --format "{{.Name}}\t{{.DesiredState}}\t{{.CurrentState}}" | grep <part_of_the_tasks_name>'`

Next, I removed a label that is required by the task constraints.

I observed that `docker node ps` lost the tasks within a very short time, but the task containers were still running and visible via `docker ps` on the worker node. (My tasks can take anywhere from seconds to many minutes to end, depending on the given workload.)

When I restored the label required by the task constraint, the tasks became visible again in `docker node ps`, but `docker ps` (on the worker node) showed more tasks than there should be - the newly started tasks plus the old ones, still running to completion.

When the service is scaled down, the Swarm task correctly waits for the container while in the `remove` desired state.
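(This can be reproduced with e.g. `docker service scale <service_name>=0` while the two `watch` windows above are running: the tasks stay listed with the `Remove` desired state until their containers actually exit.)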

  1. I expect a Swarm task to have a status, a code, or something similar that allows determining whether the task is still waiting for its container to exit when the desired state is `shutdown` (see the sketch after this list).
  2. I expect Swarm not to create new containers, and to respect cpu/memory reservations, while the old tasks are still running.
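
For reference, the Go SDK (github.com/docker/docker/client) does expose a task state, a desired state and a container exit code on the task object. A minimal sketch of reading them (`printTaskStatus` is my own name; `ContainerStatus` is a pointer in recent SDK versions) - the problem is that, as observed above, the state flips to shutdown before the container really exits:

    import (
        "context"
        "fmt"
    
        docker "github.com/docker/docker/client"
    )
    
    // printTaskStatus dumps the task fields I hoped would reflect the real
    // state of the container on the worker node.
    func printTaskStatus(ctx context.Context, cli *docker.Client, taskID string) error {
        t, _, err := cli.TaskInspectWithRaw(ctx, taskID)
        if err != nil {
            return err
        }
    
        fmt.Printf("desired=%s current=%s message=%q\n", t.DesiredState, t.Status.State, t.Status.Message)
    
        // ContainerStatus is a pointer in recent SDK versions.
        if t.Status.ContainerStatus != nil {
            fmt.Printf("container=%s exit=%d\n", t.Status.ContainerStatus.ContainerID, t.Status.ContainerStatus.ExitCode)
        }
    
        return nil
    }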

Solution

  • I found two ways (workarounds) to detect that a Swarm task (and its underlying container) has really finished its work:

    1. Run an agent on every Swarm node to detect the state of the container (a sketch follows this list), OR
    2. Analyze the logs from the last n minutes/seconds (depending on the workload) - when there are no logs, or the last log line contains a given phrase (like `Bye!`), I assume that the container has stopped.
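
    A minimal sketch of the first option, assuming the agent runs next to each node's local Docker socket and relies on the standard `com.docker.swarm.task.id` label that Swarm puts on task containers (`checkTaskContainer` is my own name):

    import (
        "context"
    
        "github.com/docker/docker/api/types"
        "github.com/docker/docker/api/types/filters"
        docker "github.com/docker/docker/client"
    )
    
    // checkTaskContainer asks the node-local daemon (not the manager) whether
    // the container backing the given task is still running on this node.
    func checkTaskContainer(ctx context.Context, local *docker.Client, taskID string) (bool, error) {
        // Swarm labels every task container with its task ID.
        f := filters.NewArgs(filters.Arg("label", "com.docker.swarm.task.id="+taskID))
    
        containers, err := local.ContainerList(ctx, types.ContainerListOptions{All: true, Filters: f})
        if err != nil {
            return false, err
        }
    
        for _, c := range containers {
            if c.State == "running" {
                return true, nil
            }
        }
    
        return false, nil
    }

    The trade-off is that the agent itself has to be deployed and kept running on every node.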

    After some investigation, I chose the second option, which checks the last line of the logs within a given duration:

    // Imports needed by this snippet (docker is github.com/docker/docker/client):
    import (
        "context"
        "fmt"
        "io"
        "strings"
        "time"
    
        "github.com/docker/docker/api/types"
        "github.com/docker/docker/api/types/swarm"
        docker "github.com/docker/docker/client"
    )
    
    // isStoppedCheckByLogs reports whether the container behind the given task
    // appears to have stopped, judging only by its recent logs, together with
    // a human-readable reason.
    func isStoppedCheckByLogs(
        ctx context.Context,
        noLogsTimeout time.Duration, // when no logs are found within this duration, we assume the task has ended
        endMessages []string,        // messages like `Bye!` or `Stopped.` - they should be the last messages before the container stops
        t swarm.Task,
        cli *docker.Client) (bool, string) {
    
        unix := time.Now().UTC().Unix()
    
        // Ask only for the last log line emitted within the last noLogsTimeout.
        lastLog, err := cli.TaskLogs(ctx, t.ID, types.ContainerLogsOptions{
            ShowStdout: true,
            ShowStderr: true,
            Since:      fmt.Sprintf("%d", unix-int64(noLogsTimeout.Seconds())),
            Timestamps: false,
            Follow:     false,
            Tail:       "1",
            Details:    false,
        })
        if err != nil {
            return true, fmt.Sprintf("task logs error: %s", err)
        }
    
        defer lastLog.Close()
    
        lastData, err := io.ReadAll(lastLog)
        if err != nil {
            return true, fmt.Sprintf("read data error: %s", err)
        }
    
        // Note: for non-TTY services this stream is multiplexed (stdcopy), so
        // the line may carry an 8-byte header; strings.Contains still works.
        lastLogLine := string(lastData)
    
        // No output at all within the window - assume the container has ended.
        if len(lastLogLine) == 0 {
            return true, "last log line not found or empty"
        }
    
        // The container announced its own shutdown.
        for _, end := range endMessages {
            if len(end) == 0 {
                continue
            }
    
            if strings.Contains(lastLogLine, end) {
                return true, fmt.Sprintf("last log line '%s' contains '%s'", lastLogLine, end)
            }
        }
    
        return false, fmt.Sprintf("last log line: '%s'", lastLogLine)
    }
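
    A usage sketch under the same assumptions: poll the tasks of a service whose desired state is already shutdown (the `service` and `desired-state` task filters are part of the standard API; `waitForShutdownTasks` and the timeouts are my own) and treat the old tasks as gone only when the log check agrees for all of them:

    import (
        "context"
        "log"
        "time"
    
        "github.com/docker/docker/api/types"
        "github.com/docker/docker/api/types/filters"
        docker "github.com/docker/docker/client"
    )
    
    // waitForShutdownTasks blocks until every task of the service that Swarm
    // already marked with desired state shutdown looks stopped by its logs.
    func waitForShutdownTasks(ctx context.Context, cli *docker.Client, serviceName string) error {
        f := filters.NewArgs(
            filters.Arg("service", serviceName),
            filters.Arg("desired-state", "shutdown"),
        )
    
        for {
            tasks, err := cli.TaskList(ctx, types.TaskListOptions{Filters: f})
            if err != nil {
                return err
            }
    
            allStopped := true
            for _, t := range tasks {
                stopped, reason := isStoppedCheckByLogs(ctx, 30*time.Second, []string{"Bye!"}, t, cli)
                log.Printf("task %s: stopped=%v (%s)", t.ID, stopped, reason)
                if !stopped {
                    allStopped = false
                }
            }
    
            if allStopped {
                return nil
            }
    
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(5 * time.Second):
            }
        }
    }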