
ArgoCD: "ssh: handshake failed: read tcp 10.#.3.21:36808->20.#.#.#:22: read: connection reset by peer" and "failed to get git client for repo"


I created an Argo CD Application with two sources. It reports a Sync OK status, but every few seconds the following errors start appearing:

ssh: handshake failed: read tcp 10.254.3.21:36808->20.41.6.26:22: read: connection reset by peer
and failed to get git client for repo

Any suggestions?

project: default
destination:
  server: 'https://kubernetes.default.svc'
  namespace: akv2k8s
syncPolicy:
  automated:
    prune: true
    selfHeal: true
sources:
  - repoURL: 'http://charts.spvapi.no'
    targetRevision: 2.3.2
    helm:
      valueFiles:
        - $values/charts/akv2k8s.yaml
    chart: akv2k8s
  - repoURL: 'git@ssh.##.azure.com:v3/####'
    targetRevision: helm_chart_test
    ref: values

I have already added a repo-cred secret with the SSH key, and it works fine when I use just one repo as the source.


Solution

  • Turns out the root cause is concurrency in the function LsRemote:

    func (m *nativeGitClient) LsRemote(revision string) (res string, err error) {
        for attempt := 0; attempt < maxAttemptsCount; attempt++ {
            res, err = m.lsRemote(revision)
            if err == nil {
                return
            } else if apierrors.IsInternalError(err) || apierrors.IsTimeout(err) || apierrors.IsServerTimeout(err) ||
                apierrors.IsTooManyRequests(err) || utilnet.IsProbableEOF(err) || utilnet.IsConnectionReset(err) {
                // Formula: timeToWait = duration * factor^retry_number
                // Note that timeToWait should equal to duration for the first retry attempt.
                // When timeToWait is more than maxDuration retry should be performed at maxDuration.
                timeToWait := float64(retryDuration) * (math.Pow(float64(factor), float64(attempt)))
                if maxRetryDuration > 0 {
                    timeToWait = math.Min(float64(maxRetryDuration), timeToWait)
                }
                time.Sleep(time.Duration(timeToWait))
            }
        }
        return
    }
    

    It seems that when two requests hit the repo concurrently, one of them fails. This behavior does not start immediately, though: a certain number of concurrent requests succeed first.

    So this looks very much like deliberate throttling by Azure DevOps.

    For now, the resolution is to increase maxAttemptsCount from its default of 1 to 50 by setting the ARGOCD_GIT_ATTEMPTS_COUNT environment variable on the argocd-repo-server.

    I observed the retry count rise to 12 before a request finally succeeded. It still needs to be checked whether this throttling can be controlled. If not, maybe this ArgoCD code could be improved. For example, randomizing the pause between retries may yield better results.