I created an ArgoCD Application with two sources. It got a Synced status, but every few seconds I start getting
ssh: handshake failed: read tcp 10.254.3.21:36808->20.41.6.26:22: read: connection reset by peer
and "failed to get git client for repo" errors. Any suggestions? Here is the relevant part of the Application spec:
project: default
destination:
  server: 'https://kubernetes.default.svc'
  namespace: akv2k8s
syncPolicy:
  automated:
    prune: true
    selfHeal: true
sources:
  - repoURL: 'http://charts.spvapi.no'
    targetRevision: 2.3.2
    helm:
      valueFiles:
        - $values/charts/akv2k8s.yaml
    chart: akv2k8s
  - repoURL: 'git@ssh.##.azure.com:v3/####'
    targetRevision: helm_chart_test
    ref: values
I have already added a repo-creds Secret with the SSH key, and it works fine if I use just one repo as a source.
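For reference, this is roughly what that credential Secret looks like in the declarative repo-creds format (the Secret name, organization, and key below are placeholders, and the argocd namespace is assumed):

apiVersion: v1
kind: Secret
metadata:
  name: azure-devops-ssh-creds   # placeholder name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repo-creds
stringData:
  # Any repo whose URL starts with this prefix will use this key.
  url: git@ssh.dev.azure.com:v3/your-org
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----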
It turns out that the root cause is concurrency in the function LsRemote:
func (m *nativeGitClient) LsRemote(revision string) (res string, err error) {
    for attempt := 0; attempt < maxAttemptsCount; attempt++ {
        res, err = m.lsRemote(revision)
        if err == nil {
            return
        } else if apierrors.IsInternalError(err) || apierrors.IsTimeout(err) || apierrors.IsServerTimeout(err) ||
            apierrors.IsTooManyRequests(err) || utilnet.IsProbableEOF(err) || utilnet.IsConnectionReset(err) {
            // Formula: timeToWait = duration * factor^retry_number
            // Note that timeToWait should equal to duration for the first retry attempt.
            // When timeToWait is more than maxDuration retry should be performed at maxDuration.
            timeToWait := float64(retryDuration) * (math.Pow(float64(factor), float64(attempt)))
            if maxRetryDuration > 0 {
                timeToWait = math.Min(float64(maxRetryDuration), timeToWait)
            }
            time.Sleep(time.Duration(timeToWait))
        }
    }
    return
}
It seems that when two requests hit the repo concurrently, one of them fails. This behavior does not start immediately, though; some number of concurrent requests succeed first. So this looks very much like deliberate throttling by Azure DevOps.
For now, the resolution is to increase maxAttemptsCount from its default of 1 to 50 by setting the ARGOCD_GIT_ATTEMPTS_COUNT environment variable.
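Concretely, the variable has to reach the component that runs the git client, which is the repo server. A minimal sketch of the change, assuming the standard argocd-repo-server Deployment (only the relevant excerpt is shown):

# Excerpt of the argocd-repo-server Deployment
spec:
  template:
    spec:
      containers:
        - name: argocd-repo-server
          env:
            - name: ARGOCD_GIT_ATTEMPTS_COUNT
              value: "50"   # default is 1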
I observed the retry count rise to 12 before it finally succeeded. It still needs to be checked whether this throttling can be controlled on the Azure DevOps side. If not, maybe this ArgoCD code could be improved; for example, randomizing the pause between retries may yield better results, as sketched below.
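As an illustration of the jitter idea, here is a hypothetical helper, not current ArgoCD code: it computes the same exponential backoff as LsRemote, but returns a random duration in [0, timeToWait) ("full jitter") so that concurrent retries do not all fire at the same instant. The parameter types are assumptions based on the snippet above.

package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

// jitteredBackoff computes the same exponential backoff as LsRemote, caps it
// at maxRetryDuration, and then picks a random wait in [0, timeToWait).
func jitteredBackoff(retryDuration, maxRetryDuration time.Duration, factor int64, attempt int) time.Duration {
    timeToWait := float64(retryDuration) * math.Pow(float64(factor), float64(attempt))
    if maxRetryDuration > 0 {
        timeToWait = math.Min(float64(maxRetryDuration), timeToWait)
    }
    return time.Duration(rand.Float64() * timeToWait)
}

func main() {
    // Print a few sample waits for base 1s, cap 10s, factor 2.
    for attempt := 0; attempt < 3; attempt++ {
        fmt.Println(jitteredBackoff(time.Second, 10*time.Second, 2, attempt))
    }
}

The call site in LsRemote would then be time.Sleep(jitteredBackoff(retryDuration, maxRetryDuration, factor, attempt)) instead of sleeping for the full computed value.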