I have a Salt 2016.11.3 (Carbon) playground with a master on DigitalOcean and four minions on Azure (three Ubuntu and one Windows).
After a while the Ubuntu minions stop responding to salt -t 30 '*' test.ping, but they are online (I can SSH into them).
Restarting the master (systemctl restart salt-master) or the minions (systemctl restart salt-minion) seems to bring the minions back for a while.
Things checked:
Also, after a restart I get a duplicate response from the re-added nodes, but I think this is a caching problem because it disappears after some time (cache invalidation).
It looks like a communication problem. There is an older bug report from 2013 on the SaltStack GitHub repo, and someone states in the comments that AWS and Azure load balancers don't respect TCP keepalives.
Suggested solutions:
So far, solution #2 works for me.
tcp_keepalive: True
tcp_keepalive_idle: 60
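
For anyone wondering what these two settings mean at the socket level, here is a minimal Python sketch. It is not Salt's code (Salt talks through its own transport layer); it only shows the operating-system options that, as far as I understand, this kind of keepalive setting boils down to. The helper names enable_keepalive and keepalive_settings are made up for illustration, and TCP_KEEPIDLE is assumed to be available only on Linux, so it is guarded.

```python
import socket


def enable_keepalive(sock, idle=60):
    """Turn on TCP keepalive and, where supported, set the idle time (seconds).

    Hypothetical helper for illustration only; not part of Salt.
    """
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE (seconds of idleness before the first probe) is
    # Linux-specific, so skip it on platforms that do not expose it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    return sock


def keepalive_settings(sock):
    """Read back the keepalive options currently set on the socket."""
    settings = {
        "keepalive": bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    }
    if hasattr(socket, "TCP_KEEPIDLE"):
        settings["idle"] = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
    return settings


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s, idle=60)
    print(keepalive_settings(s))
    s.close()
```

The idea is that with an idle time of 60 seconds the connection sends a probe before an intermediate device with a longer idle timeout silently drops it, which would match the symptom of minions that are up but stop answering.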