I've been configuring HTTP healthchecks for all my apps in Marathon, and they are working nicely. The trouble is that Marathon will keep stepping in and restarting a container that fails its healthcheck, and I won't know unless I happen to be looking at the Marathon UI.
Is there a way to retrieve all apps that have a failed healthcheck so I can send an email alert or similar?
Marathon exposes information about failing healthchecks on its event bus, so you can write a simple service that consumes Marathon's health check events ("eventType": "instance_health_changed_event") and translates them into a metric, an alert, you name it.
For reference, I can recommend allegro/appcop, a service that scales down unhealthy applications. Its code could easily be altered to do what you want.
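If you would rather start from scratch, here is a minimal sketch of such a consumer in Python. It assumes Marathon's server-sent event stream is reachable at `/v2/events` on `localhost:8080`, and that the event payload carries `runSpecId`, `instanceId` and a boolean `healthy` field; check those names against the events your Marathon version actually emits. The function names are my own, and the `print` is a placeholder for whatever email or alerting hook you use.

```python
import json
import urllib.request

# Assumption: adjust host and port to wherever your Marathon is listening.
MARATHON_EVENTS_URL = "http://localhost:8080/v2/events"


def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip())
        elif line == "" and data_parts:
            # A blank line terminates one server-sent event.
            payload = "\n".join(data_parts)
            data_parts = []
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip anything that is not valid JSON
            yield event


def unhealthy_instances(events):
    """Yield (app id, instance id) for instances reported as unhealthy."""
    for event in events:
        if (event.get("eventType") == "instance_health_changed_event"
                and event.get("healthy") is False):
            # Field names are assumed from the event payload; verify them.
            yield event.get("runSpecId"), event.get("instanceId")


def main():
    request = urllib.request.Request(
        MARATHON_EVENTS_URL, headers={"Accept": "text/event-stream"}
    )
    with urllib.request.urlopen(request) as response:
        lines = (raw.decode("utf-8") for raw in response)
        for app_id, instance_id in unhealthy_instances(iter_sse_events(lines)):
            # Placeholder: send an email or push a metric here instead.
            print(f"ALERT: {app_id} instance {instance_id} is unhealthy")
```

Calling `main()` blocks and prints one line per instance that turns unhealthy. It does not reconnect if the stream drops, so a real service would want a retry loop around it.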