Are there any alerting options for scenarios where a Kafka Connect Connector or a Connector task fails or experiences errors?
We have Kafka Connect running and it generally runs well, but when errors do occur we have to trace and discover them manually. Often a connector has been in an error state for a week before a human notices the problem.
Since this post was written and answered, Kafka Connect has begun providing its own official metrics. Apache Kafka Connect exposes these metrics over JMX.
If you use the Confluent Kafka Connect Helm Charts (https://github.com/confluentinc/cp-helm-charts/tree/master/charts/cp-kafka-connect), they include a Prometheus metrics exporter.
I monitor and alert on cp_kafka_connect_connect_connector_metrics{status="running"}, which comes from the Prometheus exporter in the Confluent Helm chart, but there are many variations on that.
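For anyone who wants to see what alerting on that series could look like outside of Prometheus alert rules, here is a rough Python sketch. It is only an illustration under a few assumptions: that the series carries a `connector` label (check what your exporter actually emits), that a Prometheus server scraping the chart is reachable at whatever URL you pass in, and that `expected` is a list of connector names you maintain yourself. `running_connectors`, `not_running` and `fetch` are hypothetical helper names; only the `/api/v1/query` endpoint and its JSON response shape come from the Prometheus HTTP API.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request

# The series mentioned above; the label names on it depend on your exporter config.
QUERY = 'cp_kafka_connect_connect_connector_metrics{status="running"}'


def running_connectors(response: dict, label: str = "connector") -> set[str]:
    """Extract connector names from a Prometheus instant-query response.

    Assumes each returned series has a `label` (hypothetically "connector")
    identifying the connector it belongs to.
    """
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response!r}")
    return {
        sample["metric"][label]
        for sample in response["data"]["result"]
        if label in sample["metric"]
    }


def not_running(expected: list[str], response: dict, label: str = "connector") -> list[str]:
    """Return the expected connectors that have no status="running" series."""
    return sorted(set(expected) - running_connectors(response, label))


def fetch(prometheus_url: str, query: str = QUERY) -> dict:
    """Run an instant query against the Prometheus HTTP API (/api/v1/query)."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

You would call something like `not_running(["orders-sink"], fetch("http://prometheus:9090"))` on a schedule and page someone when the list is non-empty. Both the connector name and the URL there are placeholders.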
Using the official Kafka Connect metrics is generally preferable for any automated monitoring and alerting setup. This option wasn't available when this post was originally written and answered.
FYI, Kafka still doesn't expose lag metrics, so you still need third-party tools to monitor and alert on lag.
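For reference, lag here means the gap between a partition's latest offset and the offset the consumer group has committed. Here is a minimal sketch of that arithmetic, assuming you already have both offset maps from whatever third-party tool you use. `partition_lag` and `lagging_partitions` are hypothetical names, and treating a missing committed offset as 0 is an assumption you may want to change.

```python
from __future__ import annotations


def partition_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag = log end offset - committed offset, floored at 0.

    Partitions with no committed offset are treated as committed at 0
    (an assumption; adjust to match your offset-reset policy).
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag: dict[int, int], threshold: int) -> list[int]:
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)
```

For example, `lagging_partitions(partition_lag({0: 100, 1: 50}, {0: 90, 1: 50}), 5)` flags partition 0 only.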