Tags: ubuntu, consul, hcl

Does Consul 1.16.1 properly implement os_service checks on Ubuntu 20.04?


Ubuntu: 20.04 qemu, Consul: 1.16.1, Vault 1.14.1

I am trying to use Consul to monitor Vault's systemd service, following the OS service check documentation at https://developer.hashicorp.com/consul/docs/services/usage/checks#osservice-check, but the agent fails to start with a "not implemented" error.

My Consul service config:

services = [
  {
    name = "vault"
    port = 8200
    checks = [
      {
        http = "vault1.foo.bar.com:8200/sys/health"
        interval = "15s"
        timeout = "10s"
      },
      {
        name = "Vault Service"
        os_service = "vault.service"
        interval = "15s"
      },
      {
        name = "Vault gRPC health check"
        grpc = "vault1.foo.bar.com:8201"
        grpc_use_tls = true
        interval = "10s"
      },
    ]
  }
]

I've tried several variations of the example entry, as well as bare-bones entries containing only os_service and interval. Every time, I get near-identical logs:

consul[82355]: ==> Starting Consul agent...
consul[82355]:                Version: '1.16.1'
consul[82355]:             Build Date: '2023-08-05 21:56:29 +0000 UTC'
consul[82355]:                Node ID: '3e269875-156b-d7e7-8cfa-2b84c9487ef9'
consul[82355]:              Node name: 'vault1'
consul[82355]:             Datacenter: 'west' (Segment: '')
consul[82355]:                 Server: false (Bootstrap: false)
consul[82355]:            Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: 8502, gRPC-TLS: -1, DNS: 8600)
consul[82355]:           Cluster Addr: 10.12.1.94 (LAN: 8301, WAN: 8302)
consul[82355]:      Gossip Encryption: true
consul[82355]:       Auto-Encrypt-TLS: false
consul[82355]:            ACL Enabled: false
consul[82355]:     ACL Default Policy: allow
consul[82355]:              HTTPS TLS: Verify Incoming: false, Verify Outgoing: false, Min Version: TLSv1_2
consul[82355]:               gRPC TLS: Verify Incoming: false, Min Version: TLSv1_2
consul[82355]:       Internal RPC TLS: Verify Incoming: false, Verify Outgoing: false (Verify Hostname: false), Min Version: TLSv1_2
consul[82355]: ==> Log data will now stream in as it occurs:
consul[82355]: 2023-08-30T21:46:51.171Z [WARN]  agent: skipping file /etc/consul.d/.vault.hcl.swp, extension must be .hcl or .json, or config format must be set
consul[82355]: 2023-08-30T21:46:51.171Z [WARN]  agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
consul[82355]: 2023-08-30T21:46:51.184Z [WARN]  agent.auto_config: skipping file /etc/consul.d/.vault.hcl.swp, extension must be .hcl or .json, or config format must be set
consul[82355]: 2023-08-30T21:46:51.184Z [WARN]  agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
consul[82355]: 2023-08-30T21:46:51.186Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: vault1 10.12.1.94
consul[82355]: 2023-08-30T21:46:51.186Z [INFO]  agent.router: Initializing LAN area manager
consul[82355]: 2023-08-30T21:46:51.189Z [WARN]  agent.client.serf.lan: serf: Failed to re-join any previously known node
consul[82355]: 2023-08-30T21:46:51.189Z [ERROR] agent: error creating OS Service client: error="not implemented"
consul[82355]: 2023-08-30T21:46:51.190Z [ERROR] agent: Error starting agent: error="Failed to register service \"vault\": not implemented"
consul[82355]: 2023-08-30T21:46:51.190Z [INFO]  agent: Exit code: code=1
systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: consul.service: Failed with result 'exit-code'.
systemd[1]: Failed to start "HashiCorp Consul - A service mesh solution".
systemd[1]: consul.service: Scheduled restart job, restart counter is at 5.
systemd[1]: Stopped "HashiCorp Consul - A service mesh solution".
systemd[1]: consul.service: Start request repeated too quickly.
systemd[1]: consul.service: Failed with result 'exit-code'.
systemd[1]: Failed to start "HashiCorp Consul - A service mesh solution".

Searching for that error turns up very little, and reading the code did not enlighten me.

Is this a bug, or PEBCAK?


Solution

  • I also struggled with this until I did some digging in the Consul source code: https://github.com/hashicorp/consul/blob/ac867d67e8240d64333483fdf3e234399740a189/agent/checks/os_service_unix.go#L15C43-L15C43

    type OSServiceClient struct {
    }
    
    func NewOSServiceClient() (*OSServiceClient, error) {
        return nil, fmt.Errorf("not implemented")
    }
    
    func (client *OSServiceClient) Check(serviceName string) error {
        return fmt.Errorf("not implemented")
    }
    

    It seems it's simply... not implemented, at least on non-Windows systems. Interestingly, the documentation indicates that it is available for systemd units.

    As a workaround, until it is implemented, you can run systemctl is-active vault.service as a script check, like so:

    {
      name = "Vault Service"
      args = [
        "systemctl",
        "is-active",
        "vault.service",
      ]
      interval = "15s"
    },
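
    For context, here is a hedged sketch of how that check might sit inside the service definition from the question. Script checks (checks defined with args) are disabled by default, so the agent also needs enable_local_script_checks (or enable_script_checks) set to true. Putting that setting in the same file and the 5s timeout are assumptions on my part, not something taken from the question.

    ```hcl
    # Assumption: this line may already exist elsewhere in your agent config.
    # It permits script checks defined in local config files only.
    enable_local_script_checks = true

    services = [
      {
        name = "vault"
        port = 8200
        checks = [
          # The http and grpc checks from the question can stay as they are.
          {
            name     = "Vault Service"
            args     = ["systemctl", "is-active", "vault.service"]
            interval = "15s"
            timeout  = "5s" # assumed value; tune to taste
          },
        ]
      }
    ]
    ```

    Consul treats exit code 0 as passing, 1 as warning, and any other code as critical. systemctl is-active exits 0 when the unit is active and non-zero otherwise (typically 3 for an inactive unit), so a stopped Vault should show up as critical rather than warning.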