[SOLVED] Using multiple pipelines in Logstash with beats input

Using multiple pipelines in Logstash with beats input

As per an earlier discussion (Defining multiple outputs in Logstash whilst handling potential unavailability of an Elasticsearch instance) I'm now using pipelines in Logstash in order to send data input (from Beats on TCP 5044) to multiple Elasticsearch hosts. The relevant extract from pipelines.yml is shown below.

- pipeline.id: beats
  queue.type: persisted
  config.string: |
          input {
                  beats {
                        port => 5044
                        ssl => true
                        ssl_certificate_authorities => '/etc/logstash/config/certs/ca.crt'
                        ssl_key => '/etc/logstash/config/certs/forwarder-001.pkcs8.key'
                        ssl_certificate => '/etc/logstash/config/certs/forwarder-001.crt'
                        ssl_verify_mode => "force_peer"
                        }
                }
           output { pipeline { send_to => [es100, es101] } }

- pipeline.id: es100
  path.config: "/etc/logstash/pipelines/es100.conf"
- pipeline.id: es101
  path.config: "/etc/logstash/pipelines/es101.conf"

In each of the pipeline .conf files I have the related virtual address i.e. the file /etc/logstash/pipelines/es101.conf includes the following:

input {
  pipeline {
    address => es101
  }
}

This configuration seems to work well i.e. data is received by each of the Elasticsearch hosts es100 and es101.

I need to ensure that if one of these hosts is unavailable, the other still receives data and thanks to a previous tip, I'm now using pipelines which I understand allow for this. However I'm obviously missing something key in this configuration as the data isn't received by a host if the other is unavailable. Any suggestions are gratefully welcomed.

Solution

Firstly, you should configure persistent queues on the downstream pipelines (es100, es101), and size them to contain all the data that arrives during an outage. But even with persistent queues logstash has an at-least-once delivery model. If the persistent queue fills up then back-pressure will cause the beats input to stop accepting data. As the documentation on the output isolator pattern says "If any of the persistent queues of the downstream pipelines ... become full, both outputs will stop". If you really want to make sure an output is never blocked because another output is unavailable then you will need to introduce some software with a different delivery model. For example, configure filebeat to write to kafka, then have two pipelines that read from kafka and write to elasticsearch. If kafka is configured with an at-most-once delivery model (the default) then it will lose data if it cannot deliver it.