I have a basic HA setup for Logstash: two identical nodes in two separate AWS availability zones. Each node runs a pipeline that extracts a dataset from a DB cluster and outputs it downstream to an Elasticsearch cluster for indexing. This works fine with one Logstash node, but two nodes running in parallel send the same data to ES twice, because each node tracks :sql_last_value
separately. Since I use the same ID as the document ID on both nodes, the repeated data is simply updated instead of being indexed twice; in other words, there is one insert and one update per dataset. This is obviously not very efficient and puts unnecessary load on the ELK resources, and it gets worse as additional Logstash nodes are added.
Does anyone know a better way to set up parallel Logstash nodes, so that a node doesn't extract a dataset that has already been extracted by another node? One poor man's solution could be to create a shared NFS folder between the Logstash nodes and have each node write :sql_last_value
there, but I am not sure what side effects I might run into with this setup, especially under higher loads. A rough sketch of that idea is below. Thank you!
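This is only a sketch, assuming the jdbc input's last_run_metadata_path option; the connection details, table, and column names are made up:

input {
  jdbc {
    # connection details below are placeholders
    jdbc_connection_string => "jdbc:postgresql://my-db-cluster:5432/mydb"
    jdbc_user => "logstash"
    jdbc_password => "${JDBC_PASSWORD}"   # resolved from an environment variable or the keystore
    jdbc_driver_library => "/opt/jdbc/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "*/5 * * * *"             # poll every 5 minutes
    # only fetch rows that have not been seen yet
    statement => "SELECT * FROM logs WHERE log_id > :sql_last_value ORDER BY log_id"
    use_column_value => true
    tracking_column => "log_id"
    tracking_column_type => "numeric"
    # point both nodes at the same file on the shared NFS mount
    last_run_metadata_path => "/mnt/nfs/logstash/.jdbc_last_run"
  }
}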
We have the very same scenario: three Logstash instances to ensure high availability, with several databases as data sources.
On each Logstash instance, install and enable the same jdbc pipelines, following this logic.
Here is a simplified example for the easy case (an id is part of the result set):
input {
  jdbc {
    ...
    statement => "select log_id, * from ..."
    ...
  }
}

filter { ... }

output {
  elasticsearch {
    ...
    index => "logs-%{+YYYY.MM.dd}"
    # use the unique id from the result set as the document id,
    # so re-extracted rows overwrite themselves instead of duplicating
    document_id => "%{[log_id]}"
    ...
  }
}
And here is the variant for when your data lacks unique identifiers and you need to generate a fingerprint:
input {
  jdbc {
    ...
    statement => "select * from ..."
    ...
  }
}

filter {
  fingerprint {
    # build an MD5 hash over all fields of the event,
    # stored in the "fingerprint" field by default
    method => "MD5"
    concatenate_all_fields => true
  }
}

output {
  elasticsearch {
    ...
    index => "logs-%{+YYYY.MM.dd}"
    # use the fingerprint as the document id
    document_id => "%{[fingerprint]}"
    ...
  }
}
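If you don't want the hash stored as a field inside the indexed documents, one possible variation (just a sketch, not required) is to write it to @metadata via the fingerprint filter's target option and reference it from there:

filter {
  fingerprint {
    method => "MD5"
    concatenate_all_fields => true
    # write the hash to @metadata instead of the default "fingerprint" field
    target => "[@metadata][fingerprint]"
  }
}

and in the elasticsearch output use document_id => "%{[@metadata][fingerprint]}". Fields under @metadata are not written to the output, so the hash is only used as the _id.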
In both cases, the documents will be created when they are part of the result set for one Logstash instance. All other Logstash instances will get the same documents at a later time. Using the id/fingerprint as _id will update the previously created documents instead of duplicating your data.
Works well for us, give it a try!