We're using WSO2 DAS 3.1.0 to receive events from WSO2 API Manager and forward them to a database.
If we send roughly 70-100 events per second to DAS for 4-5 hours, performance slowly deteriorates and DAS starts lagging behind. Our setup consists of an event receiver, an execution plan (which summarizes events per second) and a publisher to our database. At first we suspected a problem pushing the resulting events to the database, but we've now concluded that this isn't the issue: the database has no problem at all keeping up with the load.
To isolate the problem we added, for example, an event publisher that writes to a file directly from the incoming event receiver (before any handling in our execution plan). When DAS performance deteriorates, this publisher also produces no output for several seconds at a time, so the problem lies in the handling of incoming events. (We also added a queue in front of the step that pushes events to our database, to make sure no back-pressure propagates to the handling of incoming events.)
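For anyone wanting to repeat this check, here is a minimal sketch of how such stalls could be spotted in the file publisher's output. It assumes you have already extracted one arrival timestamp (in milliseconds) per written event; the function name `find_stalls` and the 2-second threshold are made up for illustration and are not part of DAS.

```python
from typing import List, Tuple


def find_stalls(timestamps_ms: List[int], threshold_ms: int = 2000) -> List[Tuple[int, int]]:
    """Return (last_timestamp_before_gap, gap_length) for every gap above the threshold.

    timestamps_ms is assumed to be sorted ascending, one entry per event
    written by the file publisher (hypothetical input format).
    """
    stalls = []
    # Compare each timestamp with the one that follows it.
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap > threshold_ms:
            stalls.append((prev, gap))
    return stalls


if __name__ == "__main__":
    # Steady output, then a 5-second silence, then output again.
    sample = [0, 10, 20, 5020, 5030]
    for start, gap in find_stalls(sample):
        print(f"no output for {gap} ms after t={start} ms")
```

If the gaps show up here as well, before the execution plan has touched the events, the slowdown is upstream of the database publisher.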
The really strange part, however, is that once this behavior occurs (the handling of incoming events in DAS deteriorates), there's no way to get out of it apart from restarting the entire server, after which it works again without problems for several hours. Even if we stop sending events to the server for several days, as soon as we start sending even 1-2 events, several seconds pass between one event being handled and the next, so DAS immediately lags behind the incoming events. It's as if the handling of incoming events gets exponentially slower until we restart DAS.
We'd be very grateful for any clues as to what to change to prevent this behavior (purging internal events has no effect either).
After a lot of debugging we finally found the cause of this.
In our Siddhi statements we use 'group by' with dynamically changing timestamps, which, it turns out, is handled extremely inefficiently, as described in this bug: https://github.com/wso2/siddhi/issues/431.
After patching the classes named in that issue the problem disappeared. However, there is currently still a bug where the product eventually runs out of memory (OOM), since it doesn't release the dynamic 'group by' information.
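To show why a 'group by' on ever-changing timestamps is risky in general, here is a small sketch. It is not Siddhi's implementation, just a toy aggregator; the class name `GroupByCounter` and the `retention_ms` parameter are hypothetical. Every new timestamp value creates a new group, and unless old groups are dropped the per-group state only ever grows.

```python
from typing import Dict, Optional


class GroupByCounter:
    """Toy 'events per second' counter keyed by the event's second (illustrative only)."""

    def __init__(self, retention_ms: Optional[int] = None) -> None:
        # retention_ms=None mimics an engine that never releases group state.
        self.retention_ms = retention_ms
        self.counts: Dict[int, int] = {}

    def add(self, timestamp_ms: int) -> int:
        key = timestamp_ms // 1000  # one group per second of event time
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.retention_ms is not None:
            # Drop groups that are older than the retention window.
            oldest = (timestamp_ms - self.retention_ms) // 1000
            for stale in [k for k in self.counts if k < oldest]:
                del self.counts[stale]
        return self.counts[key]


if __name__ == "__main__":
    unbounded = GroupByCounter()
    bounded = GroupByCounter(retention_ms=5000)
    # One event every 100 ms for one minute of event time.
    for t in range(0, 60_000, 100):
        unbounded.add(t)
        bounded.add(t)
    print(len(unbounded.counts), len(bounded.counts))
```

In the unbounded variant the number of groups equals the number of distinct seconds seen so far, which matches the symptom that only a restart (discarding all state) brings things back to normal.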