[SOLVED] NIFI: limit number of concurrent tasks of a NIFI processor in a NIFI-Cluster

The question says it all. How can I do one of the following things:

How can I limit the number of concurrent tasks running for one processor cluster-wide?
Is there a unique, short ID for the node I'm running on? I could append this ID to the name of the database table I load into (see details below) and so have an exclusive table per connection.

I have a NIFI cluster and a self-written, specialized processor that loads large amounts of data into a database via JDBC (up to 20 million rows per second). It uses some database-vendor-specific tuning tricks to be really fast in my particular case. One of these tricks requires an exclusive, empty table to load into for each connection.

At the moment, my processor opens one connection per Node in the NIFI-Cluster (it takes a connection from the DBCPConnectionPool). With about 90-100 nodes in the cluster, I'd get 90-100 connections - all of them bulk loading data at the same time.

I'm using NIFI 1.3.0.0

Any help or comment is highly appreciated. Sorry for not showing any code: it's about 700 lines and wouldn't really help with the question. I plan to put it on Git as part of the open-source project Kylo.

Solution

A common way of breaking up tasks in NiFi is to split the flow file into multiple flow files on the primary node. Each of the other nodes then pulls one of these flow files and processes it.

In your case, each flow file would contain a range of values to pull from the table. Let's say you had a hundred rows and wanted only 3 nodes to pull data. You'd create 3 flow files, each with its own attribute values:

start-row-id=1, end-row-id=33
start-row-id=34, end-row-id=66
start-row-id=67, end-row-id=100

Then a node would pick up a flow file from a remote process group or a queue (such as JMS or SQS). There are only 3 flow files, so no more than 3 nodes would be loading data at the same time.
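
The range split described above could be sketched roughly like this in plain Java. This is only an illustration, not part of NiFi: the class name RowRangeSplitter and its split method are made up for this example, and it assumes the row ids are contiguous so that an even split by id makes sense.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a NiFi class): splits an inclusive row-id range into chunks. */
class RowRangeSplitter {

    /** Returns at most {@code parts} arrays of {startRowId, endRowId}, both inclusive. */
    static List<long[]> split(long firstRow, long lastRow, int parts) {
        if (parts <= 0 || lastRow < firstRow) {
            throw new IllegalArgumentException("need parts > 0 and lastRow >= firstRow");
        }
        long total = lastRow - firstRow + 1;
        // Never create more chunks than there are rows.
        int chunks = (int) Math.min(parts, total);
        long size = total / chunks;

        List<long[]> ranges = new ArrayList<>();
        long start = firstRow;
        for (int i = 0; i < chunks; i++) {
            // The last chunk absorbs whatever remainder is left over.
            long end = (i == chunks - 1) ? lastRow : start + size - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One line per flow file, using the attribute names from the answer.
        for (long[] r : split(1, 100, 3)) {
            System.out.println("start-row-id=" + r[0] + ", end-row-id=" + r[1]);
        }
    }
}
```

Running main with 100 rows and 3 parts prints the same three ranges as listed above. In a real flow, each range would presumably be written as attributes on its own flow file on the primary node; the number of parts is then the upper bound on how many nodes load at once.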