[SOLVED] Multiple IgniteDataStreamer Instances

The documentation for IgniteDataStreamer's addData method states that "This method can be called from multiple threads in parallel to speed up streaming if needed." Is it also safe to use multiple IgniteDataStreamer instances to stream data into the same cache concurrently?

My use case is that I'm trying to optimize the pre-load of a large cache after my Ignite cluster starts up. I'm starting from something like this:

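// Resources close in reverse order: the JDBC stream first, then the streamer,
// whose close() flushes any entries still buffered.
try (IgniteDataStreamer<K, V> streamer = ignite.dataStreamer("MyCache");
     Stream<Map.Entry<K, V>> stream = jdbcTemplate.queryForStream(
             "select * from foo where '2020-01-01' <= foo.time and foo.time < '2026-01-01'",
             MY_ROW_MAPPER)) {
    // Push every mapped row into the cache through the single streamer.
    stream.forEach(entry -> streamer.addData(entry.getKey(), entry.getValue()));
}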

What I'd like to do is split this up into N jobs (IgniteRunnables), each responsible for loading and streaming a slice of the data (for example, one month's worth) into the cache. I would then distribute the jobs evenly across my cluster by submitting them to Ignite's compute API, and wait until they all finish.
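To make the split concrete, here is a minimal sketch of how the date range from the query above might be cut into one half-open range per month, so that each job gets its own slice. It uses only java.time; the names MonthRanges, Range and split are hypothetical and not part of Ignite or of my actual code.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

/** Splits a half-open date range [from, to) into one sub-range per calendar month. */
final class MonthRanges {

    /** A half-open date range: start inclusive, end exclusive. */
    static final class Range {
        final LocalDate start;
        final LocalDate end;

        Range(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    static List<Range> split(LocalDate from, LocalDate to) {
        List<Range> ranges = new ArrayList<>();
        LocalDate start = from;
        while (start.isBefore(to)) {
            // First day of the month after the one containing 'start'.
            LocalDate nextMonth = YearMonth.from(start).plusMonths(1).atDay(1);
            // Clamp the final range so it never runs past 'to'.
            LocalDate end = nextMonth.isBefore(to) ? nextMonth : to;
            ranges.add(new Range(start, end));
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Range> ranges = split(LocalDate.of(2020, 1, 1), LocalDate.of(2026, 1, 1));
        System.out.println(ranges.size() + " jobs, first " + ranges.get(0)
                + ", last " + ranges.get(ranges.size() - 1));
    }
}
```

Each Range would then become one IgniteRunnable that runs the query with start and end bound as parameters (instead of the hard-coded dates) and opens its own streamer via ignite.dataStreamer("MyCache"). The whole collection of jobs can be handed to ignite.compute().run(...), which returns once all of them have completed.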

My initial attempt seems to work mostly OK, except that I sometimes see warnings like this in my logs after the cache pre-load has finished:

(Log4J2Logger.java:523) Partition states validation has failed for group: MyCache, msg: Partitions update counters are inconsistent for Part...

From what I can tell, this logging is triggered by a partition map exchange (PME) event that happens when an unrelated IgniteAtomicLong gets initialized for the first time.

Solution

There is no reason you cannot run separate IgniteDataStreamer instances on separate hosts, each streaming one portion of the overall data set into the same cache. The limiting factor in this proposed architecture will most likely be the network interface of the database that you are retrieving data from. Hope that helps.