hazelcast-jet

Hazlecast Jet Cluster Processes duplicates


I have deployed 3 spring boot apps with Hazelcast Jet embedded. The nodes recognize each other and run as a cluster. I have the following code: A simple reading from CSV and write to a file. But Jet writes duplicates to the file sink. To be precise, Jet processes total entries in the CSV multiplied by the number of nodes. So, if I have 10 entries in the source and 3 nodes, I see 3 files in the sink each having all 10 entries. I just want to process one record once and only once. Following is my code:

    Pipeline p = Pipeline.create();

    BatchSource<List<String>> source = Sources.filesBuilder("files")
            .glob("*.csv")
            .build(path -> Files.lines(path).skip(1).map(line -> split(line)));

    p.readFrom(source)
            .writeTo(Sinks.filesBuilder("out").build());
    instance.newJob(p).join();

Solution

  • If it's a shared file system, then sharedFileSystem attribute in FilesourceBuilder must be set to true.