google-cloud-platform concurrency pipeline dataflow single-threaded

Can I configure a Dataflow job to be single threaded?


I was trying to configure and deploy a Cloud Dataflow job that is truly single-threaded, to avoid concurrency issues while creating/updating entities in Datastore. I assumed that using an n1-standard-1 machine would ensure that the job runs on a single thread, on a single machine, but I have learned the hard way that this is not the case.

I have gone over the suggestions in an earlier question here: Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

However, I wanted to avoid implementing a windowing approach around this, and wanted to know whether there is a simpler way to configure a job so that it runs single-threaded.

Any suggestions or insights would be greatly appreciated.


Solution

  • I recently learned that single-threaded behavior is achieved by using a single worker of type n1-standard-1 and additionally passing the exec arg --numberOfWorkerHarnessThreads=1, which restricts the worker harness in the JVM to a single processing thread as well.
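
As a rough sketch of what that launch could look like with the Java SDK and Maven: the main class com.example.MyPipeline is a hypothetical placeholder, and pinning both --numWorkers and --maxNumWorkers to 1 is my assumption about how to keep the job on "a single worker" (it is not stated in the answer above). Project, region, and staging options are omitted.

```shell
# Sketch only: com.example.MyPipeline is a hypothetical main class, replace with your own.
# --numWorkers/--maxNumWorkers are assumed here as the way to hold the job at one worker;
# --workerMachineType and --numberOfWorkerHarnessThreads are the settings from the answer.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --workerMachineType=n1-standard-1 \
    --numWorkers=1 \
    --maxNumWorkers=1 \
    --numberOfWorkerHarnessThreads=1"
```

Capping the maximum worker count matters because autoscaling could otherwise add machines, in which case limiting threads per worker would no longer serialize the Datastore writes.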