[SOLVED] How to limit the number of files produced by Apache Gobblin's output?

I am currently using Apache Gobblin to read from a Kafka topic. I went through the docs looking for a config that limits the number of files Gobblin produces, but couldn't find one.

Is it possible to limit this?

Thanks!

Solution

There is no config that directly controls the number of files Gobblin produces for Kafka -> data lake ingestion. Two factors determine the number of output files: 1. the number of workunits created, and 2. whether your pipeline uses a PartitionedDataWriter. With partitioned writes, the number of files is ultimately determined by the input data stream. For instance, if your pipeline uses a TimeBasedAvroWriterPartitioner (commonly used to write files in a YYYY/MM/DD/HH layout) with the event time of the Kafka messages as the partitioning key, you will end up with many small files in your destination system if your input Kafka stream contains a lot of late data, because each late record is written under the directory for its own, older hour.
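To make the late-data effect concrete, here is a rough back-of-the-envelope sketch in Python. It is not Gobblin code: it assumes a simplified model in which each task writes one file per distinct YYYY/MM/DD/HH bucket it sees, and the function names and sample timestamps are made up for illustration.

```python
from datetime import datetime

def hour_bucket(event_ts):
    """Format an event time as a YYYY/MM/DD/HH output directory."""
    return event_ts.strftime("%Y/%m/%d/%H")

def count_output_files(task_events):
    """Simplified model (an assumption, not Gobblin's actual logic):
    each task writes one file per distinct hour bucket it encounters."""
    return sum(len({hour_bucket(ts) for ts in events}) for events in task_events)

# Two hypothetical tasks whose events all fall in the same hour.
on_time = [
    [datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 40)],
    [datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 10, 55)],
]

# The same tasks, but each also receives late events from earlier hours.
with_late = [
    on_time[0] + [datetime(2024, 1, 1, 7, 30), datetime(2024, 1, 1, 8, 10)],
    on_time[1] + [datetime(2024, 1, 1, 7, 45)],
]

print(count_output_files(on_time))    # 2: one file per task
print(count_output_files(with_late))  # 5: 3 buckets for task 0, 2 for task 1
```

Under this model the file count grows with both the number of tasks and the number of distinct time buckets each task touches, which is why fewer tasks or less late data both reduce the number of files.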

However, you do have a few configurations to limit the number of workunits created by the Kafka source in a given run. In the case of Kafka, each workunit corresponds to a subset of topic partitions of a single topic assigned to a single Gobblin task.

mr.job.max.mappers: limits how many mappers (or Gobblin tasks) are created in each run, and thus limits the total number of workunits, and
mr.target.mapper.size: roughly, the maximum number of records each Gobblin task will pull in a single run.

You can lower the first config and raise the second, which reduces the number of workunits and hence the number of output files.
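As an illustration, the relevant lines of a job configuration file might look like the fragment below. The two keys are the ones named above; the values are arbitrary placeholders rather than recommendations, so tune them to your topic volume.

```properties
# Cap the number of mappers (Gobblin tasks) per run.
# The value 4 is an arbitrary example.
mr.job.max.mappers=4

# Let each task pull more per run so that fewer tasks are needed.
# The value 5000000 is an arbitrary example.
mr.target.mapper.size=5000000
```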

In addition to the above configs, Gobblin also has a compaction utility (a MapReduce job) that coalesces the small files produced by the ingestion pipeline into a smaller number of large files. A common production setup is to run compaction on an hourly or daily cadence to limit the number of files in the data lake. See https://gobblin.readthedocs.io/en/latest/user-guide/Compaction/ for more details.