hadoop replication-factor

hadoop distcp between clusters with different replication factors


As some background, we have two clusters, currently used as production and development. We copy files (using hadoop distcp -update) from the production cluster to the development cluster after the live processes have produced them, so the development cluster effectively also works as a DR cluster.

Hadoop version is the same on both clusters: Hadoop 2.6.0-cdh5.12.1

However, the development cluster only has about 65% of the storage capacity of the live cluster. To deal with that, we have a default replication factor of 3 for live and 2 for development.

I've noticed that the files copied from live to development arrive with a replication factor of 3 rather than the development default of 2. I've done some reading and think this is the expected behaviour, even if it's not what I'd like.

I have two questions off the back of this:

Thanks for your help.


Solution

  • After some testing, I did the following:

    Disk usage has started to fall, so I'm counting this as a success. Maybe one day I'll be able to claim I know what I'm doing.
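    For reference, here is a minimal sketch of one approach that fits this setup. It is an assumption about what would work, not a record of the exact steps taken above. The namenode URIs and the path are hypothetical placeholders. The idea is that distcp, when not asked to preserve replication (-pr), should create target files with the client-side default, so overriding dfs.replication for the job should give new copies a factor of 2, while hdfs dfs -setrep lowers the factor on files already copied. The script only prints the commands unless RUN=1 is set, so it can be checked on a test directory first.

```shell
#!/usr/bin/env bash
# Sketch only: the namenode URIs and path below are hypothetical placeholders.
SRC="hdfs://live-nn:8020/data/output"   # hypothetical source on the live cluster
DST="hdfs://dev-nn:8020/data/output"    # hypothetical target on the development cluster
DEV_REPLICATION=2                        # development default, as stated in the question

# Dry run by default: print each command. Set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Copy with the client-side default replication overridden for this job.
# -D is a Hadoop generic option, so it goes before distcp's own options.
copy_with_replication() {
  run hadoop distcp -Ddfs.replication="$DEV_REPLICATION" -update "$SRC" "$DST"
}

# Lower the replication factor of files that were already copied.
# On a directory, -setrep applies to every file beneath it;
# -w waits for re-replication to finish and can take a long time.
lower_existing_replication() {
  run hdfs dfs -setrep -w "$DEV_REPLICATION" "$DST"
}

copy_with_replication
lower_existing_replication
```

    After a real run, hdfs dfs -stat %r on a copied file prints its replication factor, which is a quick way to confirm the result.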