As some background, we have 2 clusters which are currently used as production and development. As part of this, we are copying files (using hadoop distcp -update) from the production cluster to the development cluster after they have been produced by the live processes (ie it effectively also works as a DR cluster).
Hadoop version is the same on both clusters: Hadoop 2.6.0-cdh5.12.1
However, the development cluster only has about 65% of the storage capacity of the live cluster. To deal with that, we have a default replication factor of 3 for live and 2 for development.
I've noticed that the files that are being copied from live to development have a replication factor of 3. I've done some reading and think this is how it should be behaving, even if it's not how I'd like it to behave.
I have two questions off the back of this:
Thanks for your help.
I've done some testing and done the following:
hadoop distcp -update $SOURCE $TARGET
to hadoop distctp -D dfs.replication=2 -update $SOURCE $TARGET
hdfs dfs -setrep -w 2 $TARGET
to amend the replication factor.Disk space has started to fall, so I'm counting this as a success. Maybe one day I'll be able to claim I know what I'm doing.