I'm running into an issue using distcp to copy files: every copy fails with an IOException (checksum mismatch), even for a simple copy within the cluster (e.g. hadoop distcp -pbugctrx /foo/bar /foo/baz).
If I force the copy to complete using -skipcrccheck, I can see that the HDFS checksums differ (hdfs dfs -checksum), but this isn't caused by a difference in the actual data (hdfs dfs -cat | md5sum returns matching checksums for source and destination).
I'm leery of disabling a data integrity check if I don't need to. Is there a better way to address this failing check than just ignoring it?
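The source and target may be in different encryption zones. In that case the checksum comparison will fail as well: HDFS computes the file checksum over the encrypted block data, which differs between zones even when the decrypted contents are identical.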
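
If encryption zones do turn out to be the cause, one way to keep an integrity check while passing -skipcrccheck is to compare the decrypted contents after the copy, much like the hdfs dfs -cat | md5sum check in the question. The following is only a sketch under assumptions: it needs bash and md5sum, the names content_md5, same_content, and CAT_CMD are made up for illustration, and part-00000 is a hypothetical file name.

```shell
#!/usr/bin/env bash
# Sketch: verify a copy by content when HDFS file checksums are not comparable
# (for example, across encryption zones). Assumes bash and md5sum are available.
set -o pipefail

# CAT_CMD is a hypothetical override hook; by default files are read from HDFS.
CAT_CMD=${CAT_CMD:-"hdfs dfs -cat"}

# Print the MD5 digest of one file's (decrypted) contents.
# Returns non-zero if the file cannot be read.
content_md5() {
    $CAT_CMD "$1" | md5sum | cut -d' ' -f1
}

# Return 0 if both files have identical contents, 1 if they differ,
# and 2 if either file could not be read.
same_content() {
    local a b
    a=$(content_md5 "$1") || return 2
    b=$(content_md5 "$2") || return 2
    [ "$a" = "$b" ]
}

# Example usage (commented out; paths are from the question, file name is hypothetical):
# hadoop distcp -update -skipcrccheck /foo/bar /foo/baz
# same_content /foo/bar/part-00000 /foo/baz/part-00000 && echo "contents match"
```

This streams every byte through the client, so it is much slower than the built-in checksum comparison and is probably only practical for spot checks or modest data sizes.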