hadoop, mapreduce, deduplication

Deciding the key-value pair for deduplication using Hadoop MapReduce


I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by calculating the MD5 sum of every file present in the input directory in my mapper function. These MD5 hashes would be the keys to the reducer, so files with the same hash would go to the same reducer.

The default for the mapper in Hadoop is that the key is the byte offset of each line and the value is the content of that line.

Also, I read that if a file is big, it is split into chunks of 64 MB, which is the default block size in Hadoop.

How can I set the key and value to be the name of the file, so that in my mapper I can compute the hash of the file? Also, how can I ensure that no two nodes compute the hash for the same file?


Solution

  • If you need the entire file to go to one mapper, then you need isSplitable to return false. In this scenario you can take in the whole file as input to the mapper, compute the MD5 over it, and emit that as the key. Because the file is never split, each file becomes exactly one input split handled by a single map task, so no two nodes end up hashing the same file.

    WholeFileInputFormat (not part of the Hadoop codebase itself) can be used here. You can find the implementation online, or it is available in the Hadoop: The Definitive Guide book. A minimal sketch of this input format and its record reader is included after the links below.

    The value can be the file name. Calling getInputSplit() on the Context instance gives you the input split, which can be cast to a FileSplit; fileSplit.getPath().getName() then yields the file name, which can be emitted as the value. The mapper sketch after the links below shows this.

    I have not worked with org.apache.hadoop.hdfs.util.MD5FileUtils, but the javadocs suggest it might work well for you.

    Textbook source links for WholeFileInputFormat and the associated RecordReader are included below for reference:

    1) WholeFileInputFormat

    2) WholeFileRecordReader

    The grepcode link to MD5FileUtils is also included for reference.
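
Below is a minimal sketch of such a non-splittable input format and its record reader, written against the new (org.apache.hadoop.mapreduce) API. It follows the spirit of the WholeFileInputFormat from the book; the class names and details here are illustrative, not the book's exact code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Non-splittable input format: each input file becomes exactly one split,
// so exactly one map task reads (and hashes) each file.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file, regardless of its size
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
    }
}

// Emits a single record per file: key = NullWritable, value = the file's bytes.
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the whole file into memory as the single value of this split.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() {
        // nothing to close; the stream is closed in nextKeyValue()
    }
}
```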
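
And a hedged sketch of the mapper side, assuming the input format above: the file name comes from the FileSplit returned by context.getInputSplit(), the MD5 is computed over the file bytes (here with Hadoop's org.apache.hadoop.io.MD5Hash, though MD5FileUtils or java.security.MessageDigest would also do), and (hash, filename) is emitted so that identical files meet at the same reducer.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Md5Mapper extends Mapper<NullWritable, BytesWritable, Text, Text> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The split for this task covers exactly one whole file; its path gives the name.
        FileSplit split = (FileSplit) context.getInputSplit();
        String fileName = split.getPath().getName();

        // Hash only the valid bytes (getLength), not the padded backing array.
        byte[] contents = Arrays.copyOf(value.getBytes(), value.getLength());
        String md5 = MD5Hash.digest(contents).toString();

        // key = MD5 hash, value = file name: duplicate files group at one reducer.
        context.write(new Text(md5), new Text(fileName));
    }
}
```

In the reducer, all file names arriving under the same hash key are duplicates of one another, so you can keep the first and list (or remove) the rest.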