Tags: hadoop, mapreduce, hdfs, p2p, libtorrent

How can I directly calculate the magnet link of a file or directory on HDFS through MapReduce?


Solution

  • The hard part to calculate in a magnet link is the info-hash: the SHA-1 hash of the info dictionary of the .torrent file. The info dictionary is a bencoded structure that contains file names, file sizes, the piece size and (importantly) a list of SHA-1 hashes of all pieces. The complicated part of building an info dictionary is calculating those piece hashes, so let's focus on that.
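
    For reference before diving in, the very last step, turning the bencoded info dictionary into a magnet link, is only a few lines. A minimal Java sketch (the class and method names are made up):

        import java.net.URLEncoder;
        import java.security.MessageDigest;

        public class MagnetLink {
            // The info-hash is the SHA-1 digest of the raw bencoded info dictionary.
            public static String fromInfoDict(byte[] bencodedInfoDict, String displayName)
                    throws Exception {
                byte[] digest = MessageDigest.getInstance("SHA-1").digest(bencodedInfoDict);
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) hex.append(String.format("%02x", b));
                // v1 magnet link: xt=urn:btih:<40 hex chars>; dn (display name) is optional.
                return "magnet:?xt=urn:btih:" + hex
                        + "&dn=" + URLEncoder.encode(displayName, "UTF-8");
            }
        }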

    Also, for multi-file torrents, all the file payload is logically concatenated for the purposes of calculating and validating piece hashes. In your case, it sounds like you are mostly interested in single-file torrents, which keeps things a bit simpler, so let's focus on that.

    I'm not familiar with HDFS specifically, but calculating the SHA-1 piece hashes can be done in parallel. You decide the piece size; technically it can be any size, but I would expect a lot of clients to assume a power of two (or at the very least a size divisible by 16 KiB). The hashes therefore have to be computed over chunks of the file aligned to the piece size: e.g. with a piece size of 1 MiB, you hash the 1st mebibyte of the file, the 2nd mebibyte, and so on. This can all be done in parallel.
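
    HDFS does in fact support positioned reads (FSDataInputStream.readFully(position, ...)), so as a rough sketch of that step, even without MapReduce, you could hash all pieces of a file with a plain thread pool. The class name is made up and error handling is omitted; note the last piece is allowed to be shorter than the piece size:

        import java.security.MessageDigest;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PieceHasher {
            // Hashes every piece of an HDFS file in parallel; returns hashes in piece order.
            public static List<byte[]> hashPieces(Path file, long pieceLen, int threads)
                    throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                long fileLen = fs.getFileStatus(file).getLen();
                int pieces = (int) ((fileLen + pieceLen - 1) / pieceLen);
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                List<Future<byte[]>> futures = new ArrayList<>();
                for (int i = 0; i < pieces; i++) {
                    final long offset = (long) i * pieceLen;
                    final int len = (int) Math.min(pieceLen, fileLen - offset); // last piece may be short
                    futures.add(pool.submit(() -> {
                        byte[] buf = new byte[len];
                        try (FSDataInputStream in = fs.open(file)) {
                            in.readFully(offset, buf, 0, len); // positioned read, no shared stream state
                        }
                        return MessageDigest.getInstance("SHA-1").digest(buf);
                    }));
                }
                List<byte[]> hashes = new ArrayList<>();
                for (Future<byte[]> f : futures) hashes.add(f.get()); // futures are in piece order
                pool.shutdown();
                return hashes;
            }
        }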

    For multi-file torrents this gets a bit more complicated, as piece boundaries may no longer fall at file offsets divisible by the piece size. If HDFS doesn't give you arbitrary parallel access to files, that might be a problem.

    Once you have the piece hashes, put them in the info dictionary and run one last SHA-1 pass over its bencoded form, and you have the info-hash.
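
    Assembling the dictionary itself is straightforward; here is a minimal sketch for a single-file v1 torrent. Bencoded dictionary keys must appear in sorted byte order, and the "pieces" value is a single string: the 20-byte hashes concatenated in piece order.

        import java.io.ByteArrayOutputStream;
        import java.nio.charset.StandardCharsets;
        import java.util.List;

        public class InfoDict {
            // Bencodes a minimal single-file v1 info dictionary. Keys must appear in
            // lexicographical byte order: length, name, piece length, pieces.
            public static byte[] encode(String name, long fileLength, long pieceLength,
                                        List<byte[]> pieceHashes) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                out.write('d');
                str(out, "length");       num(out, fileLength);
                str(out, "name");         str(out, name);
                str(out, "piece length"); num(out, pieceLength);
                str(out, "pieces");
                raw(out, pieceHashes.size() * 20L + ":"); // one string of concatenated hashes
                for (byte[] h : pieceHashes) out.write(h, 0, h.length);
                out.write('e');
                return out.toByteArray();
            }

            private static void str(ByteArrayOutputStream out, String s) {
                byte[] b = s.getBytes(StandardCharsets.UTF_8);
                raw(out, b.length + ":");
                out.write(b, 0, b.length);
            }

            private static void num(ByteArrayOutputStream out, long v) {
                raw(out, "i" + v + "e"); // bencoded integer
            }

            private static void raw(ByteArrayOutputStream out, String ascii) {
                byte[] b = ascii.getBytes(StandardCharsets.US_ASCII);
                out.write(b, 0, b.length);
            }
        }

    The byte array this returns is exactly what you would feed into the SHA-1/magnet step sketched at the top.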

    If you want to support BitTorrent v2, files are validated by a Merkle hash tree instead, using SHA-256 and 16 KiB leaf blocks. This also has the benefit that each file is hashed independently, so you avoid the piece-alignment issues.
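
    A simplified sketch of such a per-file Merkle root (SHA-256 over 16 KiB leaf blocks, with the leaf layer padded with all-zero hashes up to a power of two) could look like the following. BEP 52 has additional rules about piece layers and padding that this deliberately glosses over, so treat it as an illustration of the shape of the computation, not a spec-compliant implementation:

        import java.security.MessageDigest;
        import java.util.ArrayList;
        import java.util.List;

        public class MerkleRoot {
            static final int LEAF = 16 * 1024; // 16 KiB leaf blocks

            public static byte[] root(byte[] fileData) throws Exception {
                MessageDigest sha = MessageDigest.getInstance("SHA-256");
                List<byte[]> layer = new ArrayList<>();
                // Leaf layer: SHA-256 of each 16 KiB block of the file.
                for (int off = 0; off < fileData.length; off += LEAF) {
                    sha.update(fileData, off, Math.min(LEAF, fileData.length - off));
                    layer.add(sha.digest()); // digest() also resets the MessageDigest
                }
                // Pad the leaf layer with all-zero hashes up to a power of two.
                while (Integer.bitCount(layer.size()) != 1) layer.add(new byte[32]);
                // Hash pairs of nodes until only the root remains.
                while (layer.size() > 1) {
                    List<byte[]> next = new ArrayList<>();
                    for (int i = 0; i < layer.size(); i += 2) {
                        sha.update(layer.get(i));
                        sha.update(layer.get(i + 1));
                        next.add(sha.digest());
                    }
                    layer = next;
                }
                return layer.get(0);
            }
        }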

    In short, as long as you can read and hash power-of-two-sized blocks in parallel, your reduce step is simply to build the info dictionary, put the piece hashes in there, and SHA-1 hash it again. A sketch of the MapReduce shape of this follows.
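
    As a final (hypothetical, untested) illustration: Hadoop's FixedLengthInputFormat, with its record length set to the piece size via FixedLengthInputFormat.setRecordLength(conf, pieceLen), hands each map call exactly one piece, keyed by its byte offset. One caveat: FixedLengthInputFormat expects the input length to be a multiple of the record length, so a short final piece would need separate handling.

        import java.io.IOException;
        import java.security.MessageDigest;
        import java.security.NoSuchAlgorithmException;

        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.mapreduce.Mapper;

        // Map step: emit (piece index, SHA-1 of piece) for each fixed-length record.
        public class PieceHashMapper
                extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {
            public static final long PIECE_LEN = 1L << 20; // hypothetical 1 MiB piece size

            @Override
            protected void map(LongWritable byteOffset, BytesWritable piece, Context ctx)
                    throws IOException, InterruptedException {
                try {
                    MessageDigest sha = MessageDigest.getInstance("SHA-1");
                    sha.update(piece.getBytes(), 0, piece.getLength()); // backing array may be padded
                    ctx.write(new LongWritable(byteOffset.get() / PIECE_LEN),
                              new BytesWritable(sha.digest()));
                } catch (NoSuchAlgorithmException e) {
                    throw new IOException(e);
                }
            }
        }

    A single reducer then sees the piece indices in ascending key order, so it only has to concatenate the 20-byte hashes and run the info-dictionary and SHA-1 steps sketched above.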