hadoop

Change block size of dfs file


My map phase is currently inefficient when parsing one particular set of files (2 TB in total). I'd like to change the block size of those files in the Hadoop dfs (from 64 MB to 128 MB). I can't find anything in the documentation on how to do this for only one set of files rather than for the entire cluster.

Which command changes the block size when I upload, for example when copying from the local filesystem to the dfs?


Solution

  • I've changed my answer! You just need to set the dfs.block.size configuration property (dfs.blocksize in Hadoop 2.x and later) appropriately when you use the command line:

    hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
    

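    To confirm that the uploaded file actually got the new block size, you can query it afterwards; this assumes your Hadoop version's hadoop fs -stat command supports the %o (block size) format specifier:

    hadoop fs -stat %o remote_location
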
    Original Answer

    You can programmatically specify the block size when you create a file with the Hadoop API. Unfortunately, you can't do this on the command line with the hadoop fs -put command. To do what you want, you'll have to write your own code to copy the local file to a remote location; it's not hard: open a FileInputStream for the local file, create the remote OutputStream with the FileSystem.create overload that takes a block size, and then use something like IOUtils.copy from Apache Commons IO to copy between the two streams.
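
    A minimal sketch of that approach, reusing the local_name / remote_location placeholders from the command above (the 128 MB block size and the buffer-size and replication values are illustrative choices, not required ones):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.commons.io.IOUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // 128 MB block size for this one file; the cluster-wide default is untouched.
            long blockSize = 128L * 1024 * 1024;
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);
            short replication = fs.getDefaultReplication();

            InputStream in = new FileInputStream("local_name");
            // FileSystem.create accepts a per-file block size.
            OutputStream out = fs.create(new Path("remote_location"),
                    true, bufferSize, replication, blockSize);
            try {
                IOUtils.copy(in, out);   // Apache Commons IO
            } finally {
                IOUtils.closeQuietly(out);
                IOUtils.closeQuietly(in);
            }
        }
    }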