Tags: java, hadoop, mapreduce

Does a MapReduce program consume all the files (input dataset) in a folder by default?


Hello good folks at Stackoverflow,

I ran a MapReduce program that finds the unique words in a file. The input dataset (file) was in a folder in HDFS, so I gave the name of the folder as the input when I ran the program.

I didn't realize that there were two more files in the same folder. The MapReduce program went ahead and read through all three files and produced the output. The output is fine.

Is this the default behaviour of MapReduce? That is, if you point it at a folder rather than a single file (as the input dataset), does it consume all the files in that folder? The reason I am surprised is that there is no code in the mapper to read multiple files. I understand that the first argument, args[0], in the driver program is the folder name I gave.

This is the driver code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataSort {

    public static void main(String[] args) throws Exception {

        /*
         * Validate that two arguments were passed from the command line.
         */
        if (args.length != 2) {
            System.out.printf("Usage: DataSort <input dir> <output dir>\n");
            System.exit(-1);
        }

        Job job = Job.getInstance();

        /*
         * Specify the jar file that contains your driver, mapper, and reducer.
         * Hadoop will transfer this jar file to nodes in your cluster running
         * mapper and reducer tasks.
         */
        job.setJarByClass(DataSort.class);

        /*
         * Specify an easily-decipherable name for the job.
         * This job name will appear in reports and logs.
         */
        job.setJobName("Data Sort");

        /*
         * Specify the input path (a file or a directory) and the output directory.
         */
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(ValueIdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        /*
         * Start the MapReduce job and wait for it to finish.
         * If it finishes successfully, return 0. If not, return 1.
         */
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
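
For what it's worth, you can check up front which files a directory input will pull in. Below is a minimal sketch (assuming the default HDFS FileSystem from the cluster configuration; the class name ListInputFiles is just for illustration) that lists the files FileInputFormat would pick up under the folder passed as args[0]:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListInputFiles {

    public static void main(String[] args) throws Exception {
        // Connect to the FileSystem named in the cluster configuration (HDFS here).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0] is the same folder you would hand to the MapReduce driver.
        Path input = new Path(args[0]);

        // listStatus() on a directory returns one entry per child;
        // every plain file listed here becomes map input.
        for (FileStatus status : fs.listStatus(input)) {
            if (status.isFile()) {
                System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            }
        }
    }
}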

Mapper Code:

import java.io.IOException;  
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValueIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // The key is the byte offset of this line within its file; the value is the line itself.
        String line = value.toString();

        // Split the line on runs of non-word characters and emit (word, 1) for each word.
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}
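
Note that the mapper never opens a file itself; it is handed one record at a time, whichever file that record came from. If you ever need to know the source file, you can ask the framework for the current input split. A minimal sketch of that idea (the class name FileAwareMapper and the field sourceFile are just for illustration; it assumes a FileInputFormat-based input such as the default TextInputFormat):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private String sourceFile;

    @Override
    protected void setup(Context context) {
        // Each map task processes one input split, and a split belongs to exactly
        // one file, so the file name can be looked up once per task.
        sourceFile = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\W+")) {
            if (word.length() > 0) {
                // Tag each word with the file it was read from, e.g. "file1.txt:hello".
                context.write(new Text(sourceFile + ":" + word), new IntWritable(1));
            }
        }
    }
}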

Reducer Code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentityReducer extends Reducer<Text, IntWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Each distinct word reaches reduce() exactly once, so emitting the key
        // with an empty value produces the list of unique words.
        context.write(key, new Text(""));
    }
}
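
As an aside, the mapper is already emitting a count of 1 per word, so the values are there if you ever want word counts rather than just the unique words. A hedged sketch of that variant, not part of the original code (the driver's setOutputValueClass would then need to be IntWritable.class):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper for this word, across all input files.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}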

Solution

  • Is this the default behaviour of mapreduce?

    Not of mapreduce, just of the InputFormat you used.

    FileInputFormat API Reference

    setInputPaths(JobConf conf, Path... inputPaths)

    Set the array of Paths as the list of inputs for the map-reduce job.

    Path API Reference

    Names a file or directory in a FileSystem.

    So, when you say

    there is no code to read multiple files

    Yes, there actually is; you just didn't need to write it.

    The InputFormat (TextInputFormat here, since none is set explicitly) and the RecordReader it creates handle the file offsets for every file under the specified input path, and hand each line to your Mapper<LongWritable, Text, ...> one record at a time; see the sketch below.
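
    Concretely, the input can be a single file, a whole directory, or several paths at once, and FileInputFormat also lets you filter which files in a directory are read. A minimal sketch of those options in a driver (the paths are hypothetical, and the filter class is just for illustration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathExamples {

    // Hypothetical filter: only read files ending in .txt from the input directories.
    public static class TxtOnlyFilter implements PathFilter {
        @Override
        public boolean accept(Path path) {
            return path.getName().endsWith(".txt");
        }
    }

    public static void configure(Job job) throws Exception {
        // Option A: a directory; every file directly under it becomes map input.
        FileInputFormat.setInputPaths(job, new Path("/user/me/input"));

        // Option B: several paths at once (files and/or directories).
        // Note that setInputPaths replaces anything set previously.
        FileInputFormat.setInputPaths(job,
                new Path("/user/me/input/file1.txt"),
                new Path("/user/me/more-input"));

        // Option C: add paths incrementally on top of what is already set.
        FileInputFormat.addInputPath(job, new Path("/user/me/extra"));

        // Optionally restrict which files in those directories are read.
        FileInputFormat.setInputPathFilter(job, TxtOnlyFilter.class);
    }
}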