I have a small SQLite database (postcode -> US city name) and a big S3 file of users. I would like to map every user to the city name associated with their postcode.
I am following the famous WordCount.java example, but I'm not sure how MapReduce works internally:
MapReduce is a framework for writing applications that process big data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce executes on top of HDFS (Hadoop Distributed File System) in two phases: the map phase and the reduce phase.
Answer to your first question ("Is my mapper created once per S3 input file?"):
One Mapper instance is created per input split, and by default one split is created per HDFS block, so the mapper is per split, not per file.
A high-level overview of the data flow looks like this:
input file -> InputFormat -> Splits -> RecordReader -> Mapper -> Partitioner -> Shuffle & Sort -> Reducer -> final output
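To make that pipeline concrete, here is a minimal driver sketch showing where each stage is configured. The class names CityLookupDriver, PostcodeMapper, and CityReducer are hypothetical placeholders for your own classes, not from any code you posted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class CityLookupDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "postcode to city");
        job.setJarByClass(CityLookupDriver.class);

        // InputFormat decides how the input is split; TextInputFormat
        // creates one split per HDFS block by default.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(PostcodeMapper.class);      // hypothetical mapper class
        job.setReducerClass(CityReducer.class);        // hypothetical reducer class

        // HashPartitioner is the default; shown here only to mark where
        // the Partitioner stage of the pipeline is configured.
        job.setPartitionerClass(HashPartitioner.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}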
For example, a 1 GB input file with the default HDFS block size of 128 MB is divided into 8 splits, so 8 mapper instances are created to process it.
Answer to your 2nd question: below are the three standard lifecycle methods of a Mapper.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Runs once per mapper instance, before the first map() call.
        System.out.println("called only once, at startup");
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Runs once per input record: filter/transform your data here.
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Runs once per mapper instance, after the last map() call.
        System.out.println("called only once, at the end");
    }
}
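Applied to your postcode-to-city problem, the usual pattern is to load the small lookup table once in setup() and consult it for every record in map(). Below is a hedged sketch that assumes you export the SQLite table to a small side file named postcodes.csv (one "postcode,city" pair per line) and ship it with the job, e.g. via job.addCacheFile(...); the file name, the record layout, and the PostcodeCityMapper class are all assumptions, not from your code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PostcodeCityMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Small postcode -> city lookup, built once per mapper in setup().
    private final Map<String, String> cityByPostcode = new HashMap<>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Assumption: postcodes.csv was exported from the SQLite database
        // and shipped with the job, so it is readable from the task's
        // working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("postcodes.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    cityByPostcode.put(parts[0].trim(), parts[1].trim());
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: each user record from the S3 file is "userId,postcode,...".
        String[] fields = value.toString().split(",");
        if (fields.length >= 2) {
            String city = cityByPostcode.get(fields[1].trim());
            if (city != null) {
                context.write(new Text(fields[0]), new Text(city));
            }
        }
    }
}

Because setup() runs once per split rather than once per record, the lookup table is parsed only once per mapper, which is exactly where these lifecycle methods pay off for your use case.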