I am implementing my own Writable for Hadoop secondary sort, but when running the job, Hadoop keeps throwing EOFException in my readFields
method and I don't know what's wrong with it.
Error stack trace:
java.lang.Exception: java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:165)
at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:158)
at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
at writable.WikiWritable.readFields(WikiWritable.java:39)
at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:158)
... 12 more
My code:
package writable;
import org.apache.hadoop.io.*;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class WikiWritable implements WritableComparable<WikiWritable> {
private IntWritable docId;
private IntWritable position;
public WikiWritable() {
this.docId = new IntWritable();
this.position = new IntWritable();
}
public void set(String docId, int position) {
this.docId = new IntWritable(Integer.valueOf(docId));
this.position = new IntWritable(position);
}
@Override
public int compareTo(WikiWritable o) {
int result = this.docId.compareTo(o.docId);
result = result == 0 ? this.position.compareTo(o.position) : result;
return result;
}
@Override
public void write(DataOutput dataOutput) throws IOException {
docId.write(dataOutput);
position.write(dataOutput); // error here
}
@Override
public void readFields(DataInput dataInput) throws IOException {
docId.readFields(dataInput);
position.readFields(dataInput);
}
public IntWritable getDocId() {
return docId;
}
public int getPosition() {
return Integer.valueOf(position.toString());
}
}
// Driver
public class Driver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Path wiki = new Path(args[0]);
Path out = new Path(args[1]);
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "myjob");
TextInputFormat.addInputPath(job, wiki);
TextOutputFormat.setOutputPath(job, out);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(WikiWritable.class);
job.setJarByClass(Driver.class);
job.setMapperClass(WordMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setReducerClass(WordReducer.class);
job.setPartitionerClass(WikiPartitioner.class);
job.setGroupingComparatorClass(WikiComparator.class);
job.waitForCompletion(true);
}
}
// Mapper.map
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(",");
String id = words[0];
String[] contents = words[3].toLowerCase().replaceAll("[^a-z]+", " ").split("\\s+");
for (int i = 0; i < contents.length; i++) {
String word = contents[i].trim();
word = stem(word);
WikiWritable output = new WikiWritable();
output.set(id, i);
context.write(new Text(contents[i]), output);
}
}
// Comparator
public class WikiComparator extends WritableComparator {
public WikiComparator() {
super(WikiWritable.class, true);
}
@Override
public int compare(WritableComparable wc1, WritableComparable wc2) {
WikiWritable w1 = (WikiWritable) wc1;
WikiWritable w2 = (WikiWritable) wc2;
return w1.compareTo(w2);
}
}
// Partitioner
public class WikiPartitioner extends Partitioner<WikiWritable, Text> {
@Override
public int getPartition(WikiWritable wikiWritable, Text text, int i) {
return Math.abs(wikiWritable.getDocId().hashCode() % i);
}
}
// Reducer
public class WordReducer extends Reducer<Text, WikiWritable, Text, Text> {
@Override
protected void reduce(Text key, Iterable<WikiWritable> values, Context ctx) throws IOException, InterruptedException {
Map<String, StringBuilder> map = new HashMap<>();
for (WikiWritable w : values) {
String id = String.valueOf(w.getDocId());
if (map.containsKey(id)) {
map.get(id).append(w.getPosition()).append(".");
} else {
map.put(id, new StringBuilder());
map.get(id).append(".").append(w.getPosition()).append(".");
}
}
StringBuilder builder = new StringBuilder();
map.keySet().forEach((k) -> {
map.get(k).deleteCharAt(map.get(k).length() - 1);
builder.append(k).append(map.get(k)).append(";");
});
ctx.write(key, new Text(builder.toString()));
}
}
When constructing a new WikiWritable
, the mapper first calls new WikiWritable()
and then calls set(...)
.
I tried changing docId
and position
to String and Integer and use dataOutput.read()
(I forgot the exact method name but it's something similar) and still doesn't work.
TLDR: You just need to remove your WikiComparator
completely, and not call job.setGroupingComparatorClass
at all.
Explanation:
The group comparator is intended to compare the map output keys, not the map output values. Your map output keys are Text
objects and the values are WikiWritable
objects.
This means that the bytes which are passed to your comparator for deserialisation represent serialised Text
objects. However, the WikiComparator
uses reflection to create WikiWritable
objects (as instructed in its constructor), and then tries to deserialise the Text
objects using the WikiWritable.readFields
method. This obviously leads to wrong reading and consequently to the exception you see.
That said, I believe that you don't need a comparator at all, since the default WritableComparator
does exactly what yours does: calls the compareTo
method for the pair of objects that is passed to it.
EDIT: The compareTo
method that is called is comparing your keys, not your values, so it compares Text
objects. If you want to compare and sort your WikiWritable
s you should consider adding them to a Composite Key. There are plenty of tutorials around on Composite Keys and secondary sorting.