hadoop mapreduce inverted-index

Hadoop MapReduce Inverted Index: Retrieve Line Number


I'm trying to build an inverted index with Hadoop MapReduce. Given text files as input, I want to produce the following output: word: (file#1, line#1, line#2, …) (file#4, line#1, line#2, …) …


Solution

  • After some hours of research, I've found the solution online:

    https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-recordreader-example/.

    A custom RecordReader class is needed, alongside a custom FileInputFormat subclass that returns it, so that the line number is passed as the key to the map method instead of the default byte offset. Inside the RecordReader implementation, custom fields can be declared and the reading of the input files can be fully managed.

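    In this case, it is enough to add a new int field (called lineNumber, for example) to the custom RecordReader implementation and increment it each time a line is read, in the nextKeyValue() method. Note that each input split gets its own RecordReader, so the counter is relative to the whole file only if the file is not split, for example by overriding isSplitable() in the custom FileInputFormat to return false.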
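
To make the counter idea concrete, here is a minimal sketch in plain Java, without the Hadoop dependencies. It is not the code from the linked article. The nested LineNumberReaderSketch class only mimics the nextKeyValue() / getCurrentKey() / getCurrentValue() contract of a RecordReader; in a real job you would extend org.apache.hadoop.mapreduce.RecordReader instead, and the grouping done in index() would be split between a mapper and a reducer. All class names, the index() helper, and the file names a.txt and b.txt are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {

    /** Hadoop-free stand-in for a custom RecordReader: key = line number, value = line text. */
    static class LineNumberReaderSketch {
        private final BufferedReader in;
        private long lineNumber = 0;   // the extra field described in the answer
        private String currentLine;

        LineNumberReaderSketch(BufferedReader in) {
            this.in = in;
        }

        /** Mirrors nextKeyValue(): read one line and bump the counter. */
        boolean nextKeyValue() throws IOException {
            currentLine = in.readLine();
            if (currentLine == null) {
                return false;          // end of input, nothing more to emit
            }
            lineNumber++;
            return true;
        }

        long getCurrentKey() {
            return lineNumber;
        }

        String getCurrentValue() {
            return currentLine;
        }
    }

    /** Plays the role of map + reduce for one file: word -> file -> line numbers. */
    static void index(String fileName, String contents,
                      Map<String, Map<String, SortedSet<Long>>> index) {
        LineNumberReaderSketch reader =
                new LineNumberReaderSketch(new BufferedReader(new StringReader(contents)));
        try {
            while (reader.nextKeyValue()) {
                long line = reader.getCurrentKey();
                for (String word : reader.getCurrentValue().toLowerCase().split("\\W+")) {
                    if (word.isEmpty()) {
                        continue;      // split() can yield an empty first token
                    }
                    index.computeIfAbsent(word, w -> new TreeMap<>())
                         .computeIfAbsent(fileName, f -> new TreeSet<>())
                         .add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, SortedSet<Long>>> index = new TreeMap<>();
        index("a.txt", "hello world\nhello hadoop", index);
        index("b.txt", "hadoop\nworld of hadoop", index);
        // One line per word, with the files and line numbers where it occurs.
        index.forEach((word, postings) -> System.out.println(word + ": " + postings));
    }
}
```

Line numbers start at 1 here because the counter is incremented before the key is read; start the field at -1 if you prefer 0-based numbering. A sorted set is used so that a word appearing twice on the same line is only listed once.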