I'm trying to build an inverted index search using Hadoop MapReduce. Given text files as input, I want to produce the following output: word: (file#1, line#1, line#2, …) (file#4, line#1, line#2, …) …
After some hours of research, I found a solution online:
https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-recordreader-example/
A custom RecordReader class is needed, alongside a custom FileInputFormat subclass, so that each record handed to the map method has the line number as its key. Inside the RecordReader implementation, custom fields can be declared and the reading of the input files can be fully managed.
In this case, it is enough to add a new int field (called lineNumber, for example) to the custom RecordReader implementation and to increment it whenever a line is read (in the nextKeyValue() method).
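To illustrate the counter logic, here is a minimal sketch that runs without Hadoop on the classpath. It only mirrors the method names of org.apache.hadoop.mapreduce.RecordReader (nextKeyValue(), getCurrentKey(), getCurrentValue()) and does not extend it. The class name LineNumberRecordReaderSketch and the BufferedReader input are assumptions made for the example, not code from the linked article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hadoop-free sketch (hypothetical class): same method names as Hadoop's
// RecordReader, but plain Java types instead of LongWritable/Text.
class LineNumberRecordReaderSketch {
    private final BufferedReader in;
    private int lineNumber = 0;   // the custom field described above
    private String currentLine;   // value of the current record

    LineNumberRecordReaderSketch(BufferedReader in) {
        this.in = in;
    }

    // Reads the next line and returns false at end of input.
    // The real Hadoop method declares IOException; it is wrapped here
    // only to keep the sketch easy to call.
    boolean nextKeyValue() {
        try {
            currentLine = in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (currentLine == null) {
            return false;
        }
        lineNumber++;  // incremented once per line read, so numbering starts at 1
        return true;
    }

    // Key handed to the mapper: the line number within the file.
    int getCurrentKey() {
        return lineNumber;
    }

    // Value handed to the mapper: the text of the line.
    String getCurrentValue() {
        return currentLine;
    }

    public static void main(String[] args) {
        LineNumberRecordReaderSketch reader = new LineNumberRecordReaderSketch(
                new BufferedReader(new StringReader("hello world\nfoo bar\n")));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
    }
}
```

In an actual job, the counter restarts for every input split, so the numbers are only file-wide line numbers if each file is read as a single split. One way to get that, as far as I know, is to override isSplitable() in the custom FileInputFormat so that it returns false.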