python hadoop mapreduce mrjob

Python with Hadoop project: how to build a reducer to concatenate pairs of values


I have a small MapReduce project, and since I am new to this I am running into a lot of difficulties, so I would appreciate some help. I have a file in which each line contains a nation, a year, and a weight. For each nation, I want to list every year followed by its weight. This is my data:

USA, 2019; 0.7
USA, 2020; 0.3
USA, 2021; 0.9
Canada, 2019; 0.6
Canada, 2020; 0.3

This is my mapper:

def idf_country(self, key, values):
  nation, year = key[0], key[1]
  weight = values
  yield nation, (year, weight)

This is the output I am trying to get:

USA 2019, 0.7; 2020, 0.3; 2021, 0.9
Canada  2019, 0.6; 2020, 0.3

Solution

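  • With mrjob's default input protocol, the mapper receives each raw line of the file as its value, and the key is None. You need to split the line rather than index into the key.

    def idf_country(self, _, line):
      # "USA, 2019; 0.7" -> nation "USA", data "2019; 0.7"
      nation, data = line.split(', ', 1)
      year, weight = data.split('; ')
      yield nation, f'{year}, {weight}'

    The reducer's input is already grouped by nation, so it only needs to join the values. Joining with '; ' gives the layout asked for in the question.

    def reducer(self, nation, values):
      # sort so the years come out in order; value order is not guaranteed by default
      yield nation, '; '.join(sorted(values))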
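
To check the map and reduce logic without a Hadoop cluster, the steps can be simulated in plain Python. This is only a rough sketch: it does not use mrjob, and the names `LINES`, `mapper`, `reducer`, and `run_job` are made up for illustration. The mapper splits each line into nation, year, and weight, the sort plus `groupby` stands in for the shuffle, and the reducer joins each nation's "year, weight" pairs with '; '.

```python
from itertools import groupby
from operator import itemgetter

# Sample input copied from the question.
LINES = [
    "USA, 2019; 0.7",
    "USA, 2020; 0.3",
    "USA, 2021; 0.9",
    "Canada, 2019; 0.6",
    "Canada, 2020; 0.3",
]


def mapper(line):
    # "USA, 2019; 0.7" -> ("USA", "2019, 0.7")
    nation, data = line.split(", ", 1)
    year, weight = data.split("; ")
    yield nation, f"{year}, {weight}"


def reducer(nation, values):
    # Sort so the years appear in ascending order, then join the pairs.
    yield nation, "; ".join(sorted(values))


def run_job(lines):
    # Map phase: collect every (key, value) pair the mapper emits.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Stand-in for the shuffle: sort by key, then group equal keys together.
    pairs.sort(key=itemgetter(0))
    results = {}
    for nation, group in groupby(pairs, key=itemgetter(0)):
        for key, joined in reducer(nation, (value for _, value in group)):
            results[key] = joined
    return results


if __name__ == "__main__":
    for nation, joined in run_job(LINES).items():
        print(nation, joined)
```

In a real job the same two function bodies would go into the methods of the `MRJob` subclass, and the framework would do the grouping that `run_job` imitates here.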