Tags: python, oop, hadoop, mapreduce, mrjob

How to count the number of times a word sequence appears in a file, using MapReduce in Python?


Consider a file containing words separated by spaces. Write a MapReduce program in Python that counts the number of times each 3-word sequence appears in the file.

For example, consider the following file:

one two three seven one two three
three seven one
seven one two

The number of times each 3-word sequence appears in this file is:

"three seven one" 2
"one two three" 2
"seven one two" 2
"two three seven" 1

Code format:

from mrjob.job import MRJob


class MR3Nums(MRJob):
    
    def mapper(self,_, line):
        pass

    def reducer(self,key, values):
        pass
    

if __name__ == "__main__":
    MR3Nums.run()

Solution

  • The mapper is called once per line and should yield each 3-word sequence in that line together with a count of 1.

    The reducer is called with key and values, where key is a 3-word sequence and values is an iterable of the counts emitted for it (all 1s here). The reducer can simply yield the 3-word sequence together with the total number of occurrences, obtained via sum.

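    from mrjob.job import MRJob


    class MR3Nums(MRJob):

        def mapper(self, _, line):
            # Slide a 3-word window across the line; emit each window with a count of 1.
            sequence_length = 3
            words = line.split()
            for i in range(len(words) - sequence_length + 1):
                yield " ".join(words[i:i + sequence_length]), 1

        def reducer(self, key, values):
            # values holds every count emitted for this sequence; their sum is the total.
            yield key, sum(values)


    if __name__ == "__main__":
        MR3Nums.run()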
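
To check the logic without installing mrjob or Hadoop, the same map, group, and reduce steps can be sketched in plain Python. This is only an in-memory approximation of what the framework does, and map_line and run_locally are hypothetical helper names, not part of mrjob.

```python
from collections import defaultdict


def map_line(line, sequence_length=3):
    """Emit (sequence, 1) for every run of sequence_length words in one line."""
    words = line.split()
    for i in range(len(words) - sequence_length + 1):
        yield " ".join(words[i:i + sequence_length]), 1


def run_locally(lines):
    """Simulate map, shuffle (group by key), and reduce (sum) in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    # Reduce step: one total per distinct sequence.
    return {key: sum(values) for key, values in grouped.items()}


# The sample file from the question, one string per line.
sample = [
    "one two three seven one two three",
    "three seven one",
    "seven one two",
]
counts = run_locally(sample)
for sequence, count in sorted(counts.items()):
    print(f'"{sequence}" {count}')
```

Because the mapper sees one line at a time, sequences that span a line break are not counted, which matches the expected output in the question.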