python, mapreduce, mrjob

How do you sort a key,value pair using MapReduce?


I have been messing around with MapReduce, still very new to it, and was wondering if I could get some help with a question I'm having trouble answering: I have a txt file of dates and counts and want to sort the dates in ascending order based on their respective counts. The text file looks like this:

(image of the text file: a list of dates with their respective counts)

I have looked around and found some code like this:

import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")

class MRWordFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract_words,
                combiner=self.combine_word_counts,
                reducer=self.reducer_sum_word_counts
            ),
            MRStep(
                reducer=self.reduce_sort_counts
            )
        ]

    def mapper_extract_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combine_word_counts(self, word, counts):
        yield word, sum(counts)

    def reducer_sum_word_counts(self, key, values):
        yield None, (sum(values), key)

    def reduce_sort_counts(self, _, word_counts):
        for count, key in sorted(word_counts, reverse=True):
            yield ('%020d' % int(count), key)

But this seems too complex, because as you can see from the dates txt file posted above, I already have the keys and their respective counts. So do I just need to add a second step with a reducer function that sorts the list of keys and values using "sorted(counts)"?

Thank you for your time.


Solution

  • You are correct: given your particular setup, you can certainly perform your task with a single MapReduce step.

    You can skip the initial steps in your example because you already have the count for each date (key). You can just perform the second step: group the pairs into tuples and sort them by count and date.


    Counts in Ascending Order

    import datetime
    
    from mrjob.job import MRJob
    
    
    class MRDateFrequencyCount(MRJob):
    
        def mapper(self, _, line):
            # each input line is a quoted date and its count, e.g. "2006-12-21" 1
            date, count = line.split(' ')
            yield None, (int(count), date)

        def reducer(self, _, dates):
            # sort by count (ascending), breaking ties by the parsed date
            for count, date in sorted(dates, key=lambda x: (x[0], datetime.datetime.strptime(x[1], '"%Y-%m-%d"'))):
                yield date, count
    
    
    if __name__ == '__main__':
        MRDateFrequencyCount.run()
    

    Produces Output:

    "\"2006-11-01\""    1
    "\"2006-12-21\""    1
    "\"2006-12-11\""    2
    "\"2007-03-12\""    3
    

    Counts in Descending Order

    import datetime
    
    from mrjob.job import MRJob
    
    
    class MRDateFrequencyCount(MRJob):
    
        def mapper(self, _, line):
            date, count = line.split(' ')
            yield None, (int(count), date)
    
        def reducer(self, _, dates):
            # negate the count so larger counts sort first; ties are still broken by the parsed date
            for count, date in sorted(dates, key=lambda x: (-x[0], datetime.datetime.strptime(x[1], '"%Y-%m-%d"'))):
                yield date, count
    
    
    if __name__ == '__main__':
        MRDateFrequencyCount.run()
    

    Produces Output:

    "\"2007-03-12\""    3
    "\"2006-12-11\""    2
    "\"2006-11-01\""    1
    "\"2006-12-21\""    1
    

    Note: you will need to change the strptime format string '"%Y-%m-%d"' if the data in your image is formatted differently than the sample text I tested on, shown below.
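
    For instance (a purely illustrative guess, not taken from your actual file), an unquoted, slash-separated date would need a different format string, along these lines:

    import datetime

    # illustrative only: '%m/%d/%Y' matches dates written like 12/21/2006
    datetime.datetime.strptime('12/21/2006', '%m/%d/%Y')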


    Both MRJobs were run without any mrjob config file, on a text document containing the following text:

    "2006-12-21" 1
    "2007-03-12" 3
    "2006-11-01" 1
    "2006-12-11" 2
    

    Naturally, you can change the yield in either reducer if you want to change which column (count or date) comes first or second.

    You can also use string formatting to get rid of the quotation marks (") that surround the dates in your dataset.
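
    For example, here is a minimal sketch of that tweak. It uses str.strip rather than a format string, assumes the same whitespace-separated input as the sample above, and drops the literal quotes from the strptime pattern accordingly:

    import datetime

    from mrjob.job import MRJob


    class MRDateFrequencyCount(MRJob):

        def mapper(self, _, line):
            date, count = line.split(' ')
            # strip the surrounding double quotes before emitting the date
            yield None, (int(count), date.strip('"'))

        def reducer(self, _, dates):
            # the format string no longer needs the literal quotes around the date
            for count, date in sorted(dates, key=lambda x: (x[0], datetime.datetime.strptime(x[1], '%Y-%m-%d'))):
                yield date, count


    if __name__ == '__main__':
        MRDateFrequencyCount.run()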