python hadoop mapreduce mrjob

How do I add a further reduce step in Hadoop (mrjob)?


I am having trouble understanding Hadoop. I have two files, and I first joined them. One file lists countries and the other lists the clients in each country.
Example, clients.csv:

Bertram Pearcy  ,bueno,SO
Steven Ulman  ,regular,ZA

Countries.csv

Name,Code   
Afghanistan,AF
Åland Islands,AX
Albania,AL  
…

I wrote one map/reduce job that gives me how many “good” (bueno) clients each country (ZA, SO) has, and with countries.csv I know which country each code refers to.

My code:

def steps(self): 
        # Order the operations for execution.
        return [ 
            MRStep(mapper=self.mapper 
                   ,reducer=self.reducer),            
            MRStep(mapper=self.mapper1
                   ,combiner=self.combiner_cuenta_palabras
                   ,reducer=self.reducer2
                    ),
        ]  

The result of my map/reduce is:

["South Georgia and the South Sandwich Islands"]    1
["South Sudan"] 1
["Spain"]   3

Now I would like to know which country has the most good clients.

I added one more reduce step:

    def reducer3(self, _, values):            
        yield  _, max (values)
        
    def steps(self): 
        # Order the operations for execution.
        return [ 
            MRStep(mapper=self.mapper 
                   ,reducer=self.reducer),  
            MRStep(mapper=self.mapper1
                   ,combiner=self.combiner_cuenta_palabras
                   ,reducer=self.reducer2),
            MRStep(#mapper=self.mapper3,
                   reducer=self.reducer3
                   #,reducer=self.reducer3
            ),            
        ]   

But I get the same output as without that reducer.


I tried extending my map/reduce program with another reduce step, but it does not work.

With my first reduce I got:

A, 10
C, 2
D, 5

Now I would like to use that result to get only: A, 10

Additional comment:

INPUT [File1] + [File2] => MAP/REDUCE => OUT

Now I need an additional map/reduce step (reusing what I already have) that gives me other answers.

First) One and only one answer. Example: 3 Spain

Second) All countries tied for the highest number, e.g. 3 Spain and 3 Guam.

I tried to use:

def reducer3(self, _, values):            
        yield  _, max (values)

And I added:

def steps(self): 
        # Order the operations for execution.
        return [ 
            MRStep(mapper=self.mapper 
                   ,reducer=self.reducer),  
            MRStep(mapper=self.mapper1
                   ,combiner=self.combiner_cuenta_palabras
                   ,reducer=self.reducer2),
            MRStep(reducer=self.reducer3
            ),            
        ]    

But I still get the same result. I know that reducer3 is being used, because if I write max(values)+1000 I get the same result but with the numbers 1001, 1003.


Solution

  • Your reducer receives 3 distinct keys, so it finds the max for each key separately, and values has only one element per key (try counting its elements...). Therefore, you get 3 results.

    You need a third mapper that returns (None, f'{key}|{value}'), for example; then all records are sent to one reducer, where you can iterate, parse, and aggregate the results:

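    def reducer3(self, _, values):
        _max = float('-inf')
        k_out = None
        for x in values:
            # Each value is a 'key|count' string built by the third mapper.
            k, v = x.rsplit('|', 1)
            v = int(v)  # compare numbers, not strings
            if v > _max:
                _max = v
                k_out = k
        yield k_out, _max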
    

    That returns only one result across all values. If you want to capture ties for the maximum, I think you'll need to iterate over the values more than once (copy them into a list first, since values can only be consumed once), then yield inside a loop over the elements that equal the max.
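
As a rough sketch of the whole third step, assuming the previous step emits (country, count) pairs: the functions below mirror the mrjob method signatures without self, so they run as plain Python. run_step is a hypothetical local stand-in for Hadoop's shuffle, the sample counts are made up from the question's example, and each pair is passed as a tuple rather than the 'key|value' string suggested above. Ties are collected in a single pass instead of iterating twice.

```python
from itertools import groupby


def mapper3(key, value):
    # Emit every record under one shared key (None) so that a single
    # reducer call sees all (country, count) pairs together.
    yield None, (key, value)


def reducer3(_, values):
    # Track the best count seen so far and every country that reaches it.
    best = None
    winners = []
    for country, count in values:
        if best is None or count > best:
            best, winners = count, [country]
        elif count == best:
            winners.append(country)
    # Yield one record per winner, so ties are all reported.
    for country in winners:
        yield country, best


def run_step(records, mapper, reducer):
    """Hypothetical local stand-in for the shuffle: map, group by key, reduce."""
    mapped = [pair for k, v in records for pair in mapper(k, v)]
    mapped.sort(key=lambda kv: repr(kv[0]))  # stable sort keeps input order per key
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return out


# Made-up sample resembling the output of the second step.
counts = [("Spain", 3), ("South Sudan", 1), ("Guam", 3)]
result = run_step(counts, mapper3, reducer3)
print(result)
```

In the job itself these would become methods (with self) and be wired in as the last entry returned by steps(), i.e. MRStep(mapper=self.mapper3, reducer=self.reducer3). To get one and only one answer, yield just the first winner instead of looping over all of them.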