pythonnumpymultidimensional-arrayolapolap-cube

Build a basic cube with numpy?


I was wondering if numpy could be used to build the most basic cube model where all cross-combinations and their computed value are stored.

Let's take the following example of data:

AUTHOR         BOOK          YEAR        SALES
Shakespeare    Hamlet        2000        104.2
Shakespeare    Hamlet        2001        99.0
Shakespeare    Romeo         2000        27.0
Shakespeare    Romeo         2001        19.0
Dante          Inferno       2000        11.6
Dante          Inferno       2001        12.6

And to be able to build something like:

                             YEAR                  TOTAL
AUTHOR            BOOK       2000       2001         
(ALL)             (ALL)      142.8      130.6      273.4
Shakespeare       (ALL)      131.2      118.0      249.2
Dante             (ALL)      11.6       12.6       24.2
Shakespeare       Hamlet     104.2      99.0       203.2
Shakespeare       Romeo      27.0       19.0       46.0
Dante             Inferno    11.6       12.6       24.2

I'm hoping that the usage of using something like meshgrid might get me 75% there. Basically, I'd like to see if it's possible to build a structure of all pre-computed values with numpy (not pandas) to build a structure so that I could retrieve the above result of all possible combination. For the sake of simplicity, let's only consider the SUM as the only possible calculation. Perhaps this is a roundable way of asking, but could numpy be the backbone of doing this, or do I need to use something else?

And finally, if not possible in numpy how might this be stored in a MDA?


Solution

  • I think numpy record arrays can be used for this task, below is my solution based on record arrays.

    class rec_array():
        
        def __init__(self,author=None,book=None,year=None,sales=None):
            self.dtype = [('author','<U20'), ('book','<U20'),('year','<U20'),('sales',float)]
            self.rec_array = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
            
        def add_record(self,author,book,year,sales):
            new_rec = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
            if not self.rec_array.shape == ():
                self.rec_array = np.hstack((self.rec_array,new_rec))
            else:
                self.rec_array = new_rec
        
        def get_view(self,conditions):
            """
            conditions: 
                A list of conditions, for example 
                [["author",<,"Shakespeare"],["year","<=","2000"]]
            """
            mask = np.ones(self.rec_array.shape[0]).astype(bool)
            for item in conditions:
                field,op,target = item
                field_op = "self.rec_array['%s'] %s '%s'" % (field,op,target)
                mask &= eval(field_op)
            
            selected_sales = self.rec_array['sales'][mask]
            
            return np.sum(selected_sales)
    

    Based on this rec_array, given the data

    author = 4*["Shakespeare"]+ 2*["Dante"]
    book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
    year = 3*["2000", "2001"]
    sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]
    

    we create an instance

    test = rec_array()
    test.add_record(author,book,year,sales)
    

    If, for example, you want the sales of Shakespeare's Romeo, you can simply do this

    test.get_view([["author","==","Shakespeare"],["book","==","Romeo"]])
    

    the output is 46.0

    or, you can also do

    test.get_view([["author","==","Shakespeare"],["year","<=","2000"]])
    

    the output is 131.2