I was wondering if numpy
could be used to build the most basic cube model where all cross-combinations and their computed value are stored.
Let's take the following example of data:
AUTHOR BOOK YEAR SALES
Shakespeare Hamlet 2000 104.2
Shakespeare Hamlet 2001 99.0
Shakespeare Romeo 2000 27.0
Shakespeare Romeo 2001 19.0
Dante Inferno 2000 11.6
Dante Inferno 2001 12.6
And to be able to build something like:
YEAR TOTAL
AUTHOR BOOK 2000 2001
(ALL) (ALL) 142.8 130.6 273.4
Shakespeare (ALL) 131.2 118.0 249.2
Dante (ALL) 11.6 12.6 24.2
Shakespeare Hamlet 104.2 99.0 203.2
Shakespeare Romeo 27.0 19.0 46.0
Dante Inferno 11.6 12.6 24.2
I'm hoping that the usage of using something like meshgrid
might get me 75% there. Basically, I'd like to see if it's possible to build a structure of all pre-computed values with numpy
(not pandas) to build a structure so that I could retrieve the above result of all possible combination. For the sake of simplicity, let's only consider the SUM
as the only possible calculation. Perhaps this is a roundable way of asking, but could numpy
be the backbone of doing this, or do I need to use something else?
And finally, if not possible in numpy
how might this be stored in a MDA?
I think numpy
record arrays can be used for this task, below is my solution based on record arrays.
class rec_array():
def __init__(self,author=None,book=None,year=None,sales=None):
self.dtype = [('author','<U20'), ('book','<U20'),('year','<U20'),('sales',float)]
self.rec_array = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
def add_record(self,author,book,year,sales):
new_rec = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
if not self.rec_array.shape == ():
self.rec_array = np.hstack((self.rec_array,new_rec))
else:
self.rec_array = new_rec
def get_view(self,conditions):
"""
conditions:
A list of conditions, for example
[["author",<,"Shakespeare"],["year","<=","2000"]]
"""
mask = np.ones(self.rec_array.shape[0]).astype(bool)
for item in conditions:
field,op,target = item
field_op = "self.rec_array['%s'] %s '%s'" % (field,op,target)
mask &= eval(field_op)
selected_sales = self.rec_array['sales'][mask]
return np.sum(selected_sales)
Based on this rec_array
, given the data
author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]
we create an instance
test = rec_array()
test.add_record(author,book,year,sales)
If, for example, you want the sales of Shakespeare's Romeo, you can simply do this
test.get_view([["author","==","Shakespeare"],["book","==","Romeo"]])
the output is 46.0
or, you can also do
test.get_view([["author","==","Shakespeare"],["year","<=","2000"]])
the output is 131.2