python, multithreading, signal-processing, simulation, vlsi

Systolic Array Simulation in Python


I am trying to simulate a systolic array structure for matrix multiplication in a Python environment -- everything I know about systolic arrays comes from these slides: http://web.cecs.pdx.edu/~mperkows/temp/May22/0020.Matrix-multiplication-systolic.pdf. An integral part of a systolic array is that data flow between PEs (processing elements) is concurrent with any multiplication or addition that occurs on any one node. I am having difficulty working out how to implement such a concurrent procedure in Python. Specifically, I want to find a computational approach that feeds the elements of the two matrices into the systolic array in a cascading fashion, while letting those elements propagate through the array concurrently.

I have begun writing some code in Python to multiply two 3-by-3 arrays, but ultimately I want to simulate a systolic array of any size, working with a and b matrices of any size.

import numpy as np
from threading import Thread
from collections import deque

vals_deque = deque(maxlen=9*2)  # will hold the interconnections between nodes of the systolic array
dump = deque(maxlen=9)  # will be the output of the SystolicArray
prev_size = 0

def setupSystolicArray():
    global SystolicArray
    SystolicArray = [[NodeSystolic(i, j) for j in range(3)] for i in range(3)]

def spreadInputs(a, b):
    global prev_size
    # needs some way to initially propagate the elements of a and b
    # through the top and leftmost parts of the systolic array

    # start all the nodes of the systolic array; they sit waiting for an input
    for row_of_nodes in SystolicArray:
        for node in row_of_nodes:
            node.start()

    # even if I found a way to put these inputs into the array at the start,
    # I am not sure how to coordinate future inputs into the array in the
    # cascading fashion described in the slides
    while len(dump) != 9:
        if len(vals_deque) != prev_size:
            prev_size = len(vals_deque)
            vals = vals_deque[-1]
            row = vals['t'][0]
            col = vals['l'][0]
            a = vals['t'][1]
            b = vals['l'][1]
            # these if/elif statements track whether the outputs are at the
            # edge of the systolic array and can be removed
            if row >= 3:
                dump.append(a)
            elif col >= 3:
                dump.append(b)
            else:
                # something is wrong with the logic here
                SystolicArray[row][col - 1].update(a, b)
                SystolicArray[row - 1][col].update(a, b)

class NodeSystolic:
    def __init__(self,row, col):
        self.row = row
        self.col = col
        self.currval = 0
        self.up = False
        self.ain = 0  # coming in from the top
        self.bin = 0  # coming in from the side

    def start(self):
        Thread(target=self.continuous, args=()).start()

    def continuous(self):
        while True:
            if self.up:
                self.currval = self.ain * self.bin
                self.up = False
                self.passon(self.ain, self.bin)
            else:
                pass  # busy-wait until update() raises the flag

    def update(self, left, top):
        self.up = True
        self.ain = top
        self.bin = left

    def passon(self, t, l):
        # pass the inputs this node has received on to the next nodes;
        # append the dict itself so spreadInputs can index it directly
        vals_deque.append({'t': [self.row + 1, self.ain], 'l': [self.col + 1, self.bin]})

    def returnValue(self):
        return self.currval


def main():
    a = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9],
    ])

    b = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
    ])

    setupSystolicArray()
    spreadInputs(a, b)

if __name__ == '__main__':
    main()

The above code is not operational and still has many bugs. I was hoping someone could give me pointers on how to improve it, or tell me whether there is a much simpler way to model the parallel, asynchronous behaviour of a systolic array in Python, so that with very large array sizes I won't have to worry about creating too many threads (nodes).


Solution

  • It's interesting to think about simulating a systolic array in Python, but I think there are some significant difficulties in doing this along the lines you've sketched out above.

    Most importantly, there are the issues around Python's limited scope for true parallelism, caused by the Global Interpreter Lock (GIL). This means that you won't get any significant parallelism for compute-limited tasks; Python's threads are best suited to handling I/O-limited tasks such as web requests or filesystem accesses. The nearest Python can get to true parallelism is probably the multiprocessing module, but that would require a separate process for each node.
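
    To make the per-node-process idea concrete, here is a minimal sketch with a single node running in its own process (hypothetical pe() function, not from the question's code); a real array would need one Process and a pair of Queues per node, which is exactly the overhead being warned about:

    from multiprocessing import Process, Queue

    def pe(inbox, outbox):
        # one node as its own OS process: multiply-accumulate whatever arrives
        acc = 0
        while True:
            msg = inbox.get()
            if msg is None:          # shutdown sentinel
                break
            a, b = msg
            acc += a * b
        outbox.put(acc)

    if __name__ == '__main__':
        inbox, outbox = Queue(), Queue()
        worker = Process(target=pe, args=(inbox, outbox))
        worker.start()
        for pair in [(1, 4), (2, 5), (3, 6)]:
            inbox.put(pair)
        inbox.put(None)
        print(outbox.get())          # 1*4 + 2*5 + 3*6 = 32
        worker.join()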

    Secondly, even if you were going to get parallelism in the numerical operations within your systolic array, you'd need some locking mechanism so that different threads can exchange data (or messages) without corrupting each other's memory when they read and write at the same time.
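
    For example, the update()/continuous() pair in the question's code is a data race: update() can overwrite self.ain and self.bin from one thread while continuous() is mid-way through reading them in another. A minimal sketch of guarding that handoff, assuming a per-node Lock (hypothetical GuardedNode class):

    import threading

    class GuardedNode:
        def __init__(self):
            self.lock = threading.Lock()
            self.up = False   # flag raised by update(), lowered by poll()
            self.ain = 0
            self.bin = 0

        def update(self, left, top):
            with self.lock:   # writer and reader take the same lock
                self.ain, self.bin = top, left
                self.up = True

        def poll(self):
            # return a consistent (ain, bin) snapshot, or None if no new input
            with self.lock:
                if not self.up:
                    return None
                self.up = False
                return self.ain, self.bin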

    As regards the data structures in your example, I think you might be better off having each node in the systolic array hold a reference to its upstream nodes, rather than knowing that it lies at a particular location in an NxM grid. I don't think there's any reason why a systolic array needs to be a rectangular grid, and any form of Directed Acyclic Graph (DAG) would still have the potential for efficient distributed computation.
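
    A sketch of that wiring style (hypothetical Node class): each node keeps references to the neighbours that feed it, and the rectangular grid becomes just one possible wiring of the same class:

    class Node:
        # a PE that knows its upstream neighbours, not its grid position
        def __init__(self, up=None, left=None):
            self.up = up      # node feeding 'a' values (None on the top edge)
            self.left = left  # node feeding 'b' values (None on the left edge)
            self.acc = 0      # running multiply-accumulate result

    # wiring an arbitrary DAG is just passing references:
    n00 = Node()
    n01 = Node(left=n00)
    n10 = Node(up=n00)
    n11 = Node(up=n01, left=n10)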

    Overall, I'd expect the computational overhead of doing this simulation in Python to be enormous relative to what could be achieved in lower-level languages such as Scala or C++. Even then, unless each node in the systolic array is doing a lot of computation (i.e. much more than a few multiply-adds), the cost of exchanging messages between nodes will be substantial. So, I presume your simulation is mainly meant to give an understanding of the data flows and the high-level behaviour of the array, rather than to get anywhere close to what could be provided by custom DSP (Digital Signal Processing) hardware. If that's the case, then I'd be tempted to do without the threading and use a centralized message queue, to which all nodes submit messages that are then delivered by a global message-distribution mechanism.
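
    Here is a minimal, threadless sketch in that spirit (all names hypothetical): the outer loop plays the role of the clock and of the message-distribution mechanism, delivering each link's value one hop per tick, with the inputs skewed so they enter in the cascading pattern from the slides:

    import numpy as np

    def systolic_matmul(A, B):
        # single-threaded, clocked simulation of an output-stationary
        # N x N systolic array computing C = A @ B
        N = A.shape[0]
        C = np.zeros((N, N))
        a_reg = [[None] * N for _ in range(N)]  # 'a' operand held at each PE
        b_reg = [[None] * N for _ in range(N)]  # 'b' operand held at each PE

        for cycle in range(3 * N - 2):          # enough ticks to drain the array
            # delivery phase: every operand moves one hop right/down;
            # iterating from the far corner avoids overwriting values
            # that have not been forwarded yet
            for i in reversed(range(N)):
                for j in reversed(range(N)):
                    # skewed edge feed: row i of A is delayed i ticks,
                    # column j of B is delayed j ticks (the cascading input)
                    a_reg[i][j] = (A[i, cycle - i] if j == 0 and 0 <= cycle - i < N
                                   else a_reg[i][j - 1] if j > 0 else None)
                    b_reg[i][j] = (B[cycle - j, j] if i == 0 and 0 <= cycle - j < N
                                   else b_reg[i - 1][j] if i > 0 else None)
            # compute phase: each PE does one multiply-accumulate
            for i in range(N):
                for j in range(N):
                    if a_reg[i][j] is not None and b_reg[i][j] is not None:
                        C[i, j] += a_reg[i][j] * b_reg[i][j]
        return C

    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
    print(systolic_matmul(a, a))   # matches a @ a

    Because nothing here is concurrent, the same loop scales to arbitrarily large arrays without creating any threads, which addresses the "too many threads" worry in the question.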