arraysalgorithmmaxsubmatrix

Maximum subarray of size HxW within a big 2D bit matrix


I have a big NxN bit array with K ones (everything else is zero). Coordinates of all non-zero points are known - in other words this NxN array can be represented as an array of K pairs each containing x and y coords of a non-zero point.

Given a submatrix of HxW size, I need place it on my original NxN array such that it covers the most non-zero points.

Input: Height H and width W of submatrix

Output: x and y coords of HxW subarray which has the most ones within itself

Similar question was answered before: Maximum subarray of size HxW within a 2D matrix but in my problem is a bit more complicated since N is huge, in my case: N=60000, K<15000, H,W<10000.

Creating 60000x60000 array would be a memory kill, even if it's an array of bits. That's why I came up with idea of representing that array with all non zero points: one dimensional array of K pairs.

Everything I can come up with is super both memory and time unefficient, I'm looking for any solution that won't eat all my ram. Here's how it's meant to look like: the output would be point (4,3) since HxW subarray, which starts at this point, contains the most ones.

enter image description here


Solution

  • Here's an algorithm that should be O(k2*h) (it could potentially be optimised to O(k*h*w)) and is quite light on space requirements O(k). It works on the theory that any submatrix that has the highest non-zero sum must have a point on its left edge (otherwise, there might be a submatrix with a higher sum to the right of this one). So to find the highest sum, we iterate over each of the non-zero points and find all the submatrixes which have that point on their left edge, summing all the non-zero points within W to the right of the current point for each row in the submatrix.

    Below is a python implementation of that algorithm. It first creates a dictionary of the points in each row, then iterates over each point as described, storing the sum of non-zero points to the right in that row, and then computing the sums for each submatrix based around that point. If the sum is greater than the current maximum, the value and its location are stored. Note this uses 0-indexed lists, so for your sample data the maximum is at (2, 3).

    from collections import defaultdict
    
    def max_subarray(n, nzp, h, w):
        maxsum = 0
        maxloc = (0, 0)
        # create a dictionary of points in a row
        nzpd = defaultdict(list)
        for p in nzp:
            nzpd[p[0]].append(p[1])
        # iterate over each of the non-zero points, looking at all
        # submatrixes that have the point on the left side
        for p in nzp:
            y, x = p
            pointsright = [0] * n
            for r in range(max(y-(h-1), 0), min(y+h, n)):
                # points within w to the right of this column on this row
                pointsright[r] = len([p for p in nzpd[r] if x <= p <= x+(w-1)])
            # compute the sums for each of the possible submatrixes
            for i in range(-h+1, h):
                thissum = sum(pointsright[max(y+i, 0):min(y+i+h, n)])
                if thissum > maxsum:
                    maxsum = thissum
                    maxloc = (y, x)
        # adjust the position in case the submatrix would extend beyond the last row/column
        maxloc = (min(n-h, maxloc[0]), min(n-w, maxloc[1]))
        # print the max sum
        print(f'{maxsum} found at location {maxloc}')
    

    Sample usage:

    nzp = [(0, 6), (1, 9), (2, 3), (2, 4), (2, 5), 
           (3, 1), (3, 4), (3, 6), (4, 3), (4, 3), 
           (4, 10), (5, 5), (6, 4), (6, 8), (7, 5), 
           (8, 3), (10, 2), (10, 8), (11, 4), (11, 10)
           ]
      
    max_subarray(12, nzp, 2, 4)
    

    Output:

    5 found at location (2, 3)
    

    Demo on rextester