Given boolean matrix M, I need to find a set of submatrices A = {A1, ..., An} such that matrices in A contain all True values in matrix M and only them. Submatrices don't have to be continuous, i.e. each submatrix is defined by the two sets of indices {i1, ..., ik}, {j1, ..., jt} of M. (For example submatrix could be something like [{1, 2, 5}, {4, 7, 9, 13}] and it is all cells in intersection of these rows and columns.) Optionally submatrices can intersect if this results in better solution. The total number of submatrices n should be minimal.
Size of the matrix M can be up to 10^4 x 10^4, so I need an effective algorithm. I suppose that this problem may not have an effective exact algorithm, because it reminds me some NP-hard problems. If this is true, then any good and fast approximation is OK. We can also suggest that the amount of true values is not very big, i.e. < 1/10 of all values, but to not have accidental DOS in prod, the solution not using this fact is better.
I don't need any code, just a general idea of the algorithm and justification of its properties, if it's not obvious.
Background
We are calculating some expensive distance matrices for logistic applications. Points in these requests are often intersecting, so we are trying do develop some caching algorithm to not calculate parts of some requests. And to split big requests into smaller ones with only unknown submatrices. Additionally some distances in the matrix may be not needed for the algorithm. On the one hand the small amount of big groups calculates faster, on the other hand if we include a lot of "False" values, and our submatrices are unreasonably big, this can slow down the calculation. The exact criterion is intricate and the time complexity of "expensive" matrix requests is hard to estimate. As far as I know for square matrices it is something like C*n^2.5 with quite big C. So it's hard to formulate a good optimization criterion, but any ideas are welcome.
About data
True value in matrix means that the distance between these two points have never been calculated before. Most of the requests (but not all) are square matrices with the same points on both axes. So most of the M is expected to be almost symmetric. And also there is a simple case of several completely new points and the other distances are cached. I deal with this cases on preprocessing stage. All the other values can be quite random. If they are too random we can give up cache and calculate the full matrix M. But sometimes there are useful patterns. I think that because of the nature of the data it is expected to contain more big sumbatrices then random data. Mostly True values are occasional, but form submatrix patterns, that we need to find. But we cannot rely on this completely, because if algorithm gets too random matrix it should be able to at least detect it to not have too long and complex calculations.
Update
As stated in wikipedia this problem is called Bipartite Dimension of a graph and is known to be NP-hard. So we can reformulate it info finding fast relaxed approximations for the simple cases of the problem. We can allow some percentage of false values and we can adapt some simple, but mostly effective greedy heuristic.
I started working on the algorithm below before you provided the update.
Also, in doing so I realised that while one is looking for blocks of true values, the problem is not one of a block transformation, as you have also now updated.
The algorithm is as as follows:
This algorithm will produce a complete cover of the boolean matrix consisting of row-column intersection sub-matrices containing only true values.
I am not sure if allowing some falses in a sub-matrix will help. While it will allow bigger sub-matrices to be found and hence reduce the number of passes of the boolean matrix to find a cover, it will presumably take longer to find the biggest such sub-matrices because there will be more combinations to check. Also, I am not sure how one might stop falsey sub-matrices from overlapping. It might need the maintenance of a separate mask matrix rather than using the boolean matrix as its own mask, in order to ensure disjoint sub-matrices.
Below is a first cut implementation of the above algorithm in python.
I ran it on Windows 10 on a Intel Pentium N3700 @ 1.60Ghz with 4GB RAM As is, it will do, with randomly generated ~10% trues:
I have not tested it on approximately symmetric matrices, nor have I tested it on matrices with relatively large sub-matrices. It might perform well with relatively large sub-martrices, eg, in the extreme case, ie, the entire boolean matrix is true, only two passes of the algorithm loop are required.
One area I think there can be considerable optimisation is in the row sorting. The implementation below uses the in-built phython sort with a comparator function. A custom crafted sort function will probably do much better, and possibly especially so if it is a virtual sort similar to the column sorting.
If you can try it on some real data, ie, square, approximately symmetric matrix, with relatively large sub-matrices, it would be good to know how it goes.
Please advise if you would like to me to try some optimisation of the python. I presume to handle 10^4 x 10^4 boolean matrices it will need to be a lot faster.
from functools import cmp_to_key
booleanMatrix0 = [
( 0, 0, 0, 0, 1, 1 ),
( 0, 1, 1, 0, 1, 1 ),
( 0, 1, 0, 1, 0, 1 ),
( 1, 1, 1, 0, 0, 0 ),
( 0, 1, 1, 1, 0, 0 ),
( 1, 1, 0, 1, 0, 0 ),
( 0, 0, 0, 0, 0, 0 )
]
booleanMatrix1 = [
( 0, )
]
booleanMatrix2 = [
( 1, )
]
booleanMatrix3 = [
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0 )
]
booleanMatrix4 = [
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1 )
]
booleanMatrix14 = [
( 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0 ),
( 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 ),
( 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 ),
( 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0 ),
( 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1 ),
( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1 ),
( 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1 ),
( 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1 ),
( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1 ),
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ),
( 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 )
]
booleanMatrix15 = [
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ),
( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ),
]
booleanMatrix16 = [
( 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1 ),
( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1 ),
]
import random
booleanMatrix17 = [
]
for r in range(11):
row = []
for c in range(21):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix17.append(tuple(row))
booleanMatrix18 = [
]
for r in range(21):
row = []
for c in range(11):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix18.append(tuple(row))
booleanMatrix5 = [
]
for r in range(50):
row = []
for c in range(200):
row.append(random.randrange(2))
booleanMatrix5.append(tuple(row))
booleanMatrix6 = [
]
for r in range(200):
row = []
for c in range(50):
row.append(random.randrange(2))
booleanMatrix6.append(tuple(row))
booleanMatrix7 = [
]
for r in range(100):
row = []
for c in range(100):
row.append(random.randrange(2))
booleanMatrix7.append(tuple(row))
booleanMatrix8 = [
]
for r in range(100):
row = []
for c in range(1000):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix8.append(tuple(row))
booleanMatrix9 = [
]
for r in range(1000):
row = []
for c in range(100):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix9.append(tuple(row))
booleanMatrix10 = [
]
for r in range(317):
row = []
for c in range(316):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix10.append(tuple(row))
booleanMatrix11 = [
]
for r in range(3162):
row = []
for c in range(316):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix11.append(tuple(row))
booleanMatrix12 = [
]
for r in range(316):
row = []
for c in range(3162):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix12.append(tuple(row))
booleanMatrix13 = [
]
for r in range(1000):
row = []
for c in range(1000):
if random.randrange(5) == 1:
row.append(random.randrange(2))
else:
row.append(0)
booleanMatrix13.append(tuple(row))
booleanMatrices = [ booleanMatrix0, booleanMatrix1, booleanMatrix2, booleanMatrix3, booleanMatrix4, booleanMatrix14, booleanMatrix15, booleanMatrix16, booleanMatrix17, booleanMatrix18, booleanMatrix6, booleanMatrix5, booleanMatrix7, booleanMatrix8, booleanMatrix9, booleanMatrix10, booleanMatrix11, booleanMatrix12, booleanMatrix13 ]
def printMatrix(matrix, colOrder):
for r in range(rows):
row = ""
for c in range(cols):
row += str(matrix[r][0][colOrder[c]])
print(row)
print()
def rowUp(matrix):
rowCount = []
maxRow = [ 0, 0 ]
for r in range(rows):
rowCount.append([ r, sum(matrix[r][0]) ])
if rowCount[-1][1] > maxRow[1]:
maxRow = rowCount[-1]
return rowCount, maxRow
def colSort(matrix):
# For a row with the highest number of trues, sort the true columns to the left
newColOrder = []
otherCols = []
for c in range(cols):
if matrix[maxRow[0]][0][colOrder[c]]:
newColOrder.append(colOrder[c])
else:
otherCols.append(colOrder[c])
newColOrder += otherCols
return newColOrder
def sorter(a, b):
# Sort rows according to leading trues
length = len(a)
c = 0
while c < length:
if a[0][colOrder[c]] == 1 and b[0][colOrder[c]] == 0:
return -1
if b[0][colOrder[c]] == 1 and a[0][colOrder[c]] == 0:
return 1
c += 1
return 0
def allTrues(rdx, cdx, matrix):
count = 0
for r in range(rdx+1):
for c in range(cdx+1):
if matrix[r][0][colOrder[c]]:
count += 1
else:
return
return rdx, cdx, count
def getBiggestField(matrix):
# Starting at (0, 0) find biggest rectangular field of 1s
biggestField = (None, None, 0)
cStop = cols
for r in range(rows):
for c in range(cStop):
rtn = allTrues(r, c, matrix)
if rtn:
if rtn[2] > biggestField[2]:
biggestField = rtn
else:
cStop = c
break;
if cStop == 0:
break
return biggestField
def mask(matrix):
maskMatrix = []
for r in range(rows):
row = []
for c in range(cols):
row.append(matrix[r][0][c])
maskMatrix.append([ row, matrix[r][1] ])
maskRows = []
for r in range(biggestField[0]+1):
maskRows.append(maskMatrix[r][1])
for c in range(biggestField[1]+1):
maskMatrix[r][0][colOrder[c]] = 0
maskCols= []
for c in range(biggestField[1]+1):
maskCols.append(colOrder[c])
return maskMatrix, maskRows, maskCols
# Add a row id to each row to keep track of rearranged rows
rowIdedMatrices = []
for matrix in booleanMatrices:
rowIdedMatrix = []
for r in range(len(matrix)):
rowIdedMatrix.append((matrix[r], r))
rowIdedMatrices.append(rowIdedMatrix)
import time
for matrix in rowIdedMatrices:
rows = len(matrix)
cols = len(matrix[0][0])
colOrder = []
for c in range(cols):
colOrder.append(c)
subMatrices = []
startTime = time.thread_time()
loopStart = time.thread_time()
loop = 1
rowCount, maxRow = rowUp(matrix)
ones = 0
for row in rowCount:
ones += row[1]
print( "_________________________\n", "Rows", rows, "Columns", cols, "Ones", str(int(ones * 10000 / rows / cols) / 100) +"%")
colOrder = colSort(matrix)
matrix.sort(key=cmp_to_key(sorter))
biggestField = getBiggestField(matrix)
if biggestField[2] > 0:
maskMatrix, maskRows, maskCols = mask(matrix)
subMatrices.append(( maskRows, maskCols ))
while biggestField[2] > 0:
loop += 1
rowCount, maxRow = rowUp(maskMatrix)
colOrder = colSort(maskMatrix)
maskMatrix.sort(key=cmp_to_key(sorter))
biggestField = getBiggestField(maskMatrix)
if biggestField[2] > 0:
maskMatrix, maskRows, maskCols = mask(maskMatrix)
subMatrices.append(( maskRows, maskCols) )
if loop % 100 == 0:
print(loop, time.thread_time() - loopStart)
loopStart = time.thread_time()
endTime = time.thread_time()
print("Sub-matrices:", len(subMatrices), endTime - startTime)
for sm in subMatrices:
print(sm)
print()
input("Next matrix")