algorithm

Find largest subset of columns that has two equal rows


I have a binary matrix where every column has at least two ones. How can I find the largest subset of the columns so that the submatrix induced by those columns has a row that occurs at least twice? I want to do this quickly.

The subset of columns could be of size one or it could be the whole matrix or anything in between.

The problem is that there are exponentially many subsets of the columns so I really don't want to check every one of them separately.

For example:

[[0 1 1 0 1 1 1]                                                                                 
 [1 0 0 0 0 0 1]                                                                                 
 [1 1 0 1 0 0 0]
 [1 0 1 0 1 0 0]
 [0 0 0 1 0 1 1]
 [1 1 0 0 1 0 1]
 [0 1 0 1 0 0 0]]

If we take columns 0, 2, 3, 6 we get:

[[0 1 0 1]                                                                               
 [1 0 0 1]                                                                               
 [1 0 1 0]
 [1 1 0 0]
 [0 0 1 1]
 [1 0 0 1]
 [0 0 1 0]]

and we see row [1 0 0 1] appears twice.


Solution

  • For a N by M matrix, you can do slightly better than O(N*M^2) time. Let A be your input matrix,x^T denote the transpose of the matrix x, x[i,j] denotes the element in the ith row and jth column of x, and let [1] be shorthand for a NxM matrix of all 1s.

    Take B = 2*A - [1]. Compute the matrix product C = B * B^T. This will give you a NxN matrix where the C[i,j] corresponds to the number of matching elements minus the number of non-matching elements between rows i and j of A. The counts of just matching elements can be obtained as (C + M*[1]) / 2, but we don't actually need that; we only care which is largest which we can find with just C.

    You can use fast matrix multiplication such as the Strassen Algorithm to multiply matrices in sub-cubic time, and then iterate over C to find the largest off-diagonal element C[i,j]. You'll want to skip the main diagonal since that will be just Ms as each row perfectly matches itself. You can also skip either the upper or lower triangle of the matrix, since its symmetric (you can also avoid computing these in the multiplication altogether, for a constant factor speedup).

    Then take rows i and j of A where i != j and C[i,j] is maximal, and compute the indices of the matching columns. Overall, the time is dominated by the matrix multiplication.