arraysalgorithmpattern-matchingnpclique

How to find pattern groups in boolean array?


Given a 2D array of Boolean values I want to find all patterns that consist of at least 2 columns and at least 2 rows. The problem is somewhat close to finding cliques in a graph.

In the example below green cells represent "true" bits, greys are "false". Pattern 1 contains cols 1,3,4 and 5 and rows 1 and 2. Pattern 2 contains only columns 2 and 4, and rows 2,3,4.

example

Business idea behind this is finding similarity patterns among various groups of social network users. In real world number of rows can go up to 3E7, and the number of columns up to 300.

Can't really figure out a solution other than brute force matching.

Please advice the proper name of the problem, so I could read more, or advice an elegant solution.


Solution

  • This is (equivalent to) asking for all bicliques (complete bipartite subgraphs) larger than a certain size in a bipartite graph. Here the rows are the vertices of one part A of the graph, and the columns are the vertices of the other part B, and there is an edge between u \in A and v \in B whenever the cell at row u, column v is green.

    Although you say that you want to find all patterns, you probably only want to find only maximal ones -- that is, patterns that cannot be extended to become larger patterns by adding more rows or columns. (Otherwise, for any pattern with c >= 2 columns and r >= 3 rows, you will also get back the more than 2^(c-2)*2^(r-3) non-maximal patterns that can be formed by deleting some of the rows or columns.)

    But even listing just the maximal patterns can take time exponential in the number of rows and columns, assuming that P != NP. That's because the problem of finding a maximum (i.e. largest-possible) pattern, in terms of the total number of green cells, has been proven to be NP-complete: if it were possible to list all maximal patterns in polynomial time, then we could simply do so, and pick the largest, thereby solving this NP-complete problem in polynomial time.