How to find pattern groups in boolean array?
Asked Answered
O

1

7

Given a 2D array of Boolean values I want to find all patterns that consist of at least 2 columns and at least 2 rows. The problem is somewhat close to finding cliques in a graph.

In the example below green cells represent "true" bits, greys are "false". Pattern 1 contains cols 1,3,4 and 5 and rows 1 and 2. Pattern 2 contains only columns 2 and 4, and rows 2,3,4.

example

Business idea behind this is finding similarity patterns among various groups of social network users. In real world number of rows can go up to 3E7, and the number of columns up to 300.

Can't really figure out a solution other than brute force matching.

Please advice the proper name of the problem, so I could read more, or advice an elegant solution.

Oneway answered 25/4, 2015 at 17:46 Comment(8)
In case the array is symmetric, and all of the diagonal entries are true, you have to find symmetric all-true sub-arrays which correspond to the cliques in a graph, among others. Since determining the largest clique (or largest independent set in the complement) is NP-hard, so is this problem. That doesn't mean it can't be done in practice, but it does suggest that you should provide more information about the specifics of the arrays instead of hoping for a fast, general algorithm.Corpsman
What is exactly a pattern? It is still not clear to me.Larimor
@DouglasZare, thank you for the comment. Array is not symmetric. Row represents a user, and the bits are on for independent properties that varies from research to research. I.e. b1 "Has higher education", b2 "Liked page X", b3 :liked page Y" etc. JuanLopes by a "pattern" I mean set of "on" columns, more than 2, that apply to more than 2 users.Oneway
Yes, I understand that your array is not symmetric. However, if there were a fast, general algorithm, it would apply to symmetric matrices, too, so it would be able to solve an NP-hard problem quickly. This is a big hint that you won't find a fast, general algorithm, and you need to use properties of the particular matrices you are dealing with. For example, maybe you know that each row contains only a few true values, or all but a few. Something like that might be used by an algorithm that would be fast for your problem while not having to solve an NP-hard problem quickly.Corpsman
What's the problem with computing all the intersections between each row/column and each other row/column O(n^2) (this is the set of possible patterns, you can reduce it by removing patterns of size less then 3) and then just brute force it? Overall O(n^3) at most.Mckeever
@AlexandruBarbarosie: It's much worse than O(n^3) unfortunately, because you have to consider all subsets of rows (and of columns). Adding some row to a solution can reduce the set of columns that it contains, and vice versa, so greedy doesn't work.Prodigal
@j_random_hacker, its proven to be O(n^2).Oneway
@Serge: I only skimmed that paper, but they talk about an algorithm with quadratic delay -- that means it takes O(n^2) time to produce the next maximal pattern, not all of them (I'm certain that there are graph families that have asymptotically more than O(n^2) bicliques, which if true means that O(n^2) time to generate them all is impossible).Prodigal
P
4

This is (equivalent to) asking for all bicliques (complete bipartite subgraphs) larger than a certain size in a bipartite graph. Here the rows are the vertices of one part A of the graph, and the columns are the vertices of the other part B, and there is an edge between u \in A and v \in B whenever the cell at row u, column v is green.

Although you say that you want to find all patterns, you probably only want to find only maximal ones -- that is, patterns that cannot be extended to become larger patterns by adding more rows or columns. (Otherwise, for any pattern with c >= 2 columns and r >= 3 rows, you will also get back the more than 2^(c-2)*2^(r-3) non-maximal patterns that can be formed by deleting some of the rows or columns.)

But even listing just the maximal patterns can take time exponential in the number of rows and columns, assuming that P != NP. That's because the problem of finding a maximum (i.e. largest-possible) pattern, in terms of the total number of green cells, has been proven to be NP-complete: if it were possible to list all maximal patterns in polynomial time, then we could simply do so, and pick the largest, thereby solving this NP-complete problem in polynomial time.

Prodigal answered 25/4, 2015 at 19:26 Comment(1)

© 2022 - 2024 — McMap. All rights reserved.