As Evgeny notes in the comments, this is known as the maximum leaf spanning tree problem. I've linked to the Wikipedia article on the very closely related connected dominating set problem, which is the problem of finding a minimum set of vertices that (i) induce a connected subgraph (ii) satisfy the proposition that, for all other vertices v, some vertex in the set is adjacent to v. The two problems shown to be solution-equivalent by observing that, given a spanning tree, we can construct a connected dominating set by dropping the leaves (vertices with exactly one connection), and given a connected dominating set, we can extract a spanning tree of the induced subgraph and attaching the other vertices as leaves.
Unfortunately, both problems are NP-hard, and they stay NP-hard under a restriction to planar graphs. I'm not familiar with the literature on connected dominating set in particular, but my guess is that there are three angles.
- Provably "fast" exponential-time exact algorithms / approximation algorithms.
- Exact algorithms that are not provably fast (e.g., integer programming) but good in practice.
- Heuristics.
#1 may look like a strange grouping, but what tends to happen in the planar graph literature is that the exact algorithms get used as a subroutine inside the approximation algorithms, often via a technique due to Brenda Baker known as shifting. One of the properties of planar graphs is that a parameter called treewidth is bounded by O(sqrt(n)) instead of n, and there are dynamic programs whose running time exponent is a function of the much smaller treewidth. (E.g., on grids, you can run a DP row by row. The tree-decomposition machinery generalizes this to arbitrary planar graphs.)
It's hard to advise you on the best course without knowing what the instances look like and maybe even without experimenting with them. I'd probably go with door #2, but I'm not sure what a good formulation would look like. The good news is that most of the algorithmic complexity is abstracted into the solver library that you'll be using. Here's a formulation of unknown quality.
For all vertices v
, let x_v
be 1
if v
is a non-leaf and 0
if v
is a leaf. The dominating set part is easy.
minimize sum_v x_v
subject to
for all v, sum_{w such that w = v or w ~ v} x_w >= 1
for all v, x_v in {0, 1}
Here I'm using ~
to mean "is adjacent to". Enforcing the connectivity constraint is trickier. The simplest approach that I can think of is to solve the integer program as is, then look for two vertices s
and t
that are both chosen but not connected in the solution, compute a minimum vertex separator U
between s
and t
among separators not including a chosen vertex, enter a constraint
(1 - x_s) + (1 - x_t) + sum_{v in U} x_v >= 1
and then try again.
I'd be more hopeful about an approach that uses exponentially many variables, but it may be significantly harder to implement (row and column generation). Choose a vertex r
that will be forced as a non-leaf (guess or try all possibilities). There is one variable y_P
for each simple path P
with r
as an endpoint.
minimize sum_v x_v
subject to
for all v, for all P having v as an interior vertex,
x_v >= y_P
for all v, sum_{P having v as an endpoint} y_P >= 1
for all v, x_v in {0, 1}
for all P, y_P in {0, 1}