At the time I write this, the exact criterion we are trying to optimise hasn't been settled on. But whatever criterion is eventually decided on, the following heuristic (i.e. generally suboptimal) approach might be useful:
Consider only a small number of "layouts" for combining a small number of rectangles into a single larger rectangle. Then recursively look at ways of combining these new rectangles into even larger rectangles, using just the same few layouts, until only a single rectangle remains.
This doesn't guarantee an optimal solution, because some subsets of photos are constrained to form subrectangles within the final solution. But it seems likely that it will give reasonable-quality solutions.
For example, for 3 rectangles A, B and C, consider the following 4 layouts:
A
B
C
ABC
AB (i.e. A appears on the left)
AC
AA (i.e. A appears on the top)
BC
Cropping will only occur in the first round, when we are combining groups of 3 photos. For this step, every subset of 3 photos should be considered under each of the 4 layouts above, and the optimal scaling and cropping determined for each, bearing in mind that the resulting rectangle could be scaled up or down by later steps. (A good choice of optimality criterion should have the property that the ideal amount of scaling and cropping for each of A, B and C under a particular layout is not affected by how much the resulting rectangle is scaled, so this shouldn't be a problem.)
Subsequent combining rounds will behave similarly, but without considering any cropping. For a full solution, round 2 would involve trying to combine all sets of 3 rectangles produced by round 1 in which all 9 photos are distinct, however following this approach will lead to an exponential blowup. It should suffice to keep just the best few arrangements for each subset of photos. Note that it's important that each photo appear in at least 1 of the rectangles produced by each round.