I have a cost optimization request that I don't know how if there is literature on. It is a bit hard to explain, so I apologize in advance for the length of the question.
There is a server I am accessing that works this way:
- a request is made on records (r1, ...rn) and fields (f1, ...fp)
- you can only request the Cartesian product (r1, ..., rp) x (f1,...fp)
- The cost (time and money) associated with a such a request is affine in the size of the request:
T((r1, ..., rn)x(f1, ..., fp) = a + b * n * p
Without loss of generality (just by normalizing), we can assume that b=1
so the cost is:
T((r1, ...,rn)x(f1,...fp)) = a + n * p
- I need only to request a subset of pairs
(r1, f(r1)), ... (rk, f(rk))
, a request which comes from the users. My program acts as a middleman between the user and the server (which is external). I have a lot of requests like this that come in (tens of thousands a day).
Graphically, we can think of it as an n x p sparse matrix, for which I want to cover the nonzero values with a rectangular submatrix:
r1 r2 r3 ... rp ------ ___ f1 |x x| |x| f2 |x | --- ------ f3 .. ______ fn |x x| ------
Having:
- the number of submatrices being kept reasonable because of the constant cost
- all the 'x' must lie within a submatrix
- the total area covered must not be too large because of the linear cost
I will name g the sparseness coefficient of my problem (number of needed pairs over total possible pairs, g = k / (n * p)
. I know the coefficient a
.
There are some obvious observations:
- if a is small, the best solution is to request each (record, field) pair independently, and the total cost is:
k * (a + 1) = g * n * p * (a + 1)
- if a is large, the best solution is to request the whole Cartesian product, and the total cost is :
a + n * p
- the second solution is better as soon as
g > g_min = 1/ (a+1) * (1 + 1 / (n * p))
- of course the orders in the Cartesian products are unimportant, so I can transpose the rows and the columns of my matrix to make it more easily coverable, for example:
f1 f2 f3 r1 x x r2 x r3 x x
can be reordered as
f1 f3 f2 r1 x x r3 x x r2 x
And there is an optimal solution which is to request (f1,f3) x (r1,r3) + (f2) x (r2)
- Trying all the solutions and looking for the lower cost is not an option, because the combinatorics explode:
for each permutation on rows: (n!) for each permutation on columns: (p!) for each possible covering of the n x p matrix: (time unknown, but large...) compute cost of the covering
so I am looking for an approximate solution. I already have some kind of greedy algorithm that finds a covering given a matrix (it begins with unitary cells, then merges them if the proportion of empty cell in the merge is below some threshold).
To put some numbers in minds, my n is somewhere between 1 and 1000, and my p somewhere between 1 and 200. The coverage pattern is really 'blocky', because the records come in classes for which the fields asked are similar. Unfortunately I can't access the class of a record...
Question 1: Has someone an idea, a clever simplification, or a reference for a paper that could be useful? As I have a lot of requests, an algorithm that works well on average is what I am looking for (but I can't afford it to work very poorly on some extreme case, for example requesting the whole matrix when n and p are large, and the request is indeed quite sparse).
Question 2: In fact, the problem is even more complicated: the cost is in fact more like the form: a + n * (p^b) + c * n' * p'
, where b is a constant < 1 (once a record is asked for a field, it is not too costly to ask for other fields) and n' * p' = n * p * (1 - g)
is the number of cells I don't want to request (because they are invalid, and there is an additional cost in requesting invalid things). I can't even dream of a rapid solution to this problem, but still... an idea anyone?