I'm so far out of my league on this one, so I'm hoping someone can point me in the right direction. I think this is an optimization problem, but I have been confused by scipy.optimize and how it fits with pulp. Also, matrix math boggles my mind. Therefore this problem has really been slowing me down without to ask.
Problem Statement: I have a dataset of customers. For each customer, I can make a choice of 3 options, or choose none. So 4 options. Also for each customer, I have a numeric score that says how "good" each choice is. You can imagine this value as the probability of the Choice to create a future sale
.
# fake data for the internet
data = {'customerid':[101,102,103,104,105,106,107,108,109,110],
'prob_CHOICEA':[0.00317,0.00629,0.00242,0.00253,0.00421,0.00414,0.00739,0.00549,0.00658,0.00852],
'prob_CHOICEB':[0.061,0.087,0.055,0.027,0.022,0.094,0.099,0.072,0.018,0.052],
'prob_CHOICEC':[0.024,0.013,0.091,0.047,0.071,0.077,0.067,0.046,0.077,0.044]
}
# Creates pandas DataFrame
df = pd.DataFrame(data)
df = df.reset_index(drop=True).set_index(['customerid'])
+------------+--------------+--------------+--------------+
| customerid | prob_CHOICEA | prob_CHOICEB | prob_CHOICEC |
+------------+--------------+--------------+--------------+
| 101 | 0.00317 | 0.061 | 0.024 |
| 102 | 0.00629 | 0.087 | 0.013 |
| 103 | 0.00242 | 0.055 | 0.091 |
| 104 | 0.00253 | 0.027 | 0.047 |
| 105 | 0.00421 | 0.022 | 0.071 |
| 106 | 0.00414 | 0.094 | 0.077 |
| 107 | 0.00739 | 0.099 | 0.067 |
| 108 | 0.00549 | 0.072 | 0.046 |
| 109 | 0.00658 | 0.018 | 0.077 |
| 110 | 0.00852 | 0.052 | 0.044 |
+------------+--------------+--------------+--------------+
I started by combining these elements into a single array for each customer:
# combine all values into 1 array
list_to_combine = ['prob_CHOICEA', 'prob_CHOICEB','prob_CHOICEC']
df['probs_A_B_C']= df[list_to_combine].values.tolist()
df.drop(list_to_combine, axis=1, inplace=True)
+------------+-------------------------+
| customerid | probs_A_B_C |
+------------+-------------------------+
| 101 | [0.00317, 0.061, 0.024] |
| 102 | [0.00629, 0.087, 0.013] |
| 103 | [0.00242, 0.055, 0.091] |
| 104 | [0.00253, 0.027, 0.047] |
| 105 | [0.00421, 0.022, 0.071] |
| 106 | [0.00414, 0.094, 0.077] |
| 107 | [0.00739, 0.099, 0.067] |
| 108 | [0.00549, 0.072, 0.046] |
| 109 | [0.00658, 0.018, 0.077] |
| 110 | [0.00852, 0.052, 0.044] |
+------------+-------------------------+
For each customer, I only have 4 choices of what to do:
choices = [
[0,0,0],
[1,0,0],
[0,1,0],
[0,0,1]
]
For each customer, I want to choose the best choice for each customer. At first glance this was easy - just pick the highest number. However it starts to blow my mind once I start adding constraints.
For example, what if I want to choose the best choice for each customer, but with the constraint that the sum of selected choices is = 5
+------------+-------------------------+-------------+
| customerid | probs_A_B_C | best_choice |
+------------+-------------------------+-------------+
| 101 | [0.00317, 0.061, 0.024] | [0,0,0] |
| 102 | [0.00629, 0.087, 0.013] | [0,1,0] |
| 103 | [0.00242, 0.055, 0.091] | [0,0,1] |
| 104 | [0.00253, 0.027, 0.047] | [0,0,0] |
| 105 | [0.00421, 0.022, 0.071] | [0,0,0] |
| 106 | [0.00414, 0.094, 0.077] | [0,1,0] |
| 107 | [0.00739, 0.099, 0.067] | [0,1,0] |
| 108 | [0.00549, 0.072, 0.046] | [0,0,0] |
| 109 | [0.00658, 0.018, 0.077] | [0,0,1] |
| 110 | [0.00852, 0.052, 0.044] | [0,0,0] |
+------------+-------------------------+-------------+
I did not even figure out how to do this, I just eye-balled it manually for illustrative purposes.
Ideally, I'd like to add multiple constraints simultaneously:
- Total sum of best_choice = N
- Total sum of CHOICEA (first element of best_choice) >= M
- Total sum of CHOICEB (second element of best_choice) <= 10
Any ideas of where to start?