Finding all combinations based on multiple conditions for a large list
Asked Answered
H

2

17

I am trying to calculate the optimal team for a Fantasy Cycling game. I have a csv-file containing 176 cyclist, their teams, the amount of points they have scored and the price to put them in my team. I am trying to find the highest scoring team of 16 cyclists.

The rules that apply to the composition of any team are:

  • The total cost of the team can't exceed 100.
  • No more than 4 cyclists from the same team can be in a fantasy team.

A short excerpt of my csv-file can be found below.

THOMAS Geraint,Team INEOS,142,13
SAGAN Peter,BORA - hansgrohe,522,11.5
GROENEWEGEN Dylan,Team Jumbo-Visma,205,11
FUGLSANG Jakob,Astana Pro Team,46,10
BERNAL Egan,Team INEOS,110,10
BARDET Romain,AG2R La Mondiale,21,9.5
QUINTANA Nairo,Movistar Team,58,9.5
YATES Adam,Mitchelton-Scott,40,9.5
VIVIANI Elia,Deceuninck - Quick Step,273,9.5
YATES Simon,Mitchelton-Scott,13,9
EWAN Caleb,Lotto Soudal,13,9

The simplest way of solving this problem would be to generate a list of all possible combinations of teams, then apply the rules to exclude teams that do not comply with the aforementioned rules. After this it is simple to calculate the total score for each team and pick the highest scoring one. In theory, generating the list of usable teams can be achieved by the code below.

database_csv = pd.read_csv('renner_db_example.csv')
renners = database_csv.to_dict(orient='records')

budget = 100
max_same_team = 4
team_total = 16

combos = itertools.combinations(renners,team_total)
usable_combos = []

for i in combos:
    if sum(persoon["Waarde"] for persoon in i)  < budget and all(z <= max_same_team for z in [len(list(group)) for key, group in groupby([persoon["Ploeg"] for persoon in i])]) == True:   
    usable_combos.append(i)    

However, calculating all combinations of a list of 176 cyclists into teams of 16 is something that is just too many calculations for my computer to handle, even though the answer to this question implies something else. I have done some research and could not find any way to apply the aforementioned conditions without having to iterate through every option.

Is there a way to find the optimal team, either by making the above approach work or by using an alternative approach?

Edit: In text, the full CSV-file can be found here

Halftone answered 17/7, 2019 at 14:36 Comment(7)
This is probably a fundamentally hard problem. It looks like a variant of the knapsack problem: en.wikipedia.org/wiki/Knapsack_problemPinckney
I have added a link to the full CSV file.Halftone
I believe in order to generate all combinations first, and then prune the teams that are invalid, you would first have to generate 20062118235172477959495 combinations (n choose k where n is 176 and k is 16). This value represents the number of possible combinations in which 16 elements are chosen out of a possible 176.Douzepers
Looks like a discrete optimization problem. Maybe you find some inspiration here or here.Derickderide
The purpose is to maximize total points by keeping total cost < 100 ?Disapproval
@AjaySrivastava Yes. Total cost must be equal to or below 100.Halftone
The obvious question is: which constraint is the more restricting: max_4_from_same_team, or total_cost<=100? Given that Value=6.5 is up at the 86th quantile, max_4_from_same_team is more restraining. We could use a generator-based solution to yield those.Vassily
S
13

It is a classical operations research problem.

There are tons of algorithms that permit to find an optimal (or just a very good depending on the algorithm) solution :

  • Mixed-Integer Programming
  • Metaheuristics
  • Constraint Programming
  • ...

Here is a code that will find the optimal solution using MIP, ortools library and default solver COIN-OR :

from ortools.linear_solver import pywraplp
import pandas as pd


solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)    
cyclist_df = pd.read_csv('cyclists.csv')

# Variables

variables_name = {}
variables_team = {}

for _, row in cyclist_df.iterrows():
    variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
    if row['Ploeg'] not in variables_team:
        variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))

# Constraints

# Link cyclist <-> team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, solver.infinity())
    constraint.SetCoefficient(var, 1)
    for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
        constraint.SetCoefficient(variables_name[cyclist], -1)

# Max 4 cyclist per team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, 4)
    constraint.SetCoefficient(var, 1)

# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
    constraint_max_cyclists.SetCoefficient(cyclist, 1)

# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
    constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])    

# Objective 
objective = solver.Objective()
objective.SetMaximization()

for _, row in cyclist_df.iterrows():
    objective.SetCoefficient(variables_name[row['Naam']], row['Punten totaal:'])

# Solve and retrieve solution     
solver.Solve()

chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]

print(cyclist_df[cyclist_df.Naam.isin(chosen_cyclists)])

Prints :

    Naam                Ploeg                       Punten totaal:  Waarde
1   SAGAN Peter         BORA - hansgrohe            522             11.5
2   GROENEWEGEN         Dylan   Team Jumbo-Visma    205             11.0
8   VIVIANI Elia        Deceuninck - Quick Step     273             9.5
11  ALAPHILIPPE Julian  Deceuninck - Quick Step     399             9.0
14  PINOT Thibaut       Groupama - FDJ              155             8.5
15  MATTHEWS Michael    Team Sunweb                 323             8.5
22  TRENTIN Matteo      Mitchelton-Scott            218             7.5
24  COLBRELLI Sonny     Bahrain Merida              238             6.5
25  VAN AVERMAET Greg   CCC Team                    192             6.5
44  STUYVEN Jasper      Trek - Segafredo            201             4.5
51  CICCONE Giulio      Trek - Segafredo            153             4.0
82  TEUNISSEN Mike      Team Jumbo-Visma            255             3.0
83  HERRADA Jesús       Cofidis, Solutions Crédits  255             3.0
104 NIZZOLO Giacomo     Dimension Data              121             2.5
123 MEURISSE Xandro     Wanty - Groupe Gobert       141             2.0
151 TRATNIK Jan Bahrain Merida                      87              1.0

How does this code solve the problem ? As @KyleParsons said, it looks like the knapsack problem and can be modelized using Integer Programming.

Let's define variables Xi (0 <= i <= nb_cyclists) and Yj (0 <= j <= nb_teams).

Xi = 1 if cyclist n°i is chosen, =0 otherwise

Yj = n where n is the number of cyclists chosen within team j

To define the relation between those variable, you can model these constraints :

# Link cyclist <-> team
For all j, Yj >= sum(Xi, for all i where Xi is part of team j)

To select only 4 cyclists per team max, you create these constraints :

# Max 4 cyclist per team
For all j, Yj <= 4

To select 16 cyclists, here are the associated constraints :

# Min 16 cyclists 
sum(Xi, 1<=i<=nb_cyclists) >= 16
# Max 16 cyclists 
sum(Xi, 1<=i<=nb_cyclists) <= 16

The cost constraint :

# Max cost 
sum(ci * Xi, 1<=i<=n_cyclists) <= 100 
# where ci = cost of cyclist i

Then you can maximize

# Objective
max sum(pi * Xi, 1<=i<=n_cyclists)
# where pi = nb_points of cyclist i

Notice that we model the problem using linear objective and linear inequation constraints. If Xi and Yj would be continous variables, this problem would be polynomial (Linear programming) and could be solved using :

Because these variables are integers (Integer Programming or Mixed Integer Programming), the problem is known as be part of NP_complete class (cannot be solved using polynomial solutions unless you are a genious). Solvers like COIN-OR use complex Branch & Bound or Branch & Cut methods to solve them efficiently. ortools provides a nice wrapper to use COIN with python. These tools are free & open source.

All these methods have the advantage of finding an optimal solution without iterating on all the possible solutions (and considerably reduce the combinatorics).

Sulfonate answered 17/7, 2019 at 15:39 Comment(4)
Very interesting, thanks so much for your comprehensive answer!Halftone
I think the explanation you added is quite clear. However, I find the documentation provided here to be a little sparse. My eventual aim is to find the optimal team of 16, where for every stage the best 10 riders are selected. The CSV I posted was actually modified, my original one also contains a list for each rider with their score for each stage. This list looks like this [0, 40, 13, 0, 2, 55, 1, 17, 0, 14]. Would I be able to do this by simply modifying the objective, or would I have to alter the code massively?Halftone
This is exactly what is hard about this problem. I am trying to find the team that performs the best overall. So I have a pool of 16 cyclists, from which the score of 10 cyclists counts towards the score of each day. The scores for each day are then summed to get a total score. The purpose is to get this final total score as high as possible. I am currently thinking this could be achieved by setting a coefficient for the objective for each stage, but can't quite figure out how to implement the fact that only the score of the ten best riders for each stage counts towards the final score.Halftone
@Halftone I posted a new answer for the new problem.Sulfonate
S
2

I add an other answer for your question :

The CSV I posted was actually modified, my original one also contains a list for each rider with their score for each stage. This list looks like this [0, 40, 13, 0, 2, 55, 1, 17, 0, 14]. I am trying to find the team that performs the best overall. So I have a pool of 16 cyclists, from which the score of 10 cyclists counts towards the score of each day. The scores for each day are then summed to get a total score. The purpose is to get this final total score as high as possible.

If you think I should edit my first post please let me know, I think that it is more clear like this because my first post is quite dense and answers the initial question.

Let's introduce a new variable :

Zik = 1 if cyclist i is selected and is one of the top 10 in your team on day k

You need to add these constraints to link variables Zik and Xi (variable Zik cannot be = 1 if cyclist i is not selected i.e if Xi = 0)

For all i, sum(Zik, 1<=k<=n_days) <= n_days * Xi

And these constraints to select 10 cyclists per day :

For all k, sum(Zik, 1<=i<=n_cyclists) <= 10

Finally, your objective could be written like this :

Maximize sum(pik * Xi * Zik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k

And here is the thinking part. An objective written like this is not linear (notice the multiplication between the two variables X and Z). Fortunately, there are both binaries and there is a trick to transform this formula to its linear form.

enter image description here

Let's introduce again new variables Lik (Lik = Xi * Zik) to linearize the objective.

The objective can now be written like this and be linear :

Maximize sum(pik * Lik, 1<=i<=n_cyclists, 1 <= k <= n_days)
# where pik = nb_points of cyclist i at day k

And we need now to add these constraints to make Lik equal to Xi * Zik :

For all i,k : Xi + Zik - 1 <= Lik
For all i,k : Lik <= 1/2 * (Xi + Zik)

And voilà. This is the beauty of mathematics, you can model a lot of things with linear equations. I presented advanced notions and it is normal if you don't understand them at first glance.


I simulated the score per day column on this file.

Here is the Python code to solve the new problem :

import ast
from ortools.linear_solver import pywraplp
import pandas as pd


solver = pywraplp.Solver('cyclist', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
cyclist_df = pd.read_csv('cyclists_day.csv')
cyclist_df['Punten_day'] = cyclist_df['Punten_day'].apply(ast.literal_eval)

# Variables
variables_name = {}
variables_team = {}
variables_name_per_day = {}
variables_linear = {}

for _, row in cyclist_df.iterrows():
    variables_name[row['Naam']] = solver.IntVar(0, 1, 'x_{}'.format(row['Naam']))
    if row['Ploeg'] not in variables_team:
        variables_team[row['Ploeg']] = solver.IntVar(0, solver.infinity(), 'y_{}'.format(row['Ploeg']))

    for k in range(10):
        variables_name_per_day[(row['Naam'], k)] = solver.IntVar(0, 1, 'z_{}_{}'.format(row['Naam'], k))
        variables_linear[(row['Naam'], k)] = solver.IntVar(0, 1, 'l_{}_{}'.format(row['Naam'], k))

# Link cyclist <-> team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, solver.infinity())
    constraint.SetCoefficient(var, 1)
    for cyclist in cyclist_df[cyclist_df.Ploeg == team]['Naam']:
        constraint.SetCoefficient(variables_name[cyclist], -1)

# Max 4 cyclist per team
for team, var in variables_team.items():
    constraint = solver.Constraint(0, 4)
    constraint.SetCoefficient(var, 1)

# Max cyclists
constraint_max_cyclists = solver.Constraint(16, 16)
for cyclist in variables_name.values():
    constraint_max_cyclists.SetCoefficient(cyclist, 1)

# Max cost
constraint_max_cost = solver.Constraint(0, 100)
for _, row in cyclist_df.iterrows():
    constraint_max_cost.SetCoefficient(variables_name[row['Naam']], row['Waarde'])

# Link Zik and Xi
for name, cyclist in variables_name.items():
    constraint_link_cyclist_day = solver.Constraint(-solver.infinity(), 0)
    constraint_link_cyclist_day.SetCoefficient(cyclist, - 10)
    for k in range(10):
        constraint_link_cyclist_day.SetCoefficient(variables_name_per_day[name, k], 1)

# Min/Max 10 cyclists per day
for k in range(10):
    constraint_cyclist_per_day = solver.Constraint(10, 10)
    for name in cyclist_df.Naam:
        constraint_cyclist_per_day.SetCoefficient(variables_name_per_day[name, k], 1)

# Linearization constraints 
for name, cyclist in variables_name.items():
    for k in range(10):
        constraint_linearization1 = solver.Constraint(-solver.infinity(), 1)
        constraint_linearization2 = solver.Constraint(-solver.infinity(), 0)

        constraint_linearization1.SetCoefficient(cyclist, 1)
        constraint_linearization1.SetCoefficient(variables_name_per_day[name, k], 1)
        constraint_linearization1.SetCoefficient(variables_linear[name, k], -1)

        constraint_linearization2.SetCoefficient(cyclist, -1/2)
        constraint_linearization2.SetCoefficient(variables_name_per_day[name, k], -1/2)
        constraint_linearization2.SetCoefficient(variables_linear[name, k], 1)

# Objective 
objective = solver.Objective()
objective.SetMaximization()

for _, row in cyclist_df.iterrows():
    for k in range(10):
        objective.SetCoefficient(variables_linear[row['Naam'], k], row['Punten_day'][k])

solver.Solve()

chosen_cyclists = [key for key, variable in variables_name.items() if variable.solution_value() > 0.5]

print('\n'.join(chosen_cyclists))

for k in range(10):
    print('\nDay {} :'.format(k + 1))
    chosen_cyclists_day = [name for (name, day), variable in variables_name_per_day.items() 
                       if (day == k and variable.solution_value() > 0.5)]
    assert len(chosen_cyclists_day) == 10
    assert all(chosen_cyclists_day[i] in chosen_cyclists for i in range(10))
    print('\n'.join(chosen_cyclists_day))

Here are the results :

Your team :

SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
BENOOT Tiesj
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro
GRELLIER Fabien

Selected cyclists per day

Day 1 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús

Day 2 :
SAGAN Peter
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
NIZZOLO Giacomo
MEURISSE Xandro

Day 3 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
MATTHEWS Michael
TRENTIN Matteo
VAN AVERMAET Greg
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús

Day 4 :
SAGAN Peter
VIVIANI Elia
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús

Day 5 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
CICCONE Giulio
HERRADA Jesús

Day 6 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
TRENTIN Matteo
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike

Day 7 :
SAGAN Peter
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
COLBRELLI Sonny
VAN AVERMAET Greg
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
MEURISSE Xandro

Day 8 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
MATTHEWS Michael
STUYVEN Jasper
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo
MEURISSE Xandro

Day 9 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
ALAPHILIPPE Julian
PINOT Thibaut
TRENTIN Matteo
COLBRELLI Sonny
VAN AVERMAET Greg
TEUNISSEN Mike
HERRADA Jesús

Day 10 :
SAGAN Peter
GROENEWEGEN Dylan
VIVIANI Elia
PINOT Thibaut
COLBRELLI Sonny
STUYVEN Jasper
CICCONE Giulio
TEUNISSEN Mike
HERRADA Jesús
NIZZOLO Giacomo

Let's compare the results of answer 1 and answer 2 print(solver.Objective().Value()):

You get 3738.0 with the first model, 3129.087388325567 with the second one. The value is lower because you select only 10 cyclists per stage instead of 16.

Now if keep the first solution and use the new scoring method, we get 3122.9477585307413

We could consider that the first model is good enough : we didn't have to introduce new variables/constraints, model stays simple and we got a solution almost as good as the complex model. Sometimes it is not necessary to be 100% accurate and a model can be solved more easily and quickly with some approximations.

Sulfonate answered 22/7, 2019 at 10:41 Comment(3)
Thanks so much! This looks great. However, when I try to run this with my CSV with the correct scores per day, I get an AssertionError for day 4, where it only picks 9 riders. I can't quite figure out what is causing this as I haven't had time to fully dive into your code yet. Do you have any idea what could be causing this? I have pasted the CSV I am using here.Halftone
I have found what went wrong. The constraint for cyclists per day was set to be at max 10 cyclist, but this also makes it possible for it to be less. Therefore this constrain should be constraint_cyclist_per_day = solver.Constraint(10, 10), so it has no other option than to pick 10 cyclists. Just for clarity, this should be line 56 in your code.Halftone
@Halftone exactly ! On day 5, only 6 cyclists have points > 0. There is no need for the actual model to select more than 6. Your fix is perfect, glad you have found it.Sulfonate

© 2022 - 2024 — McMap. All rights reserved.