Generate teams of similar strengths with contiguous areas
Asked Answered
D

2

12

Idea

It's quite sad that so many great countries (e.g. India) and players (e.g. Mo Salah) may never play at the FIFA (football/soccer) World Cup (the same argument could apply as well to other sports events dominated by a handful of dominant teams, such international cricket and basketball tournaments).

enter image description here

It would be neat to try and create a more balanced event, while still keeping the location-based element, where each team is of a (roughly) similar strength, with all of the players from a contiguous area (preferably of the same landmass, but obviously not possible in all circumstances, and beyond the scope of this question).

For example, in the case of a weaker region, maybe their team would span many countries (e.g. maybe a South Asian team), whereas for a strong country, maybe there would be multiple teams (e.g. a team of just the players from the suburbs of Paris).

I could then afterwards plot the areas, maybe using Voronois.

Background

I managed to gather the relevant information, through scraping (of Football Manager and Transfermarkt), but I'm stuck on how to design the algorithm to select the teams.

Problem

There is a large number of coordinates, which correspond to places of birth of players. These places all have players, and the players all have ratings (from 1 - 100) and positions.

The task is, given a certain team size (11), and a certain number of teams (in the example below, 10), and a certain number of required players in each position (1, though substitutes would be handy), divide the area up into contiguous areas where the best team you can form from the players of the given area has roughly equal skill to the teams of other areas.

Question

I've been reading a bit about graph theory things, but I'm unsure where to begin creating a program which solves this problem.

I was looking for some guidance on this, and (if it's possible), if you could create something with the toy example, that would be amazing!!

Also, if tackling the overall problem is too difficult, if you can find a way to narrow the problem, and can create something which addresses that smaller problem, which can then be generalised to the larger problem, that would be great too.

Sample code (in R, but I've included Python and Julia equivalent code further below- an answer using one of these (or similar) would be great)

set.seed(0)

library(tidyverse)

df <- tibble(
    # generate random latitudes and longitudes, and the number of players for that location
    lat = runif(100, 0, 90),
    lon = runif(100, 0, 180),
    players = rnorm(100, 0, 5) |> abs() |> ceiling() |> as.integer()
)

num_positions <- 11

position <- letters[1:num_positions]

df <- df |>
   # generate position and skill data, and unnest
  mutate(position = map(players, ~sample(position, .x, replace = TRUE)),
         skill = map(players, ~sample(1:100, .x, replace = TRUE)))|>
         unnest_longer(c(position, skill))

# plot the data
df |> 
  summarise(skill = mean(skill), players = first(players), .by = c(lat, lon)) |>
  ggplot(aes(x = lon, y = lat)) +
    geom_point(aes(size = players, color = skill)) +
    scale_size_continuous(range = c(1, 10)) +
    scale_color_viridis_c(option = "magma") + # similar to the Colombian shirt colours
    theme_minimal() +
    theme(legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

n_teams <- 10

Scatter plot of the sample data (circle size is number of players, colour is average skill) scatter plot of sample data

Notes

  1. In the thing I want to do, it would involve 64 teams, so at least 704 players, and probably around 3x the size of the sample dataset. The real dataset has a lot of rows, but by filtering it, I should be able to get it down to a few thousand.

  2. In real life, some players can play well in more than one position, but in the example code I gave, each player only has one position. Adding multiple positions per player would likely make this a lot more difficult to solve, so it's outside the scope of this question.

  3. If you can do it across a sphere (like the globe), that would be amazing, but a rectangle would be okay too.

  4. Reinderien has helpfully pointed out that using contiguousness without compactness could lead to some Amigara Fault-like jerrymandered monstrosities, similar to what we see in US electoral maps. So although I'm still trying to figure out a better title, I would say prioritise compactness over continguity if it works better for you (though I'm guessing this would throw up the problem of exclaves?)

Update:

Python code:
import random, math, pandas

random.seed(0)

df = pandas.DataFrame({'lat': [random.uniform(0, 90) for i in range(100)],
                'lon': [random.uniform(0, 180) for i in range(100)],
                'players':[math.ceil(abs(random.normalvariate(0, 5))) for i in range(100)]})

num_players = 11

positions = list(map(chr, range(97, 97+num_players)))

df['position'] = df['players'].apply(lambda x: random.choices(positions, k=x))
df['skill'] = df['players'].apply(lambda x: random.choices(range(1, 101), k=x))
Julia code:
using Random, DataFrames
Random.seed!(0)

df = DataFrame(lat = rand(100) * 90, 
               lon = rand(100) * 180, 
               players = Int64.(ceil.(abs.(5randn(100)))))

num_positions = 11

position = ['a'+ i for i in 0:num_positions-1]

df[!, :position] = [rand(position, players) for players in df.players]
df[!, :skill] = [rand(1:100, players) for players in df.players]

df = flatten(df, [:position, :skill])
Dilapidate answered 14/8, 2023 at 4:8 Comment(18)
to the person who voted to close because it needs more focus- I'm literally only asking about one aspect (the algorithm), the rest is contextDilapidate
if people have suggestions for narrowing the problem though, please comment them, I'd be happy to hear them! :-)Dilapidate
It sounds like you could perhaps use simulated annealing or another geospatial clustering technique here. You might find this package helpful.Rand
@AllanCameron thanks for your comment! Simulated annealing looks good for finding a global maximum. Unsure how geospatial clustering would work with this problemDilapidate
Aside: as an Indian, World Cup immediately meant the Cricket World Cup to me, and I was wondering why India couldn't play in it. From websearching the name Haaland, it looks like he's a football player, so I guess you're talking about the FIFA World Cup 😀Ssw
hahah yes! I should have made that clearer!Dilapidate
there's also the Rugby World Cup, the Women's World Cup etc. See: en.wikipedia.org/wiki/List_of_world_cupsDilapidate
@SundarR my question could work for the cricket world cup though! It would be interesting seeing different regions face each other (especially when the most powerful teams, like India and Australia, are so much more powerful than the restDilapidate
@SundarR Apologies, we are just in Women's Football World Cup fever here in Australia and I'd assumed everyone else was the same way 😆Dilapidate
If I understand your objective correctly, you want to group the players (by size 10, for example) such that the formed teams has close strength, with the following constraints: (1) each team has all its needed roles (or skills); (2) the areas of teams should be contiguous.Amoebocyte
If I didn't miss any other constraints, generally speaking I would say that your question is not trivial at all since the content in constraint (1) has provided sufficient complexity to the problem itself. I guess it might be better to start from only one of the constraints, e.g., constraint (1), to see where we can reach and then decide how other constraints can be added.Amoebocyte
@Amoebocyte thank you so much for your message! You have interpreted it correctly. I know constraint #2 would add to the complexity of any answer, though I'm not sure how much it would (e.g. maybe it would just be one line to filter nodes connected by edges, idk?). I think it might increase speed (as instead of checking every node, you only check all neighbours (but I might be wrong!)Dilapidate
From the optimization perspective, you are right that more constraints narrows down the size of feasible set of arguments, thus speeding the procedure of solving the problem.Amoebocyte
I thought I could make it so that every place had only one player, but I worry that would then not generalise to a program where many players are born in the same place. But maybe that would indeed be okay, and there is some workaround I can't think ofDilapidate
@Amoebocyte the wrinkle is that if you are surrounded by places with terrible players, you would then have to look beyond the immediate nodes if you don't want to get a subpar outcomeDilapidate
@Dilapidate Yes, I admit that's a tricky partAmoebocyte
@Amoebocyte re. more constraints narrows down the size of feasible set of arguments, thus speeding the procedure of solving the problem - this is only really true if the constraints are in a pre-solve step prior to optimization. Otherwise, the opposite is generally true. The optimizer needs to work harder to find feasible solutions because they occupy a smaller portion of the parameter space.Entity
@Mark: Looking back, the question you posted reminds me of a previous question I had posted many months ago! #74947334Thyrotoxicosis
E
6

First let's talk about contiguity. I think that you should value compactness over contiguity; after all, the Americans have proven that anything can be contiguous if you believe it hard enough. Contiguity is a "hard problem" to represent and compactness is an "easy problem" to represent. Scalability becomes a challenge because you aren't performing a simple similarity clustering (e.g. k-means); you're performing clustering with much more complex criteria. This is not my specialty, and so in my demonstration I show a simplification that is stochastically correct: assuming player skill and position are all uniformly distributed, a random selection of team centroid based on population density that is also of uniform distribution will represent reality (especially if you run multiple outer iterations with random perturbation). Specifically, if you have time to burn and want to refine the solution, you can wrap this approach in a differential evolution loop with the team centroid as the evolved parameter.

Speaking of complex criteria - this system is only useful if you construct correct teams, and correct teams require position criteria. I present a matrix that you can adjust to set lower and upper bounds on the required number of positions per team.

It is possible - but potentially very slow - to tell an optimizer to narrow the range of allowed skill levels. You may be best off to perform this as a pre-solve step, eliminating potential players that are some selected number of standard deviations away from your target skill level. Either that, or represent fairness not as an LP objective but as an LP constraint, with bounds on the minimum and maximum skill sum per team. I demonstrate the latter; but note that this is most feasible when we constrain the mean team skill toward the mean skill overall (in this case 50). Defining a range to be too narrow (less than ~ 20%) or defining the target to be far away from the mean make solution difficult.

All together,

import geopandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pulp
import shapely
from numpy.random import default_rng, Generator

n_teams = 64
players_per_team = 11
total_players = n_teams * players_per_team
max_skill = 100


def synthesize_world_data(rand: Generator) -> tuple[
    geopandas.GeoDataFrame,  # player and country data
    pd.DataFrame,            # field position data
]:
    # Load the default world dataset from geopandas,
    # dropping a couple of countries with degenerate bounds.
    # There are better ways to fix this (not shown here).
    world = (
        geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
        .rename(columns={'name': 'country'})
    )
    world.index.name = 'country'
    world[['left', 'bottom', 'right', 'top']] = shapely.bounds(world['geometry'])
    is_degenerate = np.isclose(world['right'] - world['left'], 360)
    world = world[~is_degenerate]

    # Assume that talent is uniformly distributed throughout the world,
    # so players are spatially uniformly distributed. Draw from a pool
    # of players ~2 times what is needed for final team formation.
    pop_ratio = 2 * total_players / world['pop_est'].sum()
    world['players_est'] = world['pop_est'] * pop_ratio

    # Hack: perform an inefficient but statistically correct spatial random distribution of players
    # within their countries by allocating random coordinates within rectilinear bounds and then doing
    # geometric elimination of those players that have decided to live in the ocean
    world['geodensity'] = shapely.area(world['geometry'])/(world['right'] - world['left'])/(world['top'] - world['bottom'])
    n_soggy_players = (world['players_est'] / world['geodensity']).round().astype(int)
    soggy_allocated = pd.DataFrame(
        np.arange(n_soggy_players.max()) < n_soggy_players.values[:, np.newaxis],
        index=n_soggy_players.index,
        columns=pd.RangeIndex(name='country_player', start=0, stop=n_soggy_players.max()),
    ).replace(False, np.nan).stack()
    world, _ = world.align(soggy_allocated, axis=0)

    world['player_lon'] = rand.uniform(low=world['left'], high=world['right'])
    world['player_lat'] = rand.uniform(low=world['bottom'], high=world['top'])
    not_soggy = shapely.contains_xy(world['geometry'], world[['player_lon', 'player_lat']])
    world = world[not_soggy]

    # Assume that skill and position are uniformly distributed
    # https://en.wikipedia.org/wiki/Association_football_positions
    world['player_skill'] = rand.uniform(low=0, high=max_skill, size=len(world))
    positions = pd.DataFrame(
        {
            'min_players': [1, 2, 2, 2],
            'max_players': [1, 5, 5, 5],
        },
        index=pd.Index(name='name', data=['goalkeep', 'defender', 'midfield', 'forward']),
    )
    pos_mean = (positions['min_players'] + positions['max_players'])/2
    world['position'] = rand.choice(
        a=positions.index,
        p=pos_mean/pos_mean.sum(),
        size=len(world),
    )
    world.set_index(
        pd.RangeIndex(name='player', start=0, stop=len(world)),
        append=False, inplace=True,
    )

    return world, positions


def guess_team_centroids(rand: Generator, world: geopandas.GeoDataFrame) -> tuple[
    pd.DataFrame,  # centroids per team
    pd.Series,     # player-centroid distances
]:
    coord_fields = ['player_lat', 'player_lon']

    centroid_idx = rand.choice(a=world.index, size=n_teams)
    centroids = (
        world.loc[
            centroid_idx,
            coord_fields,
        ]
        .rename(columns={'player_lat': 'team_lat', 'player_lon': 'team_lon'})
        .set_index(pd.RangeIndex(name='team', start=0, stop=n_teams))
    )

    coords = np.deg2rad(
        pd.merge(
            how='cross', left=centroids.reset_index(),
            right=world[coord_fields].reset_index(),
        ).set_index(['team', 'player'])
    )

    # Haversine norm (how is this not in geopandas?)
    a = (
        np.sin((coords['player_lat'] - coords['team_lat'])/2)**2 +
        np.sin((coords['player_lon'] - coords['team_lon'])/2)**2
        * np.cos(coords['team_lat']) * np.cos(coords['player_lat'])
    )
    c = np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    r = (world.crs.ellipsoid.semi_major_metre + world.crs.ellipsoid.semi_minor_metre) * 1e-3

    distances = r*c
    return centroids, distances


def make_vars(world: geopandas.GeoDataFrame) -> pd.Series:
    team_idx = pd.RangeIndex(name='team', start=0, stop=n_teams)

    assign_idx = pd.MultiIndex.from_product((team_idx, world.index))
    names = assign_idx.to_frame().astype(str)
    assigns = (
        'asn_t' + names['team'] + '_p' + names['player']
    ).apply(pulp.LpVariable, cat=pulp.LpBinary)

    return assigns


def make_objective(
    assigns: pd.Series,
    distances: pd.Series,
) -> pulp.LpAffineExpression:
    # For compactness, minimize the distance from each player to their team's centroid
    return pulp.lpDot(assigns, distances)/total_players/100


def add_constraints(
    prob: pulp.LpProblem,
    world: geopandas.GeoDataFrame,
    positions: pd.DataFrame,
    assigns: pd.Series,
) -> None:
    for team, group in assigns.groupby('team'):
        # There must be 11 players per team
        prob.addConstraint(
            name=f'teamsize_t{team}',
            constraint=pulp.lpSum(group) == players_per_team,
        )

        # Enforce competitive team skill sum
        skill_deviation = pulp.lpDot(
            group,
            world.loc[group.index.get_level_values('player'), 'player_skill'],
        )/players_per_team - max_skill/2
        prob.addConstraint(
            name=f'skill_lo_t{team}',
            constraint=skill_deviation >= -10,
        )
        prob.addConstraint(
            name=f'skill_hi_t{team}',
            constraint=skill_deviation <= 10,
        )

    # Each player may only be assigned up to one team
    for player, group in assigns.groupby('player'):
        prob.addConstraint(
            name=f'playerexcl_p{player}',
            constraint=pulp.lpSum(group) <= 1,
        )

    # Enforce the team position bounds
    for position, (tmin, tmax) in positions.iterrows():
        pos_players = world.index[world['position'] == position]
        pos_assigns = assigns.loc[(slice(None), pos_players)]
        for team, group in pos_assigns.groupby('team'):
            total = pulp.lpSum(group)
            prob.addConstraint(
                name=f'poslo_t{team}_{position}',
                constraint=total >= tmin,
            )
            prob.addConstraint(
                name=f'poshi_t{team}_{position}',
                constraint=total <= tmax,
            )


def solve(prob: pulp.LpProblem, world: geopandas.GeoDataFrame, assigns: pd.Series) -> geopandas.GeoDataFrame:
    prob.solve()
    if prob.status != pulp.LpStatusOptimal:
        raise ValueError('Solution status', prob.status)

    assigns = (
        assigns.apply(pulp.LpVariable.value)
        .astype(int)
        .unstack(level='team')
    )

    team_player_idx, team_idx = assigns.values.nonzero()
    world.loc[team_player_idx, 'team'] = team_idx
    return world


def dump_solution(
    world: geopandas.GeoDataFrame,
) -> None:
    pd.set_option('display.max_rows', 1000)
    pd.set_option('display.max_columns', 1000)
    pd.set_option('display.width', 1000)
    world = world.loc[
        world['team'].notna(),
        ['continent', 'country', 'team', 'position', 'player_skill'],
    ]

    print('Players by country:')
    print(world.sort_values(['continent', 'country', 'team', 'position']), end='\n\n')

    print('Players by team and position:')
    print(world.sort_values(['team', 'position']), end='\n\n')

    print('Team summary:')
    grouped = world.groupby('team')
    teams = world.groupby(['team', 'position']).country.count().unstack('position')
    teams.insert(loc=0, column='continent', value=grouped['continent'].agg(pd.Series.mode).astype(str))
    teams.insert(loc=1, column='country', value=grouped['country'].agg(pd.Series.mode).astype(str))
    teams.insert(loc=2, column='skill', value=grouped['player_skill'].sum() / players_per_team)
    teams.sort_values(['continent', 'country', 'skill'], inplace=True)
    print(teams, end='\n\n')


def plot_solution(world: geopandas.GeoDataFrame) -> plt.Figure:
    fig, ax = plt.subplots()
    world['geometry'].boundary.plot(ax=ax)

    for team, group in world.groupby('team'):
        ax.scatter(group['player_lon'], group['player_lat'])

    return fig


def main() -> None:
    rand = default_rng(seed=0)

    print('Synthesizing world data...')
    world, positions = synthesize_world_data(rand)
    centroids, distances = guess_team_centroids(rand, world)

    print('Making assignment variables...')
    assigns = make_vars(world)
    prob = pulp.LpProblem(name='football_clustering', sense=pulp.LpMinimize)

    print('Defining objective...')
    prob.objective = make_objective(assigns, distances)

    print('Adding constraints...')
    add_constraints(prob, world, positions, assigns)

    # print(prob)
    print('Solving...')
    world = solve(prob, world, assigns)

    dump_solution(world)
    plot_solution(world)
    plt.show()


if __name__ == '__main__':
    main()

example global output

...
Team summary:
position      continent                                   country      skill  defender  forward  goalkeep  midfield
team                                                                                                               
6.0              Africa                                   Algeria  44.963361         3        2         1         5
2.0              Africa                              Burkina Faso  45.802411         5        3         1         2
56.0             Africa                           Dem. Rep. Congo  47.018698         3        2         1         5
10.0             Africa                                  Ethiopia  40.294830         3        4         1         3
11.0             Africa                                  Ethiopia  43.277617         2        5         1         3
38.0             Africa                                Madagascar  59.876627         3        2         1         5
54.0             Africa                                   Morocco  55.331606         3        2         1         5
63.0             Africa                                   Nigeria  40.088676         4        3         1         3
55.0             Africa                              South Africa  55.269027         3        4         1         3
62.0             Africa                                    Uganda  47.396431         3        2         1         5
36.0             Africa                                    Uganda  48.577035         2        5         1         3
35.0             Africa  ['Burundi' 'Dem. Rep. Congo' 'Tanzania']  53.517437         3        4         1         3
20.0               Asia                                Bangladesh  57.757983         4        2         1         4
22.0               Asia                                Bangladesh  58.492236         5        3         1         2
3.0                Asia                                     China  40.099363         2        4         1         4
9.0                Asia                                     China  40.716526         3        2         1         5
34.0               Asia                                     China  41.488606         4        2         1         4
53.0               Asia                                     China  42.986541         4        4         1         2
7.0                Asia                                     China  43.353136         2        5         1         3
49.0               Asia                                     China  44.045816         3        2         1         5
1.0                Asia                                     China  45.908339         3        3         1         4
52.0               Asia                                     China  49.910424         5        3         1         2
40.0               Asia                                     China  50.852176         3        2         1         5
14.0               Asia                                     China  53.105643         3        2         1         5
15.0               Asia                                     China  53.245060         4        2         1         4
23.0               Asia                                     China  57.018255         4        2         1         4
45.0               Asia                                     China  58.395190         3        2         1         5
57.0               Asia                                     China  59.197112         2        3         1         5
26.0               Asia                                     China  59.516640         3        4         1         3
21.0               Asia                                     China  59.535480         4        3         1         3
44.0               Asia                                     India  41.307787         2        5         1         3
48.0               Asia                                     India  41.448672         2        5         1         3
37.0               Asia                                     India  41.713956         2        3         1         5
5.0                Asia                                     India  44.313757         2        5         1         3
19.0               Asia                                     India  47.688853         2        5         1         3
39.0               Asia                                     India  53.535032         4        4         1         2
43.0               Asia                                     India  54.314630         2        3         1         5
50.0               Asia                                     India  56.103479         2        5         1         3
0.0                Asia                                     India  56.155755         2        3         1         5
59.0               Asia                                     India  56.180263         4        4         1         2
17.0               Asia                                     India  58.247667         2        5         1         3
16.0               Asia                                     India  58.393727         2        4         1         4
31.0               Asia                                     India  58.884043         3        3         1         4
41.0               Asia                                     India  59.004900         2        3         1         5
12.0               Asia                                     India  59.838218         2        4         1         4
28.0               Asia                                 Indonesia  44.747853         4        4         1         2
27.0               Asia                                 Indonesia  53.665286         4        2         1         4
60.0               Asia                                     Japan  44.237347         4        2         1         4
4.0                Asia                                  Pakistan  41.385745         2        5         1         3
58.0               Asia                               Philippines  42.990101         2        3         1         5
42.0               Asia                              Saudi Arabia  42.746346         4        2         1         4
32.0               Asia                                    Turkey  49.816575         3        4         1         3
51.0               Asia                        ['China' 'Taiwan']  42.435033         2        4         1         4
47.0               Asia                 ['Pakistan' 'Tajikistan']  45.447561         4        2         1         4
25.0             Europe                                    France  53.116500         5        2         1         3
30.0             Europe                                   Germany  40.177276         3        5         1         2
13.0             Europe                                     Italy  46.469427         3        2         1         5
61.0             Europe                                    Poland  46.816901         3        5         1         2
33.0             Europe                                   Romania  58.854495         3        3         1         4
29.0      North America                                    Mexico  49.474287         2        5         1         3
46.0      North America                                    Mexico  55.503530         3        2         1         5
24.0      North America                  United States of America  40.777742         2        4         1         4
8.0       South America                                    Brazil  55.562492         5        2         1         3
18.0      South America                    ['Brazil' 'Venezuela']  53.190197         3        5         1         2
Entity answered 21/8, 2023 at 23:35 Comment(6)
Thank you so much for your answer! A couple of responses to some things you mentioned in your question- 1. yes, compactness is a better metric! I didn't think of the issue of gerrymandering. I'll add a note about this to the questionDilapidate
2. re: pre-solve step - yes, I totally agree - that was the intention behind my comment in the notes section of the question about filtering the number of players down to a few thousand. 3. n_soggy_players made laugh out loudDilapidate
4. I can't tell if in your example if players can occupy the same coordinates, but in case they don't, in the real example they do (e.g. because a player's place of birth is just listed as the closest city). One thing I was unsure about was whether it's better to create something within the algorithm to take advantage of this (e.g. if you can kill two birds with one stone by expanding an area to include a location with a couple of players that improve the team) or if it would be easier/better to just include a bit of jitter to the data and use something more basicDilapidate
Yes, multiple players can occupy one locationEntity
Re. jitter - that should not be necessary. By the current objective, overlapping player coordinates actually improves the solution (if the players get assigned the same team). The only time it might become a problem is if you attempt Voronoi and overlapping players are on different teams.Entity
re:overlapping players are on different teams - players in the same place on different teams runs counter to the whole idea!! Though maybe it is unavoidable. A big part of the intractableness for me was that there are the two layers to the problem, the places and the players. Getting the two to agree is something of a mindfuckDilapidate
A
3

Disclaimer

Sorry that I don't have a real and exact solution for your full question so far (as we discussed in the comments, the question involves with many factors/constraints and seems not easy to tackle it within just one shot), but I guess probably we can depart from a minimal example to see if we can find some clues.

A Minimal Sub-Problem (with one of constraints only)

If I understand your objective correctly, you want to group the players such that the formed teams has close strength, along with the following constraints:

  1. Each team should be of equal size, i.e., 11.
  2. Each team has all its needed roles (or skills);
  3. The areas of teams should be contiguous.

From the optimization perspective, it is true that adding more constraints can somewhat accelerate the problem solving procedure, since the size of feasible set of variables is narrowed down, but it dramatically increases the complexity of formulating the optimization problem in the meanwhile. (as corrected by @Reinderien in the comment: "this is only really true if the constraints are in a pre-solve step prior to optimization. Otherwise, the opposite is generally true. The optimizer needs to work harder to find feasible solutions because they occupy a smaller portion of the parameter space.")

I am not aiming to completely solve your problem within this single answer, but would like to take a tiny bite from a minimal sub-problem of yours and see the possibilities for adding more constraints.

If we focus on the the objective plus the constraint 1) only, the problem can be translated as a classical integer programming problem:

Given a numeric vector vec and number of groups grpNr, how to evenly allocate the entries of vec into grpNr groups (grpNr is an integer that always divides length(vec)), such that the sums of groups gives the minimal difference.

Problem Formulation and Solving

There are lots of options for solving optimization problems in R, and here is just one implementation with CVXR package for example

library(CVXR)

# Input numeric vector where entries are to be grouped
set.seed(0)
vec <- runif(20, 0, 10)

# Given number of groups
grpNr <- 5

# Define optimization variables: Binary variables indicate
x <- Variable(length(vec), grpNr, boolean = TRUE)

# Define objective: minimize the difference in group sums
obj <- Minimize(norm(t(vec) %*% x - sum(vec) / grpNr))

# Constraint: each item must be in exactly one group
constrs <- list(
    sum_entries(x, 1) == 1,
    sum_entries(x, 2) == length(vec) / grpNr
)

# Formulate and solve the problem
prb <- Problem(obj, constrs)
res <- solve(prb)

# Info of resulting groups
grpInfo <- apply(round(res$getValue(x)) > 0, 2, which, simplify = FALSE)

and we can obtain a the grouping information

> grpInfo
[[1]]
[1]  1  3  8 11

[[2]]
[1]  7 12 13 19

[[3]]
[1]  6 10 14 18

[[4]]
[1]  4 16 17 20

[[5]]
[1]  2  5  9 15

and the distribution of group sum

> vec %*% round(res$getValue(x))
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 22.75283 22.72827 22.35437 22.20429 22.18618

Way Forward

To be honest, my capability is limited by my knowledge, so I have no idea if I am on the right track or where shall we go from this point ...

It should be noted that, the above is just for my first try on a minimal sub-problem, where the scalability might be unfortunately terrible (since the size of x explodes if we have a longer vec or a larger grpNr).

I will keep thinking about the potential improvement, and also would like to see contributions from other genius answers, particularly targeting on the entire problem. :)

Amoebocyte answered 21/8, 2023 at 23:4 Comment(1)
Hi Thomas! Thank you so much for your answer! Small update - Reinderien pointed out that continguousness as a condition could lead to jerrymandered messes of areas, so if you'd prefer, compactness is another optionDilapidate

© 2022 - 2025 — McMap. All rights reserved.