I have a very large mixed-integer linear programming problem that I need to run thousands of times, so speed is a priority.
I need it to run in both R
and in python
. My R
code runs very fast, the solver takes about 0.002 seconds to solve the problem. My python
code is a different story, it takes anywhere from of 0.05 to 0.80 seconds and often times much more.
I highly doubt that R
is 25x more efficient than python
, especially considering that (to my knowledge) both are using C
code to solve the problem. I'm looking for help to determine how to make my python
code faster. I assume it starts with getting a different package than the one I have now? I really need the python
code to consistently be below .01 seconds.
Here is sample R
code and python
code that solve the same problem. In this example of my real problem, I'm building "fantasy basketball" lineups with certain constraints. Player pool is 500, must choose exactly 8 players with at least 1 and no more than 3 from each position. Salary must be under 50000. I randomly sample Positions/Salary since the actual solution isn't important as much as the speed.
R Code
library(lpSolve)
library(dplyr)
# Make dummy dataframe with player positions, salaries, and projected points
Positions <- c('PG', 'SG', 'SF', 'PF', 'C', 'PG/SG', 'SG/SF', 'PF/C')
Position <- sample(Positions, 500, replace = TRUE)
Salary <- sample(seq(3000, 12000, by=100), 500, replace=TRUE)
playerTable <- data.frame(ID = 1:500, Position, Salary) %>%
mutate(ProjPts = round(Salary/1000*5))
# Make positional constraint vectors
ConVec_Position <- playerTable %>%
select(Position) %>%
mutate(AnyPG = grepl('PG', Position)*1,
AnySG = grepl('SG', Position)*1,
AnySF = grepl('SF', Position)*1,
AnyPF = grepl('PF', Position)*1,
AnyC = grepl('C', Position)*1,
orPGSGSFPFC = (AnyPG + AnySG + AnySF + AnyPF + AnyC > 0)*1
) %>%
select(-Position)
# Make salary constraint vectors
ConVec_Salary <- playerTable$Salary
# Get Objective (projected points)
ProjectedPoints <- playerTable$ProjPts
# Make constraint values and directions
ConValGT <- c(1,1,1,1,1, 8) # Must have at least one of each position, and 8 total
ConValLT <- c(3,3,3,3,2, 8, 50000) # Can't have more than 3 of each position, 8 total, and 50000 salary
ConDirGT <- c(rep(">=", ncol(ConVec_Position))) # Make >= directions for each position constraint
ConDirLT <- c(rep("<=", ncol(ConVec_Position) + 1)) # Make <= directions for each position constraint, plus 1 for Salary
# Compile constraint vectors, values, and directions
ConVec_Final <- cbind(ConVec_Position, ConVec_Position, ConVec_Salary)
ConDir_Final <- c(ConDirGT, ConDirLT)
ConVal_Final <- c(ConValGT, ConValLT)
start_time <- proc.time()
sol <- lp("max",
objective.in = ProjectedPoints,
const.mat = t(ConVec_Final),
const.dir = ConDir_Final,
const.rhs = ConVal_Final,
binary.vec = 1:length(ProjectedPoints) # All decisions are binary
)
cat("\n--- It took ", round((proc.time() - start_time)[3], 4), " seconds to optimize lineups ---\n", sep = "")
python Code
import pandas as pd
import numpy as np
from lp_solve import *
import time
# Make dummy dataframe with player positions, salaries, and projected points
ID = range(500)
Positions = ['PG', 'SG', 'SF', 'PF', 'C', 'PG/SG', 'SG/SF', 'PF/C']
Position = np.random.choice(Positions,500)
Salary = np.random.choice(range(3000,12000,100),500)
playerTable = pd.DataFrame({'ID': ID, 'Position': Position, 'Salary': Salary})
playerTable['ProjPts'] = playerTable['Salary']/1000*5
# Make positional constraint vectors
AnyPG = playerTable.Position.str.contains('PG')*1
AnySG = playerTable.Position.str.contains('SG')*1
AnySF = playerTable.Position.str.contains('SF')*1
AnyPF = playerTable.Position.str.contains('PF')*1
AnyC = playerTable.Position.str.contains('C')*1
AnyPos = [1]*500
ConVec_Position = pd.DataFrame({'AnyPG': AnyPG, 'AnySG': AnySG, 'AnySF': AnySF, 'AnyPF': AnyPF, 'AnyC': AnyC, 'AnyPos': AnyPos})
# Make salary constraint vectors
ConVec_Salary = playerTable[['Salary']]
# Get Objective (projected points)
ProjectedPoints = playerTable['ProjPts'].tolist()
# Make constraint values and directions
ConValGT = [1,1,1,1,1, 8] # Must have at least one of each position, and 8 total
ConValLT = [3,3,3,3,2, 8, 50000] # Can't have more than 3 of each position, 8 total, and 50000 salary
ConDirGT = [1] * (len(ConVec_Position.columns)) # Make >= directions for each position constraint
ConDirLT = [-1] * (len(ConVec_Position.columns) + 1) # Make <= directions for each position constraint, plus 1 for Salary
# Compile constraint vectors, values, and directions
ConVec_Final = pd.concat([ConVec_Position, ConVec_Position, ConVec_Salary], axis=1)
ConVal_Final = ConValGT + ConValLT
ConDir_Final = ConDirGT + ConDirLT
# Force all decisions to be binary
vLB = [0] * len(ProjectedPoints) # lower bound 0
vUB = [1] * len(ProjectedPoints) # upper bound 1
xint = [i+1 for i in range(len(ProjectedPoints))] # all decisions are integers (aka all are binary)
solveTimefix = time.time()
[obj, sol, duals] = lp_solve(ProjectedPoints,
ConVec_Final.T.values.tolist(),
ConVal_Final,
ConDir_Final,
vLB,
vUB,
xint)
solveTimeCurrRun = (time.time() - solveTimefix)
print("\n--- It took %s seconds to get all LUs ---\n" % round(solveTimeCurrRun, 2))
Any ideas on how to make my python
code run the same speed as R
?
On my craptop this takes about 4 ms, but importantly it is not apples-to-apples. I believe that you are missing a (large) constraint: each individual player can only be assigned to one position.
from time import perf_counter
import pandas as pd
import numpy as np
import scipy.sparse
from scipy.optimize import milp, LinearConstraint, Bounds
from numpy.random._generator import default_rng
rand = default_rng(seed=0)
# Make dummy dataframe with player positions, salaries, and projected points
n = 500
position_combos = ('PG', 'SG', 'SF', 'PF', 'C', 'PG/SG', 'SG/SF', 'PF/C')
salary_k = rand.integers(low=3, high=12, size=n)
players = pd.DataFrame(
{
'Positions': rand.choice(position_combos, n),
'Salary': salary_k * 1_000,
'ProjPts': salary_k * 5,
},
index=pd.RangeIndex(name='ID', stop=n),
) # 500x3
# Convert from slash-separated position option strings to a dense ID/Position multi-index
exploded_positions = players.Positions.str.split('/', expand=True) # 500x2
position_values = np.array(('PG', 'SG', 'SF', 'PF', 'C'))
position_sparse = pd.DataFrame(
(
exploded_positions.values[:, :, np.newaxis]
== position_values[np.newaxis, np.newaxis, :]
).any(axis=1),
index=players.index,
columns=pd.Index(name='Position', data=position_values),
) # 500x5
position_sparse[~position_sparse] = np.nan
positions = position_sparse.stack().index # 685x2
# Sparse 2D position assignment matrix with guaranteed single one per column
position_assignments = np.equal.outer(
positions.get_level_values('Position'),
position_values,
).astype(float).T # 5x685
# Sparse 2D player assignment matrix with guaranteed single one per column
player_assignments = np.equal.outer(
positions.get_level_values('ID'),
players.index,
).astype(float).T # 500x685
# Dense salary assignment (row) with repeats according to player ID
salary_assignments = players.Salary[positions.get_level_values('ID')] # 685
# Cost values: negative projected points, only where a player can take a given position;
# multi-indexed by ID and Position
c = -players.ProjPts[positions.get_level_values('ID')].astype(float)
c.index = positions # 685
integrality = np.ones(shape=c.size, dtype=np.uint8) # 685
bounds = Bounds(
lb=np.zeros_like(c),
ub=np.ones_like(c), # 685
)
A = scipy.sparse.csc_array(
scipy.sparse.vstack((
scipy.sparse.csr_array(np.ones_like(c)), # The total number of selected players must be fixed
scipy.sparse.csr_array(salary_assignments), # There is a maximum salary
scipy.sparse.csr_array(position_assignments), # Each position count has a min and max
scipy.sparse.csr_array(player_assignments), # One player cannot occupy multiple positions
), format='csc') # this is what milp uses; see _milp_iv()
) # 507x685
lb, ub = np.block([
[8, -np.inf, 1, 1, 1, 1, 1, np.full(shape=n, fill_value=-np.inf)],
[8, 50_000, 3, 3, 3, 3, 2, np.ones(n)],
])
t0 = perf_counter()
result = milp(
c=c, integrality=integrality, bounds=bounds,
constraints=(LinearConstraint(A=A, lb=lb, ub=ub),),
)
t1 = perf_counter()
assert result.success, result.message
print(result.message)
print(f'Solution in {1e3*(t1-t0):.1f} ms')
outputs = pd.Series(name='Assigned', data=result.x, index=positions)
selections = outputs[outputs > 0.5].index
merged_outputs = players.loc[selections.get_level_values('ID')].set_index(selections)
print(merged_outputs, end='\n\n')
print('Total:')
print(merged_outputs[['Salary', 'ProjPts']].sum().to_string())
This uses HiGHS in the backend; I have not attempted lpsolve55
. The output is
Optimization terminated successfully. (HiGHS Status 7: Optimal)
Solution in 3.4 ms
Positions Salary ProjPts
ID Position
314 SF SF 10000 50
390 PG PG 3000 15
447 SG SG 3000 15
467 SF SF 10000 50
480 PF PF 7000 35
484 SG SG 3000 15
487 C C 7000 35
494 C C 7000 35
Total:
Salary 50000
ProjPts 250
You could also try running on PuLP/CBC (I have not shown that here).