python, linear-programming, quadratic-programming, quadprog

Find linear combination of vectors that is the best fit for a target vector


I am trying to find weights for a number of forecasts so that their weighted combination is as close as possible (in mean-squared-error terms, say) to a known target.

Here is a simplified example showing three different types of forecast across four data points:

target = [1.0, 1.02, 1.01, 1.04]  # all approx 1.0
forecasts = [
    [0.9, 0.91, 0.92, 0.91],  # all approx 0.9
    [1.1, 1.11, 1.13, 1.11],  # all approx 1.1
    [1.21, 1.23, 1.21, 1.23]  # all approx 1.2
]

where one forecast is always approximately 0.9, one is always approximately 1.1 and one is always approximately 1.2.

I'd like an automated way of finding weights of approximately [0.5, 0.5, 0.0] for the three forecasts because averaging the first two forecasts and ignoring the third is very close to the target. Ideally the weights would be constrained to be non-negative and sum to 1.

I think I need to use some form of linear programming or quadratic programming to do this. I have installed the Python quadprog library, but I'm not sure how to translate this problem into the form that solvers like this require. Can anyone point me in the right direction?


Solution

  • If I understand you correctly, you want to model an optimization problem and solve it. In the general case (without any constraints), your problem is essentially the ordinary least squares problem (which you can solve with scikit-learn, for example).
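To illustrate the unconstrained case, plain NumPy is enough; a quick sketch with `np.linalg.lstsq` on your data, where each column of the design matrix is one forecast:

```python
import numpy as np

forecasts = np.array([
    [0.9, 0.91, 0.92, 0.91],
    [1.1, 1.11, 1.13, 1.11],
    [1.21, 1.23, 1.21, 1.23],
])
target = np.array([1.0, 1.02, 1.01, 1.04])

# Solve min ||A w - b||^2 with A = forecasts.T (one forecast per column).
weights, residuals, rank, sv = np.linalg.lstsq(forecasts.T, target, rcond=None)
print(weights)  # unconstrained: entries may be negative and need not sum to 1
```

Without constraints the weights are generally not a convex combination, which is why the constrained formulation below is needed.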

    I recommend the cvxpy library for modeling optimization problems. It's a convenient way to express a convex optimization problem, and you can choose which solver runs in the background.

    Expanding cvxpy's least squares example by adding the constraints you mentioned:

    # Import packages.
    import cvxpy as cp
    import numpy as np
    
    # Generate data.
    m = 20
    n = 15
    np.random.seed(1)
    A = np.random.randn(m, n)
    b = np.random.randn(m)
    
    # Define and solve the CVXPY problem.
    x = cp.Variable(n)
    cost = cp.sum_squares(A @ x - b)
    prob = cp.Problem(cp.Minimize(cost), [x >= 0, cp.sum(x) == 1])
    prob.solve()
    
    # Print result.
    print("\nThe optimal value is", prob.value)
    print("The optimal x is")
    print(x.value)
    print("The norm of the residual is ", cp.norm(A @ x - b, p=2).value)
    

    In this example, A is the matrix whose columns are your forecast vectors, x is the variable holding the weights, and b is the known target.

    EDIT: example with your data:

    forecasts = np.array([
        [0.9, 0.91, 0.92, 0.91],
        [1.1, 1.11, 1.13, 1.11],
        [1.21, 1.23, 1.21, 1.23]
    ])
    
    target = np.array([1.0, 1.02, 1.01, 1.04])
    x = cp.Variable(forecasts.shape[0])
    cost = cp.sum_squares(forecasts.T @ x - target)
    prob = cp.Problem(cp.Minimize(cost), [x >= 0, cp.sum(x) == 1])
    prob.solve()
    print("\nThe optimal value is", prob.value)
    print("The optimal x is")
    print(x.value)
    

    Output:

    The optimal value is 0.0005306233766233817
    The optimal x is
    [ 6.52207792e-01 -1.45736370e-24  3.47792208e-01]
    

    The result is approximately [0.65, 0, 0.35], which differs from the [0.5, 0.5, 0.0] you mentioned: under the squared-error objective, mixing the 0.9 and 1.2 forecasts fits the target slightly better than averaging the first two. Which answer is "right" depends on how you define your problem; this is the solution for the least squares error.
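You can confirm with plain NumPy that the solver's answer really does beat the intuitive [0.5, 0.5, 0.0] on squared error:

```python
import numpy as np

forecasts = np.array([
    [0.9, 0.91, 0.92, 0.91],
    [1.1, 1.11, 1.13, 1.11],
    [1.21, 1.23, 1.21, 1.23],
])
target = np.array([1.0, 1.02, 1.01, 1.04])

def sse(weights):
    """Sum of squared errors of the weighted forecast combination."""
    return float(np.sum((forecasts.T @ np.asarray(weights) - target) ** 2))

sse_intuitive = sse([0.5, 0.5, 0.0])   # about 0.00123
sse_solver = sse([0.652, 0.0, 0.348])  # about 0.00053
print(sse_intuitive, sse_solver)
```

So the solver's weights roughly halve the squared error compared to the hand-picked ones.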