python, machine-learning, regression

Is it possible to apply a single regression technique to data that has different patterns?


I want to estimate the amount of sales for multiple different products depending on the temperature, which some of the products have a relationship with. For one of the products, the relationship between sales and temperature looks like this when plotted:

Plot made using matplotlib.pyplot in Python

This is just one product, but the general trend here is that sales increase once the temperature rises above 10 degrees. For other products the relationship might be more linear, others might have a polynomial kind of relationship, and some products might have no relationship at all. An example of a product with no correlation between sales and temperature could be this one:

matplotlib plot

First of all I wanted to predict something for just one product, so I used the product from the first plot to try to build a model. What I ended up doing was splitting the data into a dataframe with all the values from -5 degrees to 10 degrees and performing linear regression on that, and similarly splitting out the values from 10 degrees to 30 degrees and performing linear regression on those, like so:

Plot of the two separate linear fits
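
In code, what I did amounts to roughly this (a minimal sketch with made-up placeholder data standing in for my real product data; the column names and the breakpoint of 10 degrees are just what I chose by eye):

    import numpy as np
    import pandas as pd

    # placeholder stand-in for one product's data ('temperature' and 'sales' columns)
    rng = np.random.default_rng(0)
    temps = np.linspace(-5, 30, 200)
    sales = np.where(temps < 10,
                     50 + rng.normal(0, 5, temps.size),
                     50 + 8 * (temps - 10) + rng.normal(0, 5, temps.size))
    df = pd.DataFrame({'temperature': temps, 'sales': sales})

    # split at the breakpoint chosen by eye (10 degrees) and fit a straight line to each piece
    low = df[df['temperature'] < 10]
    high = df[df['temperature'] >= 10]
    slope_low, intercept_low = np.polyfit(low['temperature'], low['sales'], 1)
    slope_high, intercept_high = np.polyfit(high['temperature'], high['sales'], 1)

    def predict_sales(temp):
        # use whichever line covers the given temperature
        if temp < 10:
            return slope_low * temp + intercept_low
        return slope_high * temp + intercept_high

    print(predict_sales(20.0))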

The problem here is that I'm doing all sorts of things to fit my data for only ONE product. I have a dataset of 1000 products, and I'd want to be able to estimate sales for SOME of the products based on temperature. I want to somehow loop through all of my datasets, figure out which ones have some kind of relationship between sales and temperature, and then automatically apply the best regression model for each such product to estimate its amount of sales at a given temperature, X.
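
Roughly, this is the kind of loop I have in mind (just a sketch with made-up data; the per-product dataframes, the candidate polynomial degrees and the R-squared cut-off of 0.5 are placeholders, not something I know to be the right approach):

    import numpy as np
    import pandas as pd

    def fit_best_model(temps, sales, r2_threshold=0.5):
        # try a few candidate polynomial degrees and keep the one with the best R-squared;
        # return None when even the best fit explains too little of the variance
        best_r2, best_coeffs = -np.inf, None
        for degree in (1, 2, 3):
            coeffs = np.polyfit(temps, sales, degree)
            r2 = 1.0 - np.var(sales - np.polyval(coeffs, temps)) / np.var(sales)
            if r2 > best_r2:
                best_r2, best_coeffs = r2, coeffs
        return best_coeffs if best_r2 >= r2_threshold else None

    # placeholder data: product name -> dataframe with 'temperature' and 'sales' columns
    rng = np.random.default_rng(1)
    temps = np.linspace(-5, 30, 100)
    products = {
        'related_product': pd.DataFrame({'temperature': temps,
                                         'sales': 2 * temps + rng.normal(0, 3, temps.size)}),
        'unrelated_product': pd.DataFrame({'temperature': temps,
                                           'sales': rng.normal(50, 3, temps.size)}),
    }

    # keep a model only for products where some relationship was found
    models = {}
    for name, df in products.items():
        coeffs = fit_best_model(df['temperature'], df['sales'])
        if coeffs is not None:
            models[name] = coeffs

    X = 15.0  # estimate sales at this temperature for every product that got a model
    for name, coeffs in models.items():
        print(name, np.polyval(coeffs, X))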

I have looked at a bunch of different regression tutorials for neural networks, but I simply have no idea how to start, what to search for, or whether what I'm trying to do is even possible.


Solution

  • Here is an example of using scipy's differential_evolution genetic algorithm to fit a single data set with two different straight lines, automatically finding the breakpoint at which to switch from one line to the other. The scipy implementation of Differential Evolution uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, which requires bounds within which to search - in this example, those bounds are taken from the data max and min values. The example completes the fitting with a call to curve_fit() without passing any bounds, in case the optimal parameters lie outside the bounds used for the genetic algorithm.

    import numpy, scipy, matplotlib
    import matplotlib.pyplot as plt
    from scipy.optimize import curve_fit
    from scipy.optimize import differential_evolution
    import warnings
    
    xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
    yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])
    
    
    # piecewise-linear model: line A below the breakpoint, line B at or above it
    def func(xArray, breakpoint, slopeA, offsetA, slopeB, offsetB):
        returnArray = []
        for x in xArray:
            if x < breakpoint:
                returnArray.append(slopeA * x + offsetA)
            else:
                returnArray.append(slopeB * x + offsetB)
        return returnArray
    
    
    # function for genetic algorithm to minimize (sum of squared error)
    def sumOfSquaredError(parameterTuple):
        warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
        val = func(xData, *parameterTuple)
        return numpy.sum((yData - val) ** 2.0)
    
    
    def generate_Initial_Parameters():
        # min and max used for bounds
        maxX = max(xData)
        minX = min(xData)
        maxY = max(yData)
        minY = min(yData)
        slope = 10.0 * (maxY - minY) / (maxX - minX) # times 10 for safety margin
    
        parameterBounds = []
        parameterBounds.append([minX, maxX]) # search bounds for breakpoint
        parameterBounds.append([-slope, slope]) # search bounds for slopeA
        parameterBounds.append([minY, maxY]) # search bounds for offsetA
        parameterBounds.append([-slope, slope]) # search bounds for slopeB
        parameterBounds.append([minY, maxY]) # search bounds for offsetB
    
    
        result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
        return result.x
    
    # use the genetic algorithm result as the initial parameter estimate for curve_fit()
    geneticParameters = generate_Initial_Parameters()
    
    fittedParameters, pcov = curve_fit(func, xData, yData, p0=geneticParameters)
    print('Parameters:', fittedParameters)
    print()
    
    modelPredictions = func(xData, *fittedParameters) 
    
    absError = modelPredictions - yData
    
    SE = numpy.square(absError) # squared errors
    MSE = numpy.mean(SE) # mean squared errors
    RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
    Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
    
    print()
    print('RMSE:', RMSE)
    print('R-squared:', Rsquared)
    
    print()
    
    
    ##########################################################
    # graphics output section
    def ModelAndScatterPlot(graphWidth, graphHeight):
        f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
        axes = f.add_subplot(111)
    
        # first the raw data as a scatter plot
        axes.plot(xData, yData,  'D')
    
        # create data for the fitted equation plot
        xModel = numpy.linspace(min(xData), max(xData))
        yModel = func(xModel, *fittedParameters)
    
        # now the model as a line plot
        axes.plot(xModel, yModel)
    
        axes.set_xlabel('X Data') # X axis data label
        axes.set_ylabel('Y Data') # Y axis data label
    
        plt.show()
        plt.close('all') # clean up after using pyplot
    
    graphWidth = 800
    graphHeight = 600
    ModelAndScatterPlot(graphWidth, graphHeight)
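
With fittedParameters in hand, estimating y for a new x value (in the question's terms, sales at a given temperature X) is just a matter of evaluating func() at that value - X = 15.0 below is an arbitrary illustration:

    X = 15.0
    estimate = func([X], *fittedParameters)[0]  # func() expects an iterable of x values
    print('estimated y at x =', X, ':', estimate)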