apache-spark machine-learning pyspark apache-spark-mllib

PySpark LinearRegressionWithSGD, model predict dimensions mismatch


I've come across the following error:

AssertionError: dimension mismatch

I trained a linear regression model using PySpark's LinearRegressionWithSGD. However, when I try to make a prediction on the training set, I get a "dimension mismatch" error.

Worth mentioning:

  1. The data was scaled using StandardScaler, but the predicted value was not.
  2. As the code shows, the features used for training were generated by PCA.

Some code:

pca_transformed = pca_model.transform(data_std)
X = pca_transformed.map(lambda x: (x[0], x[1]))
data = train_votes.zip(pca_transformed)
labeled_data = data.map(lambda x : LabeledPoint(x[0], x[1:]))
linear_regression_model = LinearRegressionWithSGD.train(labeled_data, iterations=10)

The error is raised by the prediction call. These are the variations I tried:

pred = linear_regression_model.predict(pca_transformed.collect())
pred = linear_regression_model.predict([pca_transformed.collect()])    
pred = linear_regression_model.predict(X.collect())
pred = linear_regression_model.predict([X.collect()])

The regression weights:

DenseVector([1.8509, 81435.7615])

The vectors used:

pca_transformed.take(1)
[DenseVector([-0.1745, -1.8936])]

X.take(1)
[(-0.17449817243564397, -1.8935926689554488)]

labeled_data.take(1)
[LabeledPoint(22221.0, [-0.174498172436,-1.89359266896])]

Solution

  • This worked:

    pred = linear_regression_model.predict(pca_transformed)
    

    pca_transformed is an RDD, so predict maps over it and returns an RDD of predictions.

    The function handles RDDs and arrays differently:

    def predict(self, x):
        """
        Predict the value of the dependent variable given a vector or
        an RDD of vectors containing values for the independent variables.
        """
        if isinstance(x, RDD):
            return x.map(self.predict)
        x = _convert_to_vector(x)
        return self.weights.dot(x) + self.intercept
    

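    When a plain list or array is passed instead, a dimension mismatch can occur, as in the error in the question.

    As the code shows, if x is not an RDD, it is converted to a vector. collect() and take(1) return a list of vectors, so the whole list is converted as if it were one vector, and its dimensions no longer match the weights. The dot product only works on a single feature vector, such as x[0].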
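
The shape problem can be illustrated without Spark. This is a minimal NumPy sketch, not the PySpark internals: the weights are copied from the question, the variable names are made up for illustration, and NumPy raises a ValueError where PySpark raises an AssertionError.

```python
import numpy as np

# Weights copied from the question; everything else is illustrative.
weights = np.array([1.8509, 81435.7615])

# A list holding one 2-element vector becomes a 2-D array of shape (1, 2).
rows = np.array([[-0.1745, -1.8936]])
# Indexing the first element gives the actual feature vector, shape (2,).
single = rows[0]

try:
    weights.dot(rows)      # (2,) against (1, 2): shapes are not aligned
    mismatch = False
except ValueError:
    mismatch = True

# The same dot product succeeds once a single vector is passed.
prediction = weights.dot(single)
```

The point is only that a list of vectors and a single vector have different shapes, even when the list has one element.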

    Here is the error reproduced:

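    from pyspark.mllib.linalg import _convert_to_vector  # private helper used by predict
    j = _convert_to_vector(pca_transformed.take(1))  # take(1) returns a list of one vector
    # Raises AssertionError: dimension mismatch
    linear_regression_model.weights.dot(j) + linear_regression_model.intercept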
    

    This works just fine:

    j = _convert_to_vector(pca_transformed.take(1))
    # j[0] is the single feature vector, so its length matches the weights
    linear_regression_model.weights.dot(j[0]) + linear_regression_model.intercept
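
If the vectors have already been collected into a local list, one option is to predict one vector at a time instead of passing the whole list. The sketch below mimics that with NumPy only. It assumes an intercept of 0.0, uses a hypothetical predict_one helper in place of the model's predict, and the second data row is made up for illustration.

```python
import numpy as np

weights = np.array([1.8509, 81435.7615])  # weights from the question
intercept = 0.0                            # assumption: model trained without an intercept


def predict_one(x):
    """Hypothetical stand-in for predict on one feature vector."""
    return float(weights.dot(np.asarray(x, dtype=float)) + intercept)


# Local rows, as collect() would return them; the second row is invented.
local_rows = [(-0.1745, -1.8936), (0.5, 0.25)]

# One prediction per row, never the whole list at once.
per_row = [predict_one(row) for row in local_rows]

# Equivalent batched form: (n, 2) matrix times (2,) weight vector.
batched = np.asarray(local_rows).dot(weights) + intercept
```

With the actual model, the analogous call would presumably be a list comprehension over pca_transformed.collect() that calls linear_regression_model.predict on each vector, although passing the RDD directly, as shown above, keeps the work distributed.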