Tags: pyspark, sas, anova

Looks like meanSquaredError from pyspark.ml.regression LinearRegression is incorrect


I'm converting code from SAS to PySpark and trying to replicate a call to proc anova. The meanSquaredError from a LinearRegression fit doesn't seem to be correct. When I use summary.predictions to calculate it by hand, I get the right value. Does anyone know if this is a bug, or if it's calculated differently somehow? The value from meanSquaredError is 0.01980 vs. 0.02004 in SAS. I can work up an example if that's helpful.

Edited to add an example:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.sql.functions import expr

# df is the sashelp.class data set from SAS

# StringIndexer for sex
indexer = StringIndexer(inputCol="sex", outputCol="sex_index")
df_indexed = indexer.fit(df).transform(df)

# OneHotEncoder for sex_index
encoder = OneHotEncoder(inputCol="sex_index", outputCol="sex_ohe", dropLast=True)
df_encoded = encoder.fit(df_indexed).transform(df_indexed)

# VectorAssembler with one-hot encoded sex and weight
assembler = VectorAssembler(inputCols=["sex_ohe"], outputCol="features")
df_features = assembler.transform(df_encoded).select("features", "weight")

# Fit a Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="weight", fitIntercept=True)
lr_model = lr.fit(df_features)

# Calculate degrees of freedom
n = df_features.count()
k = df_encoded.select("sex_ohe").first()["sex_ohe"].size  # one-hot sex
df_regression = k
df_residual = n - k - 1
df_total = n - 1

# Get the sums of squares
ss = lr_model.summary.predictions\
    .withColumn("mean_pred", expr("mean(weight) over()"))\
    .selectExpr("round(sum(POWER(prediction - weight, 2)), 5) as sse",
                "round(sum(POWER(mean_pred - weight, 2)), 5) as sst",
                "round(sum(POWER(mean_pred - prediction, 2)), 5) as ssr")
# Collect the single-row aggregate once instead of calling toPandas() three times
ss_pd = ss.toPandas()
sse = ss_pd["sse"].iloc[0]
sst = ss_pd["sst"].iloc[0]
ssr = ss_pd["ssr"].iloc[0]

# Calculate F-statistic
msr = ssr / df_regression
mse = sse / df_residual
f_value = round(msr / mse, 2)

anova_table = {
    "Degrees of Freedom Regression": df_regression,
    "Sums of Squares Regression": ssr,
    "Mean Square Regression": round(msr, 5),
    "Degrees of Freedom Residual": df_residual,
    "Sums of Squares Error": sse,
    "Mean Squared Error": round(mse, 5),
    "Degrees of Freedom Total": df_total,
    "Sums of Squares Total": sst,
    "F-Statistic": f_value
}

print(lr_model.summary.meanSquaredError)
anova_table

The results are:

402.8744152046783
{'Degrees of Freedom Regression': 1,
 'Sums of Squares Regression': 1681.12295,
 'Mean Square Regression': 1681.12295,
 'Degrees of Freedom Residual': 17,
 'Sums of Squares Error': 7654.61389,
 'Mean Squared Error': 450.27141,
 'Degrees of Freedom Total': 18,
 'Sums of Squares Total': 9335.73684,
 'F-Statistic': 3.73}

SAS Results


Solution

  • This is expected behavior: PySpark's .summary.meanSquaredError does not divide by the residual degrees of freedom. It computes the value as [ meanSquaredError = SSE / n ], where n is simply the number of observations, not n - p - 1 (the residual degrees of freedom). So it is actually reporting the mean squared deviation of the residuals rather than the classical mean squared error you see in an ANOVA table.
    Note that this is not a bug.
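
    If you want the ANOVA-style MSE straight from the fitted model, here is a minimal sketch that continues from the lr_model in the example above; it assumes a reasonably recent Spark version, where LinearRegressionSummary exposes numInstances and degreesOfFreedom.

    # Recover the classical MSE from the model summary (sketch, reusing lr_model from the example)
    n = lr_model.summary.numInstances              # number of observations (19 for sashelp.class)
    sse = lr_model.summary.meanSquaredError * n    # meanSquaredError is SSE / n, so multiply n back in
    df_resid = lr_model.summary.degreesOfFreedom   # residual degrees of freedom: n - p - 1 with an intercept
    classical_mse = sse / df_resid                 # 402.8744... * 19 / 17 ≈ 450.27, matching the table above

    print(classical_mse)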