I'm converting code from SAS to PySpark and trying to replicate a call to PROC ANOVA. The meanSquaredError from a LinearRegression fit doesn't seem to be correct: when I use summary.predictions to calculate it by hand, I get the right value. Does anyone know if this is a bug, or if it's calculated differently somehow? The value from meanSquaredError is 0.01980 vs 0.02004 in SAS. I can work up an example if that's helpful.
Edited to add an example:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.sql.functions import expr
# df is the sashelp.class data set from SAS
# StringIndexer for sex
indexer = StringIndexer(inputCol="sex", outputCol="sex_index")
df_indexed = indexer.fit(df).transform(df)
# OneHotEncoder for sex_index
encoder = OneHotEncoder(inputCol="sex_index", outputCol="sex_ohe", dropLast=True)
df_encoded = encoder.fit(df_indexed).transform(df_indexed)
# VectorAssembler with the one-hot encoded sex (weight stays as the label)
assembler = VectorAssembler(inputCols=["sex_ohe"], outputCol="features")
df_features = assembler.transform(df_encoded).select("features", "weight")
# Fit a Linear Regression model
lr = LinearRegression(featuresCol="features", labelCol="weight", fitIntercept=True)
lr_model = lr.fit(df_features)
# Calculate degrees of freedom
n = df_features.count()
k = df_encoded.select("sex_ohe").first()["sex_ohe"].size # one-hot sex
df_regression = k
df_residual = n - k - 1
df_total = n - 1
# Get the sums of squares
ss = lr_model.summary.predictions\
.withColumn("mean_pred", expr("mean(weight) over()"))\
.selectExpr("round(sum(POWER(prediction - weight, 2)), 5) as sse",
"round(sum(POWER(mean_pred - weight, 2)), 5) as sst",
"round(sum(POWER(mean_pred - prediction, 2)), 5) as ssr")
row = ss.first()
sse = row["sse"]
sst = row["sst"]
ssr = row["ssr"]
# Calculate F-statistic
msr = ssr / df_regression
mse = sse / df_residual
f_value = round(msr / mse, 2)
anova_table = {
"Degrees of Freedom Regression": df_regression,
"Sums of Squares Regression": ssr,
"Mean Square Regression": round(msr, 5),
"Degrees of Freedom Residual": df_residual,
"Sums of Squares Error": sse,
"Mean Squared Error": round(mse, 5),
"Degrees of Freedom Total": df_total,
"Sums of Squares Total": sst,
"F-Statistic": f_value
}
print(lr_model.summary.meanSquaredError)
anova_table
The results are:
402.8744152046783
{'Degrees of Freedom Regression': 1,
'Sums of Squares Regression': 1681.12295,
'Mean Square Regression': 1681.12295,
'Degrees of Freedom Residual': 17,
'Sums of Squares Error': 7654.61389,
'Mean Squared Error': 450.27141,
'Degrees of Freedom Total': 18,
'Sums of Squares Total': 9335.73684,
'F-Statistic': 3.73}
This is expected behavior, not a bug: PySpark's .summary.meanSquaredError does not use the residual degrees of freedom. It calculates MSE as [ meanSquaredError = SSE / n ], where n is simply the number of observations, not n - p - 1 (the residual degrees of freedom). So it is actually reporting the mean squared deviation rather than the classical mean squared error from an ANOVA table.
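You can verify this with the numbers above: SSE / n = 7654.61389 / 19 ≈ 402.87442, which matches the printed meanSquaredError, while SSE / (n - p - 1) = 7654.61389 / 17 ≈ 450.27141, which matches the Mean Squared Error in the ANOVA table (and SAS).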
SAS PROC ANOVA uses the residual degrees of freedom (n - p - 1) to estimate the error variance. PySpark's .summary.meanSquaredError is designed as a prediction-error metric, and assumes:
meanSquaredError = sum((y_i - ŷ_i)^2) / n
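If you just want the SAS-style value without rebuilding the whole ANOVA table, you can rescale the summary's metric. A minimal sketch, using n and k as defined in the question's code, and assuming your Spark version exposes summary.numInstances:
summary = lr_model.summary
n = summary.numInstances               # number of observations (19 here)
k = 1                                  # non-intercept parameters (the one-hot sex column)
sse = summary.meanSquaredError * n     # undo the divide-by-n to recover SSE
mse_classical = sse / (n - k - 1)      # classical MSE with residual degrees of freedom
print(round(mse_classical, 5))         # 450.27141, matching the ANOVA table
Recent Spark versions also expose summary.degreesOfFreedom, which should give you n - k - 1 directly.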