python pandas data-processing patsy standardization

Standardization result differs between Patsy and pandas - Python


I ran into an interesting discrepancy and would love to hear your interpretation.

import pandas as pd
from patsy import dmatrix, demo_data

df = pd.DataFrame(demo_data("a", "b", "x1", "x2", "y", "z column"))

# Standardize x2 with patsy ("+ 0" drops the intercept column)
Patsy_Standarlize_Output = dmatrix("standardize(x2) + 0", df).ravel()
# Standardize x2 by hand with pandas
output = (df['x2'] - df['x2'].mean()) / df['x2'].std()
Pandas_Standarlize_Output = output.to_numpy()

If you print the two standardized versions of the x2 column, you will find that they are noticeably different. The results are as follows:

Patsy_Standarlize_Output = [-1.21701061, -0.07791372, -0.66884723, 2.23584028, 0.69898536, -0.71843674, -0.00416815, -0.2484492 ]

Pandas_Standarlize_Output = [-1.13840918, -0.07288161, -0.62564929, 2.09143707, 0.65384094, -0.67203603, -0.00389895, -0.23240294]

My question: since I standardized the same column in both cases, why are the results different?

I am looking forward to hearing your interpretation. Thank you so much for your time and help!


Solution

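  • pandas' std() applies Bessel's correction by default (ddof=1, i.e. it divides by n - 1), while patsy's standardize() and NumPy's std() default to ddof=0 (dividing by n). It practically doesn't matter once you have several dozen points, but for small samples the correction is a very reasonable thing to do. Here n = 8, so the two outputs differ by a constant factor of sqrt(8/7) ≈ 1.069.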

    Proof: if you replace df['x2'].std() with the NumPy version (df['x2'].values.std()) or with df['x2'].std(ddof=0), the results will match.
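
    The effect can be reproduced without patsy. The sketch below is only an illustration: the values in x are made up (they are not the demo_data from the question), and it compares the two ddof settings of pandas' std() directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, NOT the demo_data column from the question.
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

sample_std = x.std()            # pandas default: ddof=1, divides by n - 1
population_std = x.std(ddof=0)  # divides by n, same as NumPy's default

# Standardize the same column with each definition of the standard deviation.
z_pandas_default = (x - x.mean()) / sample_std
z_population = (x - x.mean()) / population_std

# The two standard deviations differ by the constant factor sqrt(n / (n - 1)).
ratio = sample_std / population_std

print(ratio)
print(np.allclose(z_population, z_pandas_default * ratio))
```

    With 8 points the ratio is sqrt(8/7), which is the same factor that separates the two arrays in the question. If you would rather make patsy match pandas than the other way around, patsy's standardize() also accepts a ddof argument.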