I came across an interesting question and would love to hear your interpretation.
import pandas as pd
from patsy import dmatrix, demo_data

df = pd.DataFrame(demo_data("a", "b", "x1", "x2", "y", "z column"))

# Standardize x2 with patsy ("+ 0" drops the intercept column)
Patsy_Standardize_Output = dmatrix("standardize(x2) + 0", df).ravel()

# Standardize x2 by hand with pandas
output = (df['x2'] - df['x2'].mean()) / df['x2'].std()
Pandas_Standardize_Output = output.to_numpy()
If you print the two standardized x2 columns, you will find that the results are quite different:
Patsy_Standardize_Output = [-1.21701061, -0.07791372, -0.66884723, 2.23584028, 0.69898536, -0.71843674, -0.00416815, -0.2484492 ]
Pandas_Standardize_Output = [-1.13840918, -0.07288161, -0.62564929, 2.09143707, 0.65384094, -0.67203603, -0.00389895, -0.23240294]
My question is: since I standardized the same column both times, why are the results different?
I look forward to hearing your interpretation. Thank you so much for your time and help!
pandas' std() applies Bessel's correction, i.e. it divides by n - 1 (ddof=1), while most other libraries, including numpy and patsy, divide by n (ddof=0). The difference becomes negligible once you have several dozen points, but for small samples the correction is a very reasonable thing to do.
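To make the difference concrete, here is a small sketch (not from the original question, and only assuming numpy and pandas) showing how the two defaults diverge and how they line up once you ask for the same ddof:

import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas default: sample standard deviation, divides by n - 1 (ddof=1)
print(x.std())         # 1.2909944...

# numpy default: population standard deviation, divides by n (ddof=0)
print(x.values.std())  # 1.1180339...

# the two agree once the same ddof is used on both sides
print(np.isclose(x.std(ddof=0), x.values.std()))  # True
print(np.isclose(x.std(), x.values.std(ddof=1)))  # True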
Proof: if you replace df['x2'].std() with the numpy version (df['x2'].values.std(), which defaults to ddof=0), the results will match.
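For completeness, a sketch of that check, reusing df and Patsy_Standardize_Output from the question above:

import numpy as np

# numpy's .std() defaults to ddof=0, which is what patsy's standardize() uses
Pandas_Fixed = ((df['x2'] - df['x2'].mean()) / df['x2'].values.std()).to_numpy()

print(np.allclose(Patsy_Standardize_Output, Pandas_Fixed))  # True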