
Python Regression Tree interpretation


This is my code:

from sklearn.tree import DecisionTreeRegressor, export_text, plot_tree
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

regr = DecisionTreeRegressor(max_depth=2)
regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('DT: mse = ' + str(mse) + ' r2 = ' + str(r2))
# -> result: DT: mse = 0.6600129794020736 r2 = 0.46983848613583734

sTree = export_text(regr, feature_names=list(X_train.columns))
# here is the mistake
# it says: 'numpy.ndarray' object has no attribute 'columns'

plt.figure()
plot_tree(regr, filled=True, feature_names=list(X.columns), fontsize=9)
plt.savefig('tree.pdf')

I have two questions. First, there is an error on the sTree line, as marked in the comment; it would be great if you could tell me what my mistake is there. Second, I do not know whether this regression tree is good and efficient. How do I interpret a regression tree?


Solution

  • Addressing your comment: MSE is the average of the squared differences between the predicted and the actual values. Intuitively, it is the average squared error your model makes; the closer to zero, the better. R2 is the proportion of the variance in the target that can be explained by your model; the closer to 1, the better your model is.
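    As a minimal sketch with made-up numbers (the y_true and y_pred arrays below are hypothetical), both metrics can be computed by hand, matching sklearn's definitions:

    import numpy as np

    # hypothetical toy values, just to illustrate the definitions
    y_true = np.array([3.0, 2.5, 4.0, 5.5])
    y_pred = np.array([2.8, 2.7, 3.5, 5.0])

    # MSE: the mean of the squared errors
    mse = np.mean((y_true - y_pred) ** 2)

    # R2: 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot

    print('mse =', mse, 'r2 =', r2)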

    Now, the first error pops up because X_train is apparently a numpy ndarray rather than a pandas DataFrame, so it has no columns attribute. If X_train is a structured array, you can use X_train.dtype.names; for a plain 2D array, pass in the column names of the original DataFrame instead, just as your plot_tree line already does with list(X.columns).
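    Assuming X is still the original DataFrame (your plot_tree line suggests it is), a minimal fix could look like this:

    # reuse the column names from the original DataFrame
    sTree = export_text(regr, feature_names=list(X.columns))
    print(sTree)

    # or, if X_train happens to be a structured array:
    # sTree = export_text(regr, feature_names=list(X_train.dtype.names))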

    Second, the performance metrics used to decide how well your model fits your data are entirely up to you and the context of the problem you are facing. For example, in natural sciences such as physics you may want an R2 above 0.9, while in social sciences such as economics an R2 of 0.5 may suffice. As I said, it depends on the context of your data. Your mse is 0.66, which to me sounds reasonable, but maybe this error is huge given the range of your dependent variable.

    One thing you can do is test the stability of your results. If the R2 on the training and testing data differs greatly, your decision tree is probably overfitting on the training dataset. To solve this, you may want to increase your test size, use k-fold cross-validation, retrain your model with another set of hyperparameters, switch to another algorithm, add more features, or reduce the dimensionality of your problem by dropping irrelevant independent variables. A quick way to run this check is sketched below.
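    For example, assuming X and y are the full (pre-split) feature matrix and target:

    from sklearn.model_selection import cross_val_score

    # compare the training score with the test score
    print('train r2 =', regr.score(X_train, y_train))
    print('test  r2 =', regr.score(X_test, y_test))

    # 5-fold cross-validated R2; a large spread across folds is
    # another sign that the model is unstable
    scores = cross_val_score(regr, X, y, cv=5, scoring='r2')
    print('cv r2 = %.3f +/- %.3f' % (scores.mean(), scores.std()))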

    I recommend assessing the MSE in the context of your dependent variable and comparing the performance metrics of multiple algorithms, such as a linear regression vs. random forest vs. gradient boosting, and choosing the one you like most.
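    As a rough sketch of such a comparison (the model choices and random_state here are illustrative, not a recommendation):

    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    models = {
        'linear regression': LinearRegression(),
        'random forest': RandomForestRegressor(random_state=0),
        'gradient boosting': GradientBoostingRegressor(random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name, 'mse =', mean_squared_error(y_test, pred),
              'r2 =', r2_score(y_test, pred))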

    Take a look at all these performance metrics (available in sklearn.metrics).