
How to find the importance of the features for a logistic regression model?


I have a binary prediction model trained with logistic regression. I want to know which features (predictors) are more important for the decision between the positive and the negative class. I know scikit-learn exposes a coef_ attribute, but I don't know whether it is enough to judge importance. I also don't know how to interpret the coef_ values in terms of importance for the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.

Let's say there are features such as tumor size, tumor weight, etc., used to classify a test case as malignant or not malignant. I want to know which of the features matter more for the malignant and not malignant predictions.


Solution

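  • One of the simplest ways to get a feeling for the "influence" of a given feature in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of that feature in the data. A raw coefficient is the change in log-odds per unit of the feature, so it depends on the feature's units; multiplying by the standard deviation puts all features on a comparable scale.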

    Consider this example:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Three features with equal true weights (1) but different spreads
    x1 = np.random.randn(100)
    x2 = 4*np.random.randn(100)
    x3 = 0.5*np.random.randn(100)
    # Per-sample noise: randn(100), since randn() alone draws a single scalar
    y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
    X = np.column_stack([x1, x2, x3])

    m = LogisticRegression()
    m.fit(X, y)

    # The true weights are all 1, so the estimated coefficients
    # should come out on a roughly similar scale:
    print(m.coef_)

    # Scaled by each feature's standard deviation, however, the values
    # show that the second feature is the most influential
    print(np.std(X, 0)*m.coef_)
    

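    An alternative way to get a similar result is to fit the model on standardized features and examine its coefficients (similar rather than identical, because the default L2 regularization acts differently once the features are rescaled):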

    m.fit(X / np.std(X, 0), y)
    print(m.coef_)
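
The same idea can be sketched with a StandardScaler inside a pipeline, which also helps with reading the signs: in a binary model, a positive coefficient raises the log-odds of the class stored in classes_[1], and a negative one lowers it. This is only an illustration on synthetic data; the data-generating setup and the feature names x1, x2, x3 are assumptions, not part of the question's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): equal true weights,
# different spreads, labels drawn from a logistic model.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(size=n),        # x1, std about 1
    4 * rng.normal(size=n),    # x2, std about 4
    0.5 * rng.normal(size=n),  # x3, std about 0.5
])
logits = X.sum(axis=1)
y = rng.random(n) < 1 / (1 + np.exp(-logits))

# Standardize the features, then fit the classifier on the scaled values
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

clf = model.named_steps["logisticregression"]
std_coef = clf.coef_[0]        # change in log-odds per one-std increase
odds_ratio = np.exp(std_coef)  # multiplicative change in the odds

feature_names = ["x1", "x2", "x3"]  # hypothetical labels
positive_class = clf.classes_[1]    # class that positive coefficients favor

# List features from largest to smallest absolute standardized coefficient
for i in np.argsort(-np.abs(std_coef)):
    direction = "raises" if std_coef[i] > 0 else "lowers"
    print(f"{feature_names[i]}: coef={std_coef[i]:+.2f}, "
          f"odds ratio={odds_ratio[i]:.2f} "
          f"({direction} the odds of class {positive_class})")
```

Sorting by absolute value ranks the features by influence, while the sign tells you which class each feature pushes toward.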
    

    Note that this is the most basic approach; a number of other techniques for assessing feature importance or parameter influence exist (p-values, bootstrap scores, various "discriminative indices", etc.).
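
As a rough sketch of the bootstrap idea mentioned above: refit the model on rows resampled with replacement and look at how much each standardized coefficient varies. The synthetic data, the helper name standardized_coefs, the 200 resamples and the 95% percentile interval are all assumptions chosen for illustration, not a recommended recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration), same shape as in the answer
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.normal(size=n),
    4 * rng.normal(size=n),
    0.5 * rng.normal(size=n),
])
y = rng.random(n) < 1 / (1 + np.exp(-X.sum(axis=1)))


def standardized_coefs(X, y):
    """Hypothetical helper: fit on standardized features, return the coefficients."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X, y)
    return model.named_steps["logisticregression"].coef_[0]


n_boot = 200
boot = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    boot[b] = standardized_coefs(X[idx], y[idx])

# Percentile interval for each standardized coefficient
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
for name, lo, hi in zip(["x1", "x2", "x3"], lower, upper):
    print(f"{name}: 95% bootstrap interval [{lo:+.2f}, {hi:+.2f}]")
```

A feature whose interval stays well away from zero has an influence that is stable across resamples; an interval straddling zero suggests the estimated effect may be noise.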

    I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.