Tags: python, scikit-learn, statistics, outliers, anomaly-detection

What does setting the 'contamination' parameter to 'auto' in Sklearn Outlier Detection methods do?


I have a dataset where I need to control how strictly an outlier detection model (Isolation Forest, Elliptic Envelope, OneClassSVM...) decides whether a given point is an outlier, similar to choosing a Z-score or IQR cutoff. In other words, I do not want to specify in advance the percentage of outliers in my dataset (the contamination parameter); I want that percentage to follow from how "picky" I want my model to be. Is this what setting the contamination parameter to 'auto' does?

Here's what the scikit-learn documentation says about this: "if ‘auto’, the threshold is determined as in the original paper".

Which original paper does this refer to? And does setting the contamination parameter to 'auto' solve my problem?


Solution

  • I looked at the paper without much success, but reading the code gave me the answer. Note this part of the IsolationForest implementation:

        if self.contamination == "auto":
            # 0.5 plays a special role as described in the original paper.
            # we take the opposite as we consider the opposite of their score.
            self.offset_ = -0.5
            return self
    
        # else, define offset_ wrt contamination parameter
        self.offset_ = np.percentile(self.score_samples(X),
                                     100. * self.contamination)
    

    You can check the full implementation here.

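    When you set contamination='auto', the offset_ value, which determines the predictions of your model, is fixed at -0.5. If you pass a float instead, offset_ is set to the corresponding percentile of the training scores, so that the requested fraction of training points falls below it. The prediction is then made by comparing score_samples(X) against offset_: points whose score is below it are labeled outliers (-1). So with 'auto' the model does not target a fixed percentage; the share of points flagged as outliers follows from how many scores fall below -0.5.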
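
To make the difference concrete, here is a minimal sketch for IsolationForest only (the other estimators mentioned in the question handle their thresholds differently). The synthetic data, the random seeds, and the cutoff `custom_threshold = -0.6` are assumptions chosen purely for illustration. It compares `offset_` under both settings and shows one possible way to control how "picky" the model is: thresholding `score_samples` yourself instead of relying on `contamination`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data (an assumption): a Gaussian cluster plus a few scattered points.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.uniform(-6, 6, size=(10, 2)),
])

auto_model = IsolationForest(contamination="auto", random_state=0).fit(X)
fixed_model = IsolationForest(contamination=0.1, random_state=0).fit(X)

# With 'auto' the offset is the constant -0.5; with a float it is the
# matching percentile of the training scores.
print("offset_ with 'auto':", auto_model.offset_)
print("offset_ with 0.1:   ", fixed_model.offset_)

# predict() is equivalent to thresholding score_samples at offset_.
scores = auto_model.score_samples(X)
manual_labels = np.where(scores - auto_model.offset_ < 0, -1, 1)

# Hypothetical custom cutoff: lower values flag fewer points as outliers.
custom_threshold = -0.6
custom_labels = np.where(scores < custom_threshold, -1, 1)

print("flagged with 'auto':      ", int((manual_labels == -1).sum()))
print("flagged with custom cutoff:", int((custom_labels == -1).sum()))
```

Lower (more negative) scores mean more anomalous points, so moving the cutoff below -0.5 makes the model less picky and moving it above -0.5 makes it more picky, without fixing the share of outliers in advance.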