I have a dataset where I need to control how strictly the outlier detection model (Isolation Forest, Elliptic Envelope, OneClassSVM...) considers a given point an outlier (something similar to a Z-score or IQR score). This means that I do not want to specify in advance the percentage of outliers in my dataset, better known as the contamination parameter; instead, I want this percentage to depend on how "picky" I want my model to be. Is this the same as setting the contamination parameter to 'auto'?
Here's what the scikit-learn documentation says about this: "if ‘auto’, the threshold is determined as in the original paper".
Which original paper does this refer to? And does setting the contamination parameter to 'auto' solve my problem?
I was looking at the paper without much success, but looking at the code gave me the answer. Note this part of the implementation:
    if self.contamination == "auto":
        # 0.5 plays a special role as described in the original paper.
        # we take the opposite as we consider the opposite of their score.
        self.offset_ = -0.5
        return self

    # else, define offset_ wrt contamination parameter
    self.offset_ = np.percentile(self.score_samples(X),
                                 100. * self.contamination)
You can check the full implementation here.
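The two branches in the excerpt above can be checked empirically. The following is a minimal sketch, assuming a recent scikit-learn version and using synthetic Gaussian data that is made up purely for illustration; the variable names are mine, not part of the library.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data, for illustration only (an assumption, not from the question).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))

# Same forest (same random_state), two different contamination settings.
auto_model = IsolationForest(contamination="auto", random_state=0).fit(X)
fixed_model = IsolationForest(contamination=0.1, random_state=0).fit(X)

# With 'auto', offset_ is the constant -0.5.
print("offset_ with 'auto':", auto_model.offset_)

# With a float, offset_ is the matching percentile of the training scores.
print("offset_ with 0.1:   ", fixed_model.offset_)
print("10th percentile:    ", np.percentile(fixed_model.score_samples(X), 10))

# decision_function is score_samples shifted by offset_;
# predict flags a point as an outlier (-1) when that shifted score is negative.
shifted = fixed_model.score_samples(X) - fixed_model.offset_
flagged_fraction = np.mean(fixed_model.predict(X) == -1)
print("fraction flagged with 0.1:", flagged_fraction)
```

With a float contamination, the fraction flagged on the training data should land close to the value passed in; with 'auto' it is not constrained in that way.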
When you set contamination='auto', the offset_ value, which determines where the model draws the line between inliers and outliers, is fixed at -0.5. If you pass a float for the contamination parameter instead, offset_ is chosen so that approximately that fraction of the training data falls below it. So with 'auto' the percentage of outliers is not fixed in advance: it is whatever fraction of your points score below -0.5.
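If the goal is a tunable "pickiness" similar to a Z-score cutoff, one option is to skip predict and apply your own threshold to score_samples. The sketch below assumes Isolation Forest; the data, the threshold values, and the helper flag_outliers are hypothetical choices for illustration, not part of scikit-learn. Other estimators also expose score_samples, but their scores are on different scales, so the thresholds would not carry over.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data, for illustration only: a Gaussian cloud plus a few far-away points.
rng = np.random.RandomState(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
far_points = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([inliers, far_points])

model = IsolationForest(contamination="auto", random_state=42).fit(X)

# Lower score means more anomalous; for Isolation Forest the scores are negative.
scores = model.score_samples(X)


def flag_outliers(scores, threshold):
    """Hypothetical helper: mark points whose score falls below the threshold."""
    return scores < threshold


# Moving the threshold further below -0.5 makes the model less "picky".
# These threshold values are arbitrary examples.
counts = {}
for threshold in (-0.5, -0.6, -0.7):
    counts[threshold] = int(flag_outliers(scores, threshold).sum())
    print(f"threshold={threshold}: {counts[threshold]} points flagged")

# Using offset_ itself as the threshold reproduces what predict does.
same_as_predict = np.array_equal(
    flag_outliers(scores, model.offset_), model.predict(X) == -1
)
print("matches predict:", same_as_predict)
```

This keeps the fitted forest unchanged and moves only the cutoff, so you can try several levels of strictness without refitting.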