
How can I update a trained IsolationForest model with new datasets/dataframes in Python?


Let's say I fit the IsolationForest() algorithm from scikit-learn on a time-series-based Dataset1 (dataframe df1) and save the model using the methods mentioned here & here. Now I want to update my model with a new Dataset2 (df2).

My findings:

...learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time, there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve tuning.

Sadly, the IsolationForest (IF) estimator doesn't support estimator.partial_fit(newdf).
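
A quick way to confirm this is to check for the method with hasattr. This is only a minimal sketch; SGDClassifier is not part of the question and is included purely as a contrasting example of an estimator that does implement partial_fit.

```python
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import SGDClassifier  # contrast only, not from the question

# IsolationForest exposes fit() but no partial_fit()
has_partial_fit_if = hasattr(IsolationForest(), "partial_fit")

# SGDClassifier is one of the estimators that supports incremental learning
has_partial_fit_sgd = hasattr(SGDClassifier(), "partial_fit")

print("IsolationForest.partial_fit available:", has_partial_fit_if)
print("SGDClassifier.partial_fit available:", has_partial_fit_sgd)
```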

How can I update the IF model that was trained on Dataset1 and saved, using a new Dataset2?


Solution

  • You can simply call .fit() again on the new data. Note that this retrains the estimator from scratch: the trees learned from the old data are discarded.

    This would be preferred, especially in a time series, as the signal changes and you do not want older, non-representative data to be understood as potentially normal (or anomalous).

    If old data is important, you can simply join the older training data and newer input signal data together, and then call .fit() again.

    As a side note, the scikit-learn documentation has suggested joblib as a more efficient alternative to pickle for models that carry large NumPy arrays.

    An MRE with resources below:

    # Model
    from sklearn.ensemble import IsolationForest
    
    # Saving file
    import joblib
    
    # Data
    import numpy as np
    
    # Create a new model
    model = IsolationForest()
    
    # Generate some old data
    df1 = np.random.randint(1,100,(100,10))
    # Train the model
    model.fit(df1)
    
    # Save it off
    joblib.dump(model, 'isf_model.joblib')
    
    # Load the model
    model = joblib.load('isf_model.joblib')
    
    # Generate new data
    df2 = np.random.randint(1,500,(1000,10))
    
    # If the original data is no longer important, just call .fit() again.
    # This refits from scratch, so the model reflects only df2.
    # For time-series data this is preferred, as older data may not be representative of the current state.
    model.fit(df2)
    
    # If the original data is still important, join the old data to the new data and refit. There are multiple options for this:
    # Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
    # Numpy: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html
    
    combined_data = np.concatenate((df1, df2))
    model.fit(combined_data)
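
A third option, not covered in the answer above, is the warm_start parameter of IsolationForest. With warm_start=True, raising n_estimators and calling .fit() again keeps the trees that are already fitted and only grows the additional ones on the data passed to that call. The sketch below assumes this is an acceptable stand-in for "updating" the model; it is not true online learning, because the old trees are never revised. The tree counts (100, then 150) and the seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-ins for the old and new datasets (same number of features)
df1 = rng.integers(1, 100, size=(100, 10))
df2 = rng.integers(1, 500, size=(1000, 10))

# warm_start=True tells later fit() calls to reuse the existing trees
model = IsolationForest(n_estimators=100, warm_start=True, random_state=0)
model.fit(df1)
n_trees_before = len(model.estimators_)

# Ask for 50 more trees; only these new trees are fitted, and they see df2 only
model.set_params(n_estimators=150)
model.fit(df2)
n_trees_after = len(model.estimators_)

# Predictions now come from the mixed ensemble (old trees + new trees)
labels = model.predict(df2)  # 1 = inlier, -1 = outlier
print(n_trees_before, n_trees_after, labels.shape)
```

The share of old versus new trees controls how much weight the older data keeps, so this sits between refitting on df2 alone and refitting on the concatenated data. If the signal has drifted a lot, a full refit on recent data is likely still the safer choice.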