python tensorflow resuming-training

How can I keep recording training history for a TensorFlow model in the same history file across sessions?


I am training a large TensorFlow model on a lot of data. I need to stop, save the model, and reload it later to continue training on new data.

If I save the history file, can I restart training and keep appending the results to the same history file? If so, how?

My two reasons:

Keep complete tracking of how my model's training evolves. Save the best model of all time, not just the best of the latest training session, which could be worse (bad data...) than earlier sessions.

Thanks for your input!


Solution

  • It is a bit complicated, but here is how to do it. First, train your model for a certain number of epochs with

    history = model.fit(...)  # training data, epochs=..., etc.

    After training is complete, save your model.
    

    history is a History object with an attribute .history. history.history is a dictionary that holds the information recorded by model.fit. Each key maps to a list with one value per epoch. Which keys are present depends on two things: whether you trained with a validation set, and which metrics you specified when you compiled the model.

    If you trained without a validation set and did not specify any metrics, the dictionary has a single key, 'loss', whose value is the list of training losses, one per epoch. If you included a validation set but still specified no metrics, there are two keys: 'loss' as before, and 'val_loss', the list of validation losses for each epoch. If you included a validation set and also specified 'accuracy' as a metric, there are four keys: 'loss', 'val_loss', 'accuracy' and 'val_accuracy'.

    After you finish your first training session, the first thing to do is save the dictionary data. The easiest way is to save it as a csv file. The function below will do that

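    import pandas as pd

    def save_history_to_csv(history, csvpath):
        # history.history maps each metric name to a list of per-epoch values
        trdict = history.history
        df = pd.DataFrame()
        for key in trdict.keys():
            data = list(trdict[key])
            print(key, data)
            df[key] = data  # one column per metric, one row per epoch
        print(df.head())
        # index=False keeps the row index out of the file
        df.to_csv(csvpath, index=False)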
    

    To use the function, define the full path where the csv file should be saved

    csvpath=r'c:\temp\history.csv'
    save_history_to_csv(history, csvpath)
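
To see the shape of the data without running a training job, the sketch below uses a stand-in object with a .history attribute in place of the real return value of model.fit. The metric values are made up for illustration, and history_to_frame is a hypothetical helper name. It relies on pd.DataFrame accepting a dictionary of equal-length lists, which gives the same one-column-per-key layout as the loop in save_history_to_csv.

```python
import os
import tempfile
from types import SimpleNamespace

import pandas as pd

# Stand-in for the object returned by model.fit; the values are made up.
fake_history = SimpleNamespace(history={
    "loss": [0.9, 0.7, 0.5],
    "val_loss": [1.0, 0.8, 0.6],
})

def history_to_frame(history):
    """Hypothetical helper: one column per key, one row per epoch."""
    return pd.DataFrame(history.history)

frame = history_to_frame(fake_history)

# Write to a temporary file and read it back to check the round trip.
csv_path = os.path.join(tempfile.mkdtemp(), "history.csv")
frame.to_csv(csv_path, index=False)
restored = pd.read_csv(csv_path)

print(list(restored.columns))
print(len(restored))
```

The file holds a header row followed by one row per epoch, so three epochs give three data rows.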
    

    Now the data from the first training run is saved. The csv file's column headings are the keys, and the rows below hold the history data for each column. So if you trained for 5 epochs, the csv file contains a header row and 5 rows of data, one row for each epoch. Below is an example of the data (as printed by df.head(), so it shows the row index, which is not written to the file) from a run where I trained with a validation set and specified 'accuracy' and 'auc' as the metrics in model.compile

           loss  accuracy         auc  val_loss  val_accuracy   val_auc
    0  6.949288  0.716548    0.784296  5.912256         0.690   0.773013
    1  4.508249  0.835709    0.919206  4.226338         0.755   0.836487
    2  3.391246  0.881235    0.952171  3.086797         0.855   0.915100
    3  2.675472  0.925178    0.973569  2.519850         0.875   0.923925
    4  2.146932  0.948931    0.988977  2.145782         0.830   0.916100
    

    Now, with this data stored, reload your saved model and train it again. You now have a new history.history dictionary holding the data for the second training session. This data needs to be appended to the saved csv file. The function below will do that

    def update_csv(history, csvpath):
        # read in the saved csv file and create a dataframe
        stored_df = pd.read_csv(csvpath)
        # build a dataframe from the new session's history (pandas imported as pd)
        trdict = history.history
        df = pd.DataFrame()
        for key in trdict.keys():
            df[key] = list(trdict[key])
        # stack the new rows under the stored ones and renumber the index
        new_df = pd.concat([stored_df, df], axis=0).reset_index(drop=True)
        new_df.to_csv(csvpath, index=False)
    

    csvpath is the same path you used before, so just run

    update_csv(history, csvpath)
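
The save and update steps can also be folded into one helper that creates the file on the first session and appends on later ones. The sketch below is one possible way to do that, not a replacement for the functions above: append_history is a hypothetical name, and the stand-in history objects hold made-up values in place of real model.fit results.

```python
import os
import tempfile
from types import SimpleNamespace

import pandas as pd

def append_history(history, csvpath):
    """Hypothetical helper: create the csv on first use, append afterwards."""
    new_df = pd.DataFrame(history.history)
    if os.path.exists(csvpath):
        # A previous session exists, so stack the new rows under the stored ones.
        stored_df = pd.read_csv(csvpath)
        new_df = pd.concat([stored_df, new_df], axis=0).reset_index(drop=True)
    new_df.to_csv(csvpath, index=False)
    return new_df

# Two stand-in training sessions of 5 epochs each (made-up values).
session_1 = SimpleNamespace(history={"loss": [5.0, 4.0, 3.0, 2.5, 2.0]})
session_2 = SimpleNamespace(history={"loss": [1.9, 1.8, 1.7, 1.6, 1.5]})

csv_path = os.path.join(tempfile.mkdtemp(), "history.csv")
append_history(session_1, csv_path)
combined = append_history(session_2, csv_path)

print(len(combined))  # 10 rows: 5 from each session
```

With a helper like this, the same call can be used after every training session, since the existence check decides whether to create or extend the file.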
    

    The updated csv file now has both the data from the first training session and the data from the latest training session. If you ran the second training session for, say, 5 epochs, then the csv file contains a header row and 10 rows of data. You can repeat this process as many times as you like. When you are done with all the training runs, you can read in the csv file with

    df=pd.read_csv(csvpath)
    

    Now the training data is in a dataframe that you can examine and plot. Hope this helps.
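
For the second goal in the question, keeping the best model across all sessions, the accumulated history gives the best value reached so far, and a new session's results can be compared against it before overwriting a saved best model. The sketch below assumes that lower val_loss is better. best_so_far is a hypothetical helper, the inline csv text stands in for the real file and reuses the loss and val_loss columns from the example table, and the new session's values are made up.

```python
import io

import pandas as pd

# Stand-in for the accumulated history file (loss and val_loss from the example table).
csv_text = """loss,val_loss
6.949288,5.912256
4.508249,4.226338
3.391246,3.086797
2.675472,2.519850
2.146932,2.145782
"""

def best_so_far(df, monitor="val_loss"):
    """Hypothetical helper: return (row index, value) of the lowest monitored value."""
    best_row = df[monitor].idxmin()
    return best_row, df.loc[best_row, monitor]

df = pd.read_csv(io.StringIO(csv_text))
best_row, best_value = best_so_far(df)

# A new session only beats the all-time best if its minimum is lower.
new_session_val_loss = [2.4, 2.2, 2.3]  # made-up values for illustration
improved = min(new_session_val_loss) < best_value

print(best_row, improved)
```

If improved is true, that is the point at which to save the model as the new all-time best; otherwise the previously saved best model is kept.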