Tags: python, huggingface, stable-diffusion, accelerate

Resume from checkpoint with Accelerator causes loss to increase


I've been working on a project to fine-tune Stable Diffusion and introduce layout conditioning. I'm using all the components from the Hugging Face Stable Diffusion pipeline frozen; only the UNet and my custom conditioning model, called LayoutEmbeddeder, are trainable.

I've managed to adapt some code to my needs and train the model. However, my code crashed during execution, and although I think I have implemented checkpointing correctly, when I resume training the loss is much higher and the generated images are pure noise compared to what I was logging during training.

So it looks like maybe I'm not saving the checkpoints properly. Is anyone able to take a look and give me some advice? I'm not very experienced with Accelerator and there is a lot of magic going on.

Code can be found here: https://github.com/mia01/stable-diffusion-layout-finetune (the checkpoint code is in main.py; I'm using Accelerate hooks). Wandb log here (you can clearly see the jump in the loss): https://api.wandb.ai/links/dissertation-project/wqq4croy
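For context, the checkpointing follows roughly this Accelerate hook pattern (a simplified sketch, not my exact code; the real implementation is in main.py above, and the per-model file names here are just placeholders):

```python
import os
import torch
from accelerate import Accelerator

accelerator = Accelerator()

def save_model_hook(models, weights, output_dir):
    # Called by accelerator.save_state(): write each tracked model's weights
    # into the checkpoint directory, then pop them from `weights` so Accelerate
    # does not serialize them a second time.
    for i, model in enumerate(models):
        torch.save(model.state_dict(), os.path.join(output_dir, f"model_{i}.pt"))
        weights.pop()

def load_model_hook(models, input_dir):
    # Called by accelerator.load_state(): restore each tracked model's weights
    # from the matching files written above.
    while models:
        i = len(models) - 1
        model = models.pop()
        state = torch.load(os.path.join(input_dir, f"model_{i}.pt"))
        model.load_state_dict(state)

accelerator.register_save_state_pre_hook(save_model_hook)
accelerator.register_load_state_pre_hook(load_model_hook)

# During training, checkpoints are written with:
#   accelerator.save_state(os.path.join(output_dir, f"checkpoint-{global_step}"))
# and on resume they are restored with:
#   accelerator.load_state(resume_checkpoint_path)
```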

This is what the checkpoint directory looks like: [screenshots of the checkpoint directory contents]

Also, when I resume training, the validation images I generate every x steps look like this: [screenshot of noisy validation samples] These are worse than the samples generated before I started fine-tuning!

Any tips on where I went wrong would be very much appreciated! Thank you


Solution

  • It turns out I had not included the VAE in Accelerate's prepare method, so it was not being saved in the state nor loaded back in when resuming training (see the sketch below).
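A minimal sketch of the fix, assuming the standard Accelerate API (the variable names are placeholders for my actual objects, which come from the Stable Diffusion pipeline plus the LayoutEmbeddeder): the VAE now goes through accelerator.prepare() as well, so save_state()/load_state() track its state along with the UNet and the LayoutEmbeddeder.

```python
# Before: the VAE was left out of prepare(), so accelerator.save_state()
# never wrote it to the checkpoint and load_state() never restored it.
# unet, layout_embedder, optimizer, train_dataloader = accelerator.prepare(
#     unet, layout_embedder, optimizer, train_dataloader
# )

# After: pass the (frozen) VAE through prepare() too, so it is registered
# with the Accelerator and included in save_state()/load_state().
unet, layout_embedder, vae, optimizer, train_dataloader = accelerator.prepare(
    unet, layout_embedder, vae, optimizer, train_dataloader
)

# Checkpointing itself is unchanged:
# accelerator.save_state(checkpoint_dir)   # during training
# accelerator.load_state(checkpoint_dir)   # when resuming
```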