deep-learning, computer-vision, embedding, word-embedding, stable-diffusion

How do I introduce a new subject to Stable Diffusion without extra training?


Suppose I have a dataset of 1000 Pokemon: 10 images of Pikachu, 10 images of Bulbasaur, and so on. I also have metadata specifying the exact name of each Pokemon, so I know which image shows Pikachu and which does not. I want to fine-tune a Stable Diffusion model to draw Pokemon with a prompt like "a drawing of [name]", where [name] is the name of the Pokemon I want to draw. It should then be able to draw any Pokemon with a well-known name in the dataset. I should probably even be able to draw Donald Trump in the style of a Pokemon, because the base model already knows about Donald Trump.
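To make the setup concrete, here is a minimal sketch of the data preparation, assuming the names live in a CSV next to the images (the paths and column names are hypothetical). The `file_name`/`text` fields are the ones Hugging Face's ImageFolder loader and the diffusers text-to-image fine-tuning example expect:

```python
# Sketch only: build metadata.jsonl pairing each image with "a drawing of <name>".
# The paths and the names.csv layout are assumptions; adapt them to your dataset.
import csv
import json
from pathlib import Path

DATA_DIR = Path("pokemon_dataset")     # folder containing the dataset images
NAMES_CSV = DATA_DIR / "names.csv"     # assumed columns: file_name, pokemon_name

with open(NAMES_CSV, newline="") as f, open(DATA_DIR / "metadata.jsonl", "w") as out:
    for row in csv.DictReader(f):
        caption = f"a drawing of {row['pokemon_name']}"
        out.write(json.dumps({"file_name": row["file_name"], "text": caption}) + "\n")
```

The resulting folder can then be passed to a fine-tuning script such as diffusers' `train_text_to_image.py` via `--train_data_dir`, with the caption column set to `text`.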

The problem arises when I want to draw a completely new, made-up Pokemon whose name the model doesn't know. Say my Pokemon is called "Megachu" and is basically a thick Pikachu with a red body and wings. I want to introduce Megachu to the model by drawing it myself and somehow showing the image to the model. The common ways of doing this are DreamBooth, Textual Inversion, DreamArtist, etc., but they all require training the model, which takes a long time and is costly.

So what I want is to somehow feed the model Pokemon embedding vectors, so that it knows how to draw any Pokemon from its embedding instead of its name. Given a new Pokemon like Megachu, I just want to run the Megachu image through an embedding extraction process and feed the resulting embedding to the model so that it can draw my Megachu. I think this should be roughly similar to the way face embedding models are trained.
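For concreteness, by "embedding extraction process" I imagine something along the lines of running the image through a frozen image encoder such as CLIP (this is only an illustration; the model name is an example, not a requirement):

```python
# Illustration only: extract an image embedding for Megachu with a CLIP image encoder.
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModelWithProjection

model_id = "openai/clip-vit-large-patch14"  # same CLIP family as SD v1.x's text encoder
processor = CLIPProcessor.from_pretrained(model_id)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

image = Image.open("megachu.png").convert("RGB")   # my hand-drawn Megachu
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    embedding = image_encoder(**inputs).image_embeds   # shape (1, 768)
```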

I am very new to the Stable Diffusion architecture in general. Please suggest a way

  1. to train the embedding vectors for Pokemon from the 1000 images (preferably using the existing Stable Diffusion weights to assist the process, so that it can be accurate with a small amount of data). Which model layer should I modify? Should I just represent this vector as tokens?
  2. to extract the Pokemon embedding from any image. Should the model that produces the embedding be separate from the diffusion model itself?

I tried Stable Diffusion Variations and it doesn't preserve the character: if I give it Megachu, it changes my Pokemon's color, the wing shape, or the body thickness.
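(For reference, and assuming "Stable Diffusion Variations" means the publicly available image-variations checkpoint from Lambda Labs on the Hugging Face Hub, what I ran was roughly the following.)

```python
# Roughly what I tried; the settings are examples, not a recommendation.
import torch
from PIL import Image
from diffusers import StableDiffusionImageVariationPipeline

pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers"
)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

megachu = Image.open("megachu.png").convert("RGB")
result = pipe(megachu, guidance_scale=3.0, num_inference_steps=50).images[0]
result.save("megachu_variation.png")
```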


Solution

  • I have already studied the Stable Diffusion architecture in detail, so let me answer this myself:

    1. Train an embedding model separately, similar to face embedding models or CLIP. Then feed the embedding vector into the UNet as conditioning at every cross-attention layer, for example by concatenating it to the text-token sequence the UNet already attends to, and train it; see the sketch after this list.
    2. Yes. Study the Stable Diffusion architecture and you will see how to inject this special embedding as guidance for image generation; this is called conditioning the model.
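To make the conditioning path concrete, here is a minimal sketch of the injection point (assumptions: Stable Diffusion v1.x weights loaded through diffusers, a 768-dimensional image embedding as produced by a CLIP-style encoder, and a learned linear projection; the training loop itself is omitted). The Pokemon embedding is projected and appended to the text-token sequence that every cross-attention layer in the UNet already attends to:

```python
# Sketch of injecting a custom Pokemon embedding as an extra conditioning "token"
# for the UNet's cross-attention layers. The model id, shapes, and projection
# layer are assumptions for illustration; the noise-prediction training loop is omitted.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Stand-in for the output of the image-embedding extractor (e.g. a CLIP image encoder).
pokemon_embedding = torch.randn(1, 768)

# Learned projection into the UNet's cross-attention width; the extra sequence
# dimension makes the embedding look like one additional text token.
project = torch.nn.Linear(768, unet.config.cross_attention_dim)
pokemon_token = project(pokemon_embedding).unsqueeze(1)            # (1, 1, 768)

# Stand-in for the frozen CLIP text encoder's output for "a drawing of a pokemon".
text_embeddings = torch.randn(1, 77, unet.config.cross_attention_dim)

conditioning = torch.cat([text_embeddings, pokemon_token], dim=1)  # (1, 78, 768)

# The UNet is called exactly as in vanilla Stable Diffusion, just with the longer
# conditioning sequence; during fine-tuning the projection (and optionally the
# cross-attention weights) would be trained with the usual noise-prediction loss.
noisy_latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor([10])
with torch.no_grad():
    noise_pred = unet(noisy_latents, timestep, encoder_hidden_states=conditioning).sample
```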