Suppose I have a dataset of 1000 Pokemon: 10 images of Pikachu, 10 images of Bulbasaur, and so on. I also have metadata specifying the exact name of each Pokemon, so from the metadata I can tell which image is Pikachu and which is not. I want to fine-tune a Stable Diffusion model to draw Pokemon with a prompt like "a drawing of [name]", where [name] is the name of the Pokemon I want to draw. This should work fine for any Pokemon in the dataset with a well-known name. I should probably even be able to draw Donald Trump in the style of a Pokemon, because the base model already knows about Donald Trump.
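For context, this is roughly how I am preparing the fine-tuning data. It is just a sketch: the metadata file name and its format (a name-to-image-files mapping in JSON) are my own assumptions, and I write out the metadata.jsonl layout that the Hugging Face ImageFolder loader / diffusers train_text_to_image example reads.

```python
import json
from pathlib import Path

# My metadata: a JSON mapping from Pokemon name to its image files (format assumed).
with open("pokemon_metadata.json") as f:
    metadata = json.load(f)  # e.g. {"Pikachu": ["pikachu_01.png", ...], ...}

# Build one "a drawing of [name]" caption per image.
records = [
    {"file_name": image_file, "text": f"a drawing of {name}"}
    for name, image_files in metadata.items()
    for image_file in image_files
]

# Write the metadata.jsonl layout that HF ImageFolder expects (file_name + caption column).
with open(Path("dataset") / "metadata.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```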
The problem is that when I want to draw a completely made-up Pokemon, the model doesn't know its name. Let's say my Pokemon is called "Megachu" and is basically a thick Pikachu with a red body and wings. I want to introduce Megachu to the model by drawing it myself and showing the image to the model somehow. There are common ways of doing this, such as DreamBooth, Textual Inversion, and DreamArtist, but they all require me to train the model, which takes a long time and is costly.
So what I want is to somehow feed the model Pokemon embedding vectors, so that it knows how to draw any Pokemon from its embedding instead of its name. Given a new Pokemon like Megachu, I want to just run the Megachu image through an embedding extraction process and feed the resulting embedding to the model so it can draw my Megachu. I think this should be roughly similar to how face embeddings are used. A sketch of the kind of pipeline I have in mind is below.
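Here is a minimal sketch of what I mean, assuming SD 1.x (768-dim text embeddings) and the CLIP ViT-L/14 image encoder: extract an image embedding for the new character and overwrite a placeholder token's embedding with it before prompting. The file name megachu.png, the placeholder token, and the naive norm-matching step are just my assumptions; without a learned projection this injection is very rough, which is part of why I am asking.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# 1) Extract a CLIP image embedding for the new character ("Megachu").
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device)
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
pixels = processor(images=Image.open("megachu.png"), return_tensors="pt").pixel_values
with torch.no_grad():
    image_embed = image_encoder(pixels.to(device)).image_embeds[0]  # (768,) for ViT-L/14

# 2) Register a placeholder token and overwrite its embedding with the image embedding.
placeholder = "<megachu>"
pipe.tokenizer.add_tokens([placeholder])
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids(placeholder)
with torch.no_grad():
    emb = pipe.text_encoder.get_input_embeddings().weight
    # Naive rescaling so the injected vector roughly matches typical token-embedding norms;
    # a real system would learn this projection instead.
    image_embed = image_embed / image_embed.norm() * emb.norm(dim=-1).mean()
    emb[token_id] = image_embed.to(emb.dtype)

# 3) Use the placeholder token in the prompt, just like a Pokemon name.
image = pipe(f"a drawing of {placeholder}").images[0]
image.save("megachu_generated.png")
```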
I am very new to the Stable Diffusion architecture in general, so please suggest a way to do this.
I tried Stable Diffusion Variations and it doesn't preserve the character. For example, if I give it Megachu, it changes my Pokemon's color, wing shape, or body thickness.
I have since studied the Stable Diffusion architecture in detail, so let me answer this myself: