I am trying to deconstruct the SD3.5 pipeline (specifically 3.5 medium) in order to gain fine-grained control over the denoising steps. I can't use callbacks because I need to modify the latents based on other pipelines.
I am following the steps in this Hugging Face guide: https://huggingface.co/docs/diffusers/en/using-diffusers/write_own_pipeline#deconstruct-the-stable-diffusion-pipeline
I modified the text encoding to fit SD3.5. I also tried loading the full pipeline and calling its encode_prompt function to get the text embeddings and pooled embeddings for both the prompt and the negative prompt. When I pass those outputs to the regular pipeline instead of the prompt and negative prompt strings, it produces correct images, so the text encoding does not seem to be the problem.
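For reference, this is roughly how I call it (model id matches 3.5 medium; the prompts are placeholders):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

# encode_prompt returns the CLIP+T5 embeddings plus the pooled CLIP
# embeddings, for both the prompt and the negative prompt
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="a photo of an astronaut riding a horse",
    prompt_2=None,  # falls back to `prompt` for the second CLIP encoder
    prompt_3=None,  # falls back to `prompt` for the T5 encoder
    negative_prompt="blurry, low quality",
    do_classifier_free_guidance=True,
)
```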
I also replaced the UNet from the guide with the model's pre-trained transformer, and then adjusted the latent decoding to match the decoding in the pipeline's source code in diffusers.
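My denoising loop is a sketch of what I believe it should look like, based on the StableDiffusion3Pipeline source (it assumes `pipe` and the embeddings from the snippet above; resolution and guidance values are just the model-card defaults):

```python
num_inference_steps = 40
guidance_scale = 4.5
device = "cuda"

scheduler = pipe.scheduler      # FlowMatchEulerDiscreteScheduler
transformer = pipe.transformer  # SD3Transformer2DModel, replaces the guide's UNet

# Classifier-free guidance: negative embeddings first, matching the
# concatenation order in the pipeline source
prompt_embeds_cfg = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
pooled_embeds_cfg = torch.cat(
    [negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0
)

# SD3.5 latents have 16 channels (the guide's SD1.5 UNet had 4), and the
# initial latents are plain Gaussian noise with no init_noise_sigma scaling
height = width = 1024
latents = torch.randn(
    (1, transformer.config.in_channels, height // 8, width // 8),
    dtype=transformer.dtype,
    device=device,
)

scheduler.set_timesteps(num_inference_steps, device=device)
with torch.no_grad():
    for t in scheduler.timesteps:
        # Unlike the SD1.5 guide, there is no scheduler.scale_model_input()
        latent_model_input = torch.cat([latents] * 2)
        timestep = t.expand(latent_model_input.shape[0])

        noise_pred = transformer(
            hidden_states=latent_model_input,
            timestep=timestep,
            encoder_hidden_states=prompt_embeds_cfg,
            pooled_projections=pooled_embeds_cfg,
            return_dict=False,
        )[0]

        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (
            noise_pred_text - noise_pred_uncond
        )

        latents = scheduler.step(noise_pred, t, latents, return_dict=False)[0]
```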
The output images don't look the same as the ones produced by running the full pipeline through diffusers. I'm not sure where to find a similar deconstruction of the SD3 pipeline, or what I'm missing.
It looks like the issue is with the noise scheduler or latent handling rather than the text encoding. SD3.5 uses FlowMatchEulerDiscreteScheduler, not the PNDM-style scheduler from the SD1.5 guide, so two steps from that guide don't carry over: the initial latents are plain Gaussian noise (no init_noise_sigma scaling), and there is no scale_model_input call before the transformer. Also verify that your latents have 16 channels rather than SD1.5's 4, and that your decode step applies both the VAE's scaling_factor and shift_factor.
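In particular, the decode step in the SD3 pipeline source looks like this (a sketch assuming `pipe` and the final `latents` from your loop):

```python
import torch

with torch.no_grad():
    # SD3's VAE applies both a scaling_factor and a shift_factor when
    # decoding; the SD1.5 guide only divides by 0.18215, so this differs
    latents = (latents / pipe.vae.config.scaling_factor) + pipe.vae.config.shift_factor
    image = pipe.vae.decode(latents, return_dict=False)[0]

image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("output.png")
```

If the images are still off after fixing these, fix the seed with a torch.Generator and compare your intermediate latents against the official pipeline's step by step to find where they diverge.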