This project is about playing with diffusion models: implementing diffusion sampling loops and exploring what pretrained diffusion models can do. In the second part, we go deeper into the architecture by implementing a UNet and a DDPM (Denoising Diffusion Probabilistic Model) from scratch.
In this part, we will use an existing diffusion model, DeepFloyd, to sample images. This gives us a feel for how diffusion models work and how they can be used to generate images before we implement them from scratch in the next part.
First, since diffusion takes a while, we need to choose our device. We use GPUs since they are much faster than CPUs for this; we tried both Google Colab and our local machine's Metal (Apple silicon) GPU. Then we choose the model we want to use -- DeepFloyd. To use it, we load the model, encode some text prompts into embeddings the model understands, and use those embeddings to generate images.
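A rough sketch of the setup is below, assuming the Hugging Face `diffusers` library and the `DeepFloyd/IF-I-XL-v1.0` stage-1 checkpoint (the exact checkpoint, dtype, and device handling in our notebook may differ slightly):

```python
import torch
from diffusers import DiffusionPipeline

# Pick the fastest available device: CUDA on Colab, Metal (MPS) on Apple silicon, else CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Load stage 1 of DeepFloyd IF (generates 64x64 images).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to(device)

# Encode a text prompt into embeddings the model understands.
prompt_embeds, negative_embeds = stage_1.encode_prompt("an oil painting of a snowy mountain village")

# Sample with a fixed seed so results are reproducible.
generator = torch.manual_seed(276)
image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=20,
    generator=generator,
).images[0]
```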
We used the seed 276 to generate the images; the model uses this seed internally, so the results are reproducible. Here are some of the images we generated with two different numbers of inference steps. Notice how more inference steps give a more detailed, cleaner image:
prompt | 10 inference steps | 20 inference steps |
---|---|---|
an oil painting of a snowy mountain village | | |
a man wearing a hat | | |
a rocket ship | | |
A key idea of diffusion models is to understand the noising process of an image. This forward process is defined by adding Gaussian noise as such:

$$ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\big) $$
which is equivalent to:

$$ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $$
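A minimal sketch of this forward process in code, assuming `alphas_cumprod` is the model's $\bar\alpha_t$ schedule stored as a 1-D tensor:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image `im` to timestep `t`: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    alpha_bar = alphas_cumprod[t]
    noise = torch.randn_like(im)          # eps ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * noise
```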
To see this in action, we implement the forward process of the diffusion model: we start with a clean image and add noise to it. Here are some results at different timesteps `t` for the Campanile tower:
t | 0 | 250 | 500 | 750 |
---|---|---|---|---|
image |
This matters because diffusion models reverse this noising process to turn noise into AI-generated images. But first, let's see how we can denoise images using classical methods. One can use a simple Gaussian filter to denoise images, though the results are pretty subpar. Here are some results of denoising the Campanile tower image with a Gaussian filter:
t | 250 | 500 | 750 |
---|---|---|---|
kernel size | 3 | 5 | 9 |
noisy image | |||
classically denoised image |
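For reference, the classical baseline is just a Gaussian blur, e.g. with torchvision (assuming `noisy_250`, `noisy_500`, `noisy_750` are the noisy images from the forward process above):

```python
from torchvision.transforms.functional import gaussian_blur

# Blur harder for noisier images, matching the kernel sizes in the table.
denoised_250 = gaussian_blur(noisy_250, kernel_size=3)
denoised_500 = gaussian_blur(noisy_500, kernel_size=5)
denoised_750 = gaussian_blur(noisy_750, kernel_size=9)
```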
Now that we have seen how classical denoising works, we can compare it to denoising with the diffusion model -- think of it as an AI predicting the denoised image. To do this, we first apply noise to the original image at a timestep `t`, then use the model to predict the denoised image given `t`. Here are some results of one-step denoising the Campanile tower image:
t | 250 | 500 | 750 |
---|---|---|---|
noisy image | |||
denoised image |
We can see that these denoised images are much better than the classically denoised ones. However, the larger `t` gets, the less faithful the denoised image is to the original. One amazing thing is that we have already generated fake images that "look" like the Campanile tower. This is the power of diffusion models.
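The clean-image estimate comes from rearranging the forward-process equation. A sketch, assuming `unet` is DeepFloyd's stage-1 UNet and `prompt_embeds` is the embedding of "a high quality photo":

```python
import torch

def one_step_denoise(x_t, t, alphas_cumprod, unet, prompt_embeds):
    """Estimate the clean image x_0 from a noisy image x_t at timestep t."""
    with torch.no_grad():
        model_out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    noise_est = model_out[:, :3]                       # DeepFloyd's extra channels predict variance
    alpha_bar = alphas_cumprod[t]
    # Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for x_0.
    return (x_t - (1 - alpha_bar).sqrt() * noise_est) / alpha_bar.sqrt()
```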
In practice, the true denoising process is iterative: we loop over a decreasing sequence of timesteps and denoise the image a little at each step, which improves the result. To go from a noisier timestep $t$ to a less noisy timestep $t'$, we use the following formula:

$$ x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma $$
where:

- $x_t$ is the image at timestep $t$
- $t' < t$ is the next, less noisy timestep
- $\bar\alpha_t$ comes from the model's noise schedule (`alphas_cumprod`)
- $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$
- $x_0$ is our current estimate of the clean image, obtained with the one-step denoising above
The $v_\sigma$ term is random noise, which in the case of DeepFloyd is also predicted by the model. Computing it is not very important for us, so we use a provided function, `add_variance`, to add it.
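Putting this together, here is a sketch of the iterative denoising loop (assuming `strided_timesteps` runs from 990 down to 0 in steps of 30, `forward` and `one_step_denoise` are as sketched above, and `add_variance(mean, t)` adds the predicted $v_\sigma$ -- its exact signature may differ):

```python
def iterative_denoise(image, i_start, strided_timesteps, alphas_cumprod,
                      unet, prompt_embeds, add_variance):
    """Noise `image` to strided timestep `i_start`, then denoise it step by step back to t = 0."""
    x_t = forward(image, strided_timesteps[i_start], alphas_cumprod)
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        x0_est = one_step_denoise(x_t, t, alphas_cumprod, unet, prompt_embeds)

        alpha_bar, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = alpha_bar / alpha_bar_prev
        beta = 1 - alpha

        # Blend the clean-image estimate with the current noisy image (the formula above) ...
        mean = (alpha_bar_prev.sqrt() * beta / (1 - alpha_bar)) * x0_est \
             + (alpha.sqrt() * (1 - alpha_bar_prev) / (1 - alpha_bar)) * x_t
        # ... then add the predicted variance term v_sigma.
        x_t = add_variance(mean, t)
    return x_t
```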
Stepping through the timesteps with a stride of 30, starting from timestep 990 and going all the way down to 0, and beginning the denoising at the 10th strided timestep, we get the results shown below:
iteration | 10 | 15 | 20 | 25 | 30 |
---|---|---|---|---|---|
denoised image |
And we can compare this with our one-step denoising and our classical denoising:
Original | Iterative Denoising | One-Step Denoising | Classical Denoising |
---|---|---|---|
We can see that the iterative denoising process is much better than the one-step denoising process and the classical denoising process.
Aside from denoising an image to recover something like the Campanile, we can also start from pure random noise, and the model will try to pull it onto the manifold of real images! Here are some results of sampling images from random noise. Unfortunately, they are not great, but it is still interesting to see what the model generates from pure noise:
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
---|---|---|---|---|
To improve the quality of the generated images, we can use classifier-free guidance (CFG). Basically, we give a text prompt to the model to guide generation, kind of like with DALL·E. To make our images better, we condition on the prompt "a high quality photo". This gives us better results at the expense of diversity.
To implement this, we predict the noise twice: once conditioned on the text prompt and once unconditionally. In CFG, we denote these noise estimates $\epsilon_c$ and $\epsilon_u$. Then, we let our new noise estimate be

$$ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) $$
where $\gamma$ controls the strength of CFG. Notice that for $\gamma = 0$ we get the unconditional noise estimate, and for $\gamma = 1$ we get the conditional noise estimate. The magic happens when $\gamma > 1$. With a guidance scale of $\gamma = 7$, we get the following results:
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
---|---|---|---|---|
We can see that the images are qualitatively better than the ones generated without the guidance scale.
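A sketch of the CFG noise estimate, assuming `null_embeds` is the embedding of the empty prompt `""`:

```python
import torch

def cfg_noise_estimate(x_t, t, unet, cond_embeds, null_embeds, guidance_scale=7.0):
    """Classifier-free guidance: push the conditional estimate further from the unconditional one."""
    with torch.no_grad():
        eps_c = unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_u = unet(x_t, t, encoder_hidden_states=null_embeds).sample[:, :3]
    return eps_u + guidance_scale * (eps_c - eps_u)
```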
So we've seen two things so far: getting an image similar to the Campanile tower but not quite, and generating random realistic images from noise. But what if we want an image that resembles the Campanile while still being noticeably different? We can take an image and translate it into another image; this is called image-to-image translation. We take the initial image, noise it a little, then denoise it to get a new image. Essentially, we are asking the model for an image "similar" to the Campanile but not quite, and the more we noise the image, the more the result differs from the original. Here are some results of image-to-image translation for different noise levels (controlled by `i_start`, the strided timestep index at which we start denoising; the lower `i_start`, the more noise is added):
Original | i_start = 1 | i_start = 3 | i_start = 5 | i_start = 7 | i_start = 10 | i_start = 20 |
---|---|---|---|---|---|---|
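With the iterative denoising sketch above, image-to-image translation is just a choice of starting index (variable names here, like `campanile`, are placeholders):

```python
# Lower i_start => start at a noisier timestep => the result strays further from the original.
edits = {
    i_start: iterative_denoise(campanile, i_start, strided_timesteps,
                               alphas_cumprod, unet, prompt_embeds, add_variance)
    for i_start in [1, 3, 5, 7, 10, 20]
}
```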
We can also use the model to edit hand-drawn images and web images with the same process. Here are some results of editing hand-drawn and web images:
Original | i_start = 1 | i_start = 3 | i_start = 5 | i_start = 7 | i_start = 10 | i_start = 20 |
---|---|---|---|---|---|---|
NOTE: The first web image came from https://i.pinimg.com/originals/22/9c/56/229c56342f2303522a3189376909dfb5.jpg.
More Results (looking like an actual tree!):
Image 2 with i_start = 22 |
---|
We can run essentially the same procedure on an image with a mask to inpaint it. To do this, we run the denoising loop and apply the following update at every step:

$$ x_t \leftarrow m\, x_t + (1 - m)\, \text{forward}(x_\text{orig}, t) $$
Essentially, we leave everything inside the edit mask alone, but we replace everything outside the edit mask with our original image -- with the correct amount of noise added for timestep $t$.
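In code, that per-step update is just one line inside the denoising loop (a sketch, with `mask` equal to 1 in the region we want to regenerate and 0 elsewhere):

```python
# Inside iterative_denoise, right after computing x_t for the current timestep t:
x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```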
Here are some results of inpainting:
Original | Mask | To Replace | Inpainted |
---|---|---|---|
Now, we will condition our image-to-image translation on text. So far we have conditioned on the generic prompt "a high quality photo", but what if we want to condition on a more specific text prompt?
Here is an example of our same images being conditioned on the prompt "a rocket ship":
i_start = 1 | i_start = 3 | i_start = 5 | i_start = 7 | i_start = 10 | i_start = 20 |
---|---|---|---|---|---|
These are completely differently themed rockets! But actually we can use pretty much any prompt that the model understands. For example, taking image 2 with the prompt "an oil painting of people around a campfire", we get the following result:
It's pretty cool to see my drawings transform into something else.
But we can do more than that! We can also make visual anagrams: an image that, when flipped upside down, looks like a different image. The algorithm is as follows:

$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$
$$ \epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big) $$
$$ \epsilon = \frac{\epsilon_1 + \epsilon_2}{2} $$
Essentially, we estimate the noise that denoises the image toward prompt 1, estimate the noise that denoises the flipped image toward prompt 2 (flipping that estimate back), and then average the two to get the final noise estimate. Here are some results of visual anagrams:
Prompt 1 | Prompt 2 | Visual Anagram | Visual Anagram Flipped |
---|---|---|---|
an oil painting of an old man | an oil painting of people around a campfire | ||
a lithograph of a skull | a photo of a man | ||
a photo of a dog | a pencil |
You can really see it. It's pretty cool!
We can also do hybrid images. We can take two images and combine them to get a hybrid image. A hybrid image is an image that looks like one image from far away, but another image up close.
The algorithm is as follows:

$$ \epsilon_1 = \text{UNet}(x_t, t, p_1) $$
$$ \epsilon_2 = \text{UNet}(x_t, t, p_2) $$
$$ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2) $$
where UNet is the diffusion model UNet, $f_\text{lowpass}$ is a low pass function, $f_\text{highpass}$ is a high pass function, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$. Here are some results of hybrid images:
Prompt 1 | Prompt 2 | Hybrid Image |
---|---|---|
a lithograph of waterfalls | a lithograph of a skull | |
a rocket ship | a pencil | |
a photo of a man | a photo of a dog |
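A sketch of the hybrid noise estimate, using a Gaussian blur as the low-pass filter (the kernel size and sigma here are assumptions, not necessarily the values we used):

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def hybrid_noise_estimate(x_t, t, unet, embeds_1, embeds_2, kernel_size=33, sigma=2.0):
    """Low frequencies follow prompt 1, high frequencies follow prompt 2."""
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample[:, :3]
        eps_2 = unet(x_t, t, encoder_hidden_states=embeds_2).sample[:, :3]
    low = gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)            # f_lowpass
    high = eps_2 - gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)   # f_highpass
    return low + high
```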
At its core, a denoising diffusion model is a UNet, so we implement one from scratch. The UNet is a neural network that takes a noisy image and predicts the clean, denoised image. The UNet is defined as follows:
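A rough sketch of a small UNet-style denoiser for 1x28x28 MNIST images (a simplification; our actual block structure may differ):

```python
import torch
import torch.nn as nn

class SimpleUNet(nn.Module):
    """A tiny encoder-decoder with one skip connection for 1x28x28 MNIST images."""
    def __init__(self, hidden=64):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(1, hidden, 3, padding=1), nn.GELU())
        self.down = nn.Sequential(nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.GELU())           # 28 -> 14
        self.mid  = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU())
        self.up   = nn.Sequential(nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.GELU())  # 14 -> 28
        self.dec  = nn.Conv2d(2 * hidden, 1, 3, padding=1)    # the skip connection doubles the channels

    def forward(self, x):
        h1 = self.enc(x)
        h = self.up(self.mid(self.down(h1)))
        return self.dec(torch.cat([h, h1], dim=1))             # predict the clean image
```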
We train it on the MNIST dataset as a one-step denoiser. Essentially, we recreate the noising process on MNIST images, with $\sigma$ controlling how noisy they are:

$$ z = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $$
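In code, generating a noisy training input is a one-liner (a sketch):

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps with eps ~ N(0, I); x is a batch of clean MNIST images."""
    return x + sigma * torch.randn_like(x)
```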
Then we can use an MSE loss between the predicted and clean images to train the UNet. Our hyperparameters are as follows:
We get the following loss curve when training on images with $\sigma = 0.5$:
Here are some results of training the UNet on the MNIST dataset:
Epoch = 1 | Epoch = 5 |
---|---|
We can notice better denoising as the epochs increase.
We can also test the UNet on out-of-distribution data, i.e. noise levels other than $\sigma = 0.5$. Results are below; notice how it does well for values of $\sigma$ near 0.5 but begins to fail as $\sigma$ moves further away:
Instead of predicting the clean image directly, we can predict the noise, given the timestep that tells us how noisy the image is. This is called a Denoising Diffusion Probabilistic Model (DDPM), and it lets us generate AI digit images from pure noise. We implement this from scratch as well. The DDPM is defined as follows:
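A sketch of one training step under the standard DDPM objective (predict the noise $\epsilon$ from $x_t$ and the timestep); passing a normalized timestep to the UNet is an assumption about the interface, and `num_timesteps` is a free parameter:

```python
import torch
import torch.nn.functional as F

def ddpm_train_step(unet, x0, alphas_cumprod, optimizer, num_timesteps=300):
    """One DDPM step: noise a clean batch to a random timestep and regress the added noise."""
    t = torch.randint(1, num_timesteps, (x0.shape[0],), device=x0.device)
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps   # forward process
    eps_pred = unet(x_t, t / num_timesteps)                      # UNet conditioned on (normalized) t
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```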
with the following hyperparameters:
We get the following loss curve when training on the MNIST dataset:
log scale | linear scale |
---|---|
Here are some results of training the DDPM on the MNIST dataset:
Epoch = 5 | Epoch = 20 |
---|---|
It's reasonable after 20 epochs. A problem is that we can't sample specific numbers. Let's fix that.
We can condition the DDPM on the digit we want to generate. With that small tweak and the same hyperparameters as before, we get the following loss curve when training on the MNIST dataset. At sampling time, we apply classifier-free guidance with a guidance scale of 5:
log scale | linear scale |
---|---|
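A sketch of the conditioning tweak: during training the UNet also receives a one-hot class vector, which we randomly zero out so the model still learns an unconditional estimate (the 10% drop probability is an assumption):

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors, dropped to all zeros with probability p_uncond (enables CFG)."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], device=labels.device) < p_uncond).float().unsqueeze(1)
    return c * (1 - drop)

# At sampling time we combine the two estimates with gamma = 5:
#   eps = eps_uncond + 5.0 * (eps_cond - eps_uncond)
```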
Here are some results of training the Conditional DDPM on the MNIST dataset:
Epoch = 5 | Epoch = 20 |
---|---|