Part A.0: Setup
In Part A, we use the DeepFloyd IF two-stage diffusion model. To use the model, we had to generate prompt embeddings (4096-dimensional vectors in our case). Some of the text prompts we came up with are the following:
- Sand Planet Hatsune Miku
- end of the world as we know it
- steampunk machination
Using the seed 100 and two settings for num_inference_steps, we generated the following images:
num_inference_steps=20
num_inference_steps=100
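A rough sketch of this setup using the Hugging Face diffusers DeepFloyd IF stage-1 pipeline (the model name and call signatures here are assumptions, and stage-2 upsampling is omitted):

```python
# Sketch: load DeepFloyd IF stage 1 and sample with a fixed seed.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# T5 text encoder -> one 4096-dimensional embedding per prompt token
prompt_embeds, negative_embeds = stage_1.encode_prompt("steampunk machination")

for steps in (20, 100):
    generator = torch.manual_seed(100)          # seed 100 for both settings
    image = stage_1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_inference_steps=steps,
        generator=generator,
    ).images[0]
    image.save(f"setup_{steps}_steps.png")
```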
Part A.1: Sampling Loops
A.1.1 The Forward Process
In this section, we implement the forward process: taking a clean image and obtaining a noisy one by sampling from a Gaussian with mean \(\sqrt{\bar\alpha_t}x_0\) and variance \((1 - \bar\alpha_t)\): \[x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, 1)\] where \(x_0\) is the clean image and \(x_t\) is the noisy one. To test this, we took an image of the Campanile, resized it to 64x64, and applied the forward process using the provided alphas_cumprod from DeepFloyd.
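Here is a minimal sketch of the forward function (alphas_cumprod is the table from the stage-1 scheduler; names are illustrative):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)                        # eps ~ N(0, I)
    return abar.sqrt() * x0 + (1 - abar).sqrt() * eps
```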
A.1.2: Classical Denoising
We then attempted to denoise the noisy Campanile images using classical methods: Gaussian blur filtering with varying kernel sizes and standard deviations. The results are shown below.
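For reference, this baseline is just torchvision's Gaussian blur applied to each noisy image; the kernel sizes and sigmas below are placeholders for the values we swept, not the exact ones:

```python
import torchvision.transforms.functional as TF

# Blur each noisy Campanile (noisy_im) with a few kernel size / sigma settings
for kernel_size, sigma in [(3, 1.0), (5, 2.0), (7, 3.0)]:
    blurred = TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```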
A.1.3: One-Step Denoising
We then used the pretrained diffusion model to denoise. The Stage 1 UNet from DeepFloyd can predict the noise in an image given the noisy image, the timestep, and the text-conditioning embedding (in our case, "a high quality photo"). Using the predicted noise, we can estimate the clean image as follows: \[\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_{t}}\, {\epsilon}}{\sqrt{\bar\alpha_{t}}}\] We followed this procedure for the 3 noisy images from A.1.1 (t = [250, 500, 750]):
- Use our forward function to add noise to the Campanile image.
- Estimate the noise in the noisy image by passing it through stage_1.unet.
- Remove the estimated noise from the noisy image to obtain an estimate of the original image (see the sketch below).
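A minimal sketch of this procedure, assuming the diffusers UNet call signature and that the first three output channels of the stage-1 UNet are the noise estimate:

```python
import torch

def one_step_denoise(x_t, t, prompt_embeds, alphas_cumprod):
    abar = alphas_cumprod[t]
    with torch.no_grad():
        out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
        noise_est, _ = out.chunk(2, dim=1)   # keep the noise channels
    # Invert the forward process to estimate the clean image
    return (x_t - (1 - abar).sqrt() * noise_est) / abar.sqrt()
```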
A.1.4: Iterative Denoising
One-step denoising is certainly better than a Gaussian blur filter, but it struggles at high noise levels. Thus, we implemented iterative denoising to remedy this. Rather than denoising all in one step, we remove a little bit of noise at a time. Stepping one timestep at a time from 1000 down to 0 would probably give the best results, but we have neither the computational time nor the money, so we skip steps: starting at t=990, we step down 30 timesteps at a time until we reach 0. To predict the slightly less noisy image at the next step, we followed the equation \[x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t} x_t + v_\sigma\] (a code sketch of this update follows the list below) where:
- \(x_t\) is the image at timestep \(t\)
- \(x_{t'}\) is the noisy image at timestep \(t'\), where \(t' < t\) (less noisy)
- \(\bar\alpha_t\) is defined by alphas_cumprod from the model
- \(\alpha_t = \frac{\bar\alpha_t}{\bar\alpha_{t'}}\)
- \(\beta_t = 1 - \alpha_t\)
- \(x_0\) is our current estimate of the clean image using one-step denoising
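Here is a minimal sketch of a single update (omitting the variance term \(v_\sigma\)); function and variable names are illustrative:

```python
def denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """One iterative-denoising update from timestep t to t' (t' < t)."""
    abar_t  = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp
    beta_t  = 1 - alpha_t
    return (abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
```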
A.1.5: Diffusion Model Sampling
Another thing we can try is starting from pure noise with i_start=0, which is equivalent to starting from t=990. Denoising pure noise essentially hallucinates an image. Using the prompt "a high quality photo", the diffusion model created the images shown below.
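Sampling from scratch is then just the iterative denoiser applied to random noise (iterative_denoise is the loop from A.1.4; the name and shapes are illustrative):

```python
import torch

# Pure noise at the first strided timestep (i_start = 0, i.e. t = 990)
x = torch.randn(1, 3, 64, 64, device="cuda", dtype=torch.float16)
sample = iterative_denoise(x, i_start=0)
```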
A.1.6: Classifier-Free Guidance (CFG)
The results are pretty cool, but to be honest, I can't make heads or tails of what some of the images are. To improve image quality, we can use something called Classifier-Free Guidance (CFG). In CFG, we compute both a conditional and an unconditional noise estimate, denoted \(\epsilon_c\) and \(\epsilon_u\). Then, we let our new noise estimate be: \[\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)\] where \(\gamma\) is a tunable parameter. Notice that for \(\gamma=0\) we get the unconditional noise estimate, and for \(\gamma=1\) we get the conditional noise estimate. When \(\gamma>1\), we mysteriously get higher quality images. To get the unconditional noise estimate, we pass a null prompt embedding to the model. Using the conditional prompt "a high quality photo" and \(\gamma=7\), we obtained the results shown below.
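A minimal sketch of the CFG estimate (unet_call stands in for a single pass through stage_1.unet as in A.1.3; the name is illustrative):

```python
def cfg_noise(x_t, t, cond_embeds, null_embeds, gamma=7.0):
    eps_c = unet_call(x_t, t, cond_embeds)   # conditional estimate
    eps_u = unet_call(x_t, t, null_embeds)   # unconditional estimate (null prompt)
    return eps_u + gamma * (eps_c - eps_u)
```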
A.1.7.0 Image-to-image translation
Now, let's do something fun: let's use CFG to edit an image. We add noise to an image and then use CFG iterative denoising to bring it back; this forces the model to hallucinate content to fill in the gaps. Using the prompt "a high quality photo" and noise levels [1, 3, 5, 7, 10, 20] (indices into our strided timestep list, so larger numbers correspond to less noisy starting points), we generated the images shown below.
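A sketch of the editing loop, assuming the forward function from A.1.1 and a CFG version of the iterative denoiser (function and variable names are illustrative):

```python
# SDEdit-style editing: noise the image to an intermediate timestep,
# then run CFG iterative denoising from there.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    noisy = forward(original_image, t, alphas_cumprod)
    edited = iterative_denoise_cfg(noisy, i_start=i_start)
```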
A.1.7.1
We then performed the same procedure, except this time with web images and doodles.
A.1.7.2
We can use the same procedure to implement inpainting. Given an image and a binary mask, we can create a new image that has the same content where the mask is 0, but new content wherever it is 1. To do this, we run the diffusion denoising loop, but at every step, after obtaining the prediction, we force it to have the same pixels as the original image wherever the mask is 0. Essentially, we leave everything inside the edit mask alone, but we replace everything outside the edit mask with our original image, with the correct amount of noise added for the current timestep. Using this procedure and the same prompt as earlier, we obtained the results shown below.
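Concretely, after each denoising step we re-impose the original pixels outside the edit region (a one-line sketch, with forward being the noising function from A.1.1):

```python
# Keep generated content where mask == 1; force the appropriately noised
# original image where mask == 0.
x_t = mask * x_t + (1 - mask) * forward(original_image, t, alphas_cumprod)
```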
A.1.7.3: Text-Conditional Image-to-image Translation
You might be wondering, "How come section A.1.7 is so long?" Boy do I wish I had an answer to that. You've made it to the last part of this section! All of our previous images were generated using the prompt "a high quality photo"; now let's try some new prompts:
End of the world as we know it
Clockwork machination
Sand planet hatsune miku
A.1.8: Visual Anagrams
In this part, we implement Visual Anagrams to create optical illusions with diffusion models: an image that looks like one prompt, but when flipped upside down reveals a different prompt. To do this, we denoise an image \(x_t\) at step \(t\) normally with the prompt \(p_1\) to obtain noise estimate \(\epsilon_1\). At the same time, we flip \(x_t\) upside down and denoise it with the prompt \(p_2\) to get noise estimate \(\epsilon_2\). We flip \(\epsilon_2\) back, average the two noise estimates, and denoise with the average. The full algorithm is: \[\epsilon_1 = \text{CFG of UNet}(x_t, t, p_1)\] \[\epsilon_2 = \text{flip}(\text{CFG of UNet}(\text{flip}(x_t), t, p_2))\] \[\epsilon = (\epsilon_1 + \epsilon_2) / 2\] The results are shown below.
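A minimal sketch of this composite estimator, reusing the cfg_noise sketch from A.1.6 and flipping along the image height axis:

```python
import torch

def anagram_noise(x_t, t, embeds_1, embeds_2, null_embeds):
    eps_1 = cfg_noise(x_t, t, embeds_1, null_embeds)
    eps_2 = torch.flip(
        cfg_noise(torch.flip(x_t, dims=[-2]), t, embeds_2, null_embeds),
        dims=[-2],
    )
    return (eps_1 + eps_2) / 2
```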
A.1.9: Hybrid Images
In this part, we make hybrid images. We again use a composite noise estimator, this time estimating the noise for two different prompts and combining them with frequency filters. The composite noise is computed with the following algorithm: \[\epsilon_1 = \text{CFG of UNet}(x_t, t, p_1)\] \[\epsilon_2 = \text{CFG of UNet}(x_t, t, p_2)\] \[\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)\] We found that a Gaussian with kernel size 15 and sigma 1.8 worked well.
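A minimal sketch of the composite estimate, using torchvision's Gaussian blur as the low-pass filter and its residual as the high-pass:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, embeds_1, embeds_2, null_embeds):
    eps_1 = cfg_noise(x_t, t, embeds_1, null_embeds)
    eps_2 = cfg_noise(x_t, t, embeds_2, null_embeds)
    low  = TF.gaussian_blur(eps_1, kernel_size=15, sigma=1.8)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=15, sigma=1.8)
    return low + high
```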
Part B: Flow Matching from Scratch
B.1.1: Implementing a UNet
We start by building a one-step denoiser using a UNet. This network takes in a noisy image and outputs a denoised one. We used the following architecture:
B.1.2.0: Using the UNet to Train a Denoiser
We now use the UNet to train a denoiser. To do this, we train on pairs of data: a noisy image and the corresponding clean image. To generate noisy images, we used \[z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I),\] where \(z\) is the noisy image, \(x\) is the original, and \(\sigma\) is the amount of noise to be added. For our data, we use MNIST digits. The noising process for \(\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]\) is shown below.
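The noising itself is one line (a sketch):

```python
import torch

def add_noise(x, sigma):
    return x + sigma * torch.randn_like(x)   # z = x + sigma * eps, eps ~ N(0, I)
```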
B.1.2.1: Training
Here we train a denoiser to denoise noisy images with \(\sigma = 0.5\) noise applied to clean images.
Using the MNIST dataset, we create a dataloader with torch and train the model with a batch size of 256 for 5 epochs, using a hidden dimension of 128, MSE as the loss function, and the Adam optimizer with a learning rate of 1e-4.
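A minimal sketch of this training setup (Denoiser stands for the UNet from B.1.1; its constructor argument is an assumption):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = Denoiser(hidden_dim=128).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()

for epoch in range(5):
    for x, _ in loader:                      # labels are unused here
        x = x.cuda()
        z = x + 0.5 * torch.randn_like(x)    # noise with sigma = 0.5
        loss = criterion(model(z), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```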
After 1 Epoch
After 5 Epochs
B.1.2.2: Out-of-Distribution Testing
Our denoiser was trained on MNIST digits noised with \(\sigma=0.5\). Let's see how it performs on different \(\sigma\)'s that it wasn't trained for.
B.1.2.3: Denoising Pure Noise
Let's make it generative! In this section, we trained the previous model to map pure noise back to MNIST digits. Using the same loss, number of epochs, and other parameters, we get a blob that looks vaguely digit-like. Because the noise is independent of the digits, the model can only produce an approximation of the average digit shape. Must be why alarm clocks have lights that form an 8 when they are all on.
after 1 epoch
After 5 epochs
Part B.2: Training a Flow Matching Model
B.2.2
B.2.3
After 1 epoch
After 5 epochs
After 10 epochs
B.2.5
B.2.6
After 1 epoch
After 5 epochs
After 10 epochs
Remove Learning Rate Scheduler
The model does not perform as well without the scheduler, so we had to decrease the learning rate so that the model could learn more precisely.