class: center, middle

# DALL-E 2

Edouard Yvinec

.affiliations[
  ![Sorbonne](images/logo_sorbonne.png)
  ![Isir](images/logo_isir.png)
  ![Datakalab](images/logo_datakalab.png)
  ![Epita](images/Epita.png)
]

---

To understand DALL-E 2, we need to understand a few concepts:

- CLIP
- diffusion models

---

# CLIP

[CLIP](https://arxiv.org/pdf/2103.00020.pdf) is a very efficient encoder for image-caption related tasks.

--

The base idea is fairly simple: use predicting which image goes with which caption as the pre-training task.

.center[
]

---

# Models

In CLIP, we use two models: an encoder for the image and an encoder for the caption.

--

For the image encoder, there are two options:

1. a highly modified ResNet-50
2. a ViT

--

For the text encoder, they use a Transformer.

---

# Training Dataset

They build their own dataset consisting of 400M image-caption pairs. For the sake of clarity and perspective:

| Dataset | Size | Task | Supervision |
| --- | --- | --- | --- |
| CityScapes | 5K | semantic segmentation | labelled |
| MSCOCO | 100K | object detection | labelled |
| ImageNet | 1.3M | classification | labelled |
| ImageNet-21K | 21M | classification | labelled |
| Instagram | 1B | - | unlabelled |
| JFT-3B | 3B | - | unlabelled |
| CLIP dataset | 400M | - | labelled |

---

# Training task

We want the embeddings (or representations, or codes) of a pair to be similar. If we denote by $I\_f$ the image features and by $T\_f$ the text features, we want a high similarity for positive pairs and a low similarity for negative pairs.

Let $l$ be the logits, defined as the dot product of the normalized text and image features. Then the loss function is

$$L = CE(l, y, \text{axis} = 0) + CE(l, y, \text{axis} = 1)$$

where $y$ are the labels (simply the identity matrix).

---

# Training process

Training takes from 12 to 18 days, using between 256 and 592 V100 GPUs depending on the encoder (a V100 is roughly equivalent to your RTX 3090).

.center[

]
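---

# Training task: a minimal sketch

Below is a minimal PyTorch sketch of this objective (not the original CLIP code): the encoders are replaced by random linear projections, and the batch size, dimensions and temperature are arbitrary.

```python
import torch
import torch.nn.functional as F

n, d_img, d_txt, d_emb = 8, 2048, 512, 256

image_encoder = torch.nn.Linear(d_img, d_emb)  # stands in for the ResNet-50 / ViT
text_encoder = torch.nn.Linear(d_txt, d_emb)   # stands in for the Transformer

images = torch.randn(n, d_img)  # dummy batch of images (as flat features)
texts = torch.randn(n, d_txt)   # dummy batch of captions (as flat features)

# Embed and L2-normalize both modalities
I_f = F.normalize(image_encoder(images), dim=-1)
T_f = F.normalize(text_encoder(texts), dim=-1)

# Pairwise similarities: logits[i, j] = <image i, caption j>
temperature = 0.07
logits = I_f @ T_f.t() / temperature

# Positive pairs sit on the diagonal
y = torch.arange(n)

# Symmetric cross-entropy over images (rows) and captions (columns);
# the CLIP paper averages the two terms rather than summing them.
loss = F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)
```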
---

# Zero-shot transfer

.center[

]
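---

# Zero-shot transfer: a sketch

As an illustration, here is one way to run zero-shot classification with a pretrained CLIP, using the Hugging Face `transformers` wrappers (any CLIP implementation would do); the image path and the class names are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]  # simple prompt engineering
image = Image.open("example.jpg")                        # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the image embedding and each prompt embedding,
# turned into a distribution over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The class whose prompt gets the highest similarity is the prediction: no fine-tuning on the target dataset is needed.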
---

# Diffusion Models

We saw several generative models:

--

- VAE
- GAN

--

- Flow-based models
- Diffusion models

This is an opportunity to complete the previous class.

---

# Flow-Based Models

The core idea of normalizing flow models, introduced [here](https://arxiv.org/pdf/1505.05770.pdf), is to transform a simple distribution into a more complex one via a series of invertible transformations.

.center[

]
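---

# Flow-Based Models: change of variables

A minimal sketch of the mechanism, assuming a toy element-wise affine transform (not GLOW or RealNVP): push samples from a simple base distribution through an invertible map, and track the log-determinant of the Jacobian to evaluate the density of the transformed samples.

```python
import torch

base = torch.distributions.Normal(0.0, 1.0)  # the simple distribution

log_scale = torch.tensor([0.5, -0.3])  # s, chosen arbitrarily
shift = torch.tensor([1.0, 2.0])       # t, chosen arbitrarily

x = base.sample((4, 2))               # samples from the base distribution
y = x * torch.exp(log_scale) + shift  # invertible transform y = e^s * x + t

# Change of variables: log p_Y(y) = log p_X(x) - log|det Jacobian|,
# and for an element-wise affine map the log-determinant is simply sum(s).
log_prob_y = base.log_prob(x).sum(dim=-1) - log_scale.sum()
```

Stacking several such invertible transforms (and summing their log-determinants) yields an expressive yet tractable density.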
---

# Invertible Transformations

Here are some examples of invertible transformations:

- affine coupling layer: take two inputs $x\_1$ and $x\_2$, and output $y\_1 = x\_1$ and $y\_2 = x\_2 \times e^{s(x\_1)} + t(x\_1)$, where $s$ and $t$ are learned functions
- GLOW: normalization + $1\times1$ convolution + affine coupling
- auto-regressive models: MLP-like

---

# Diffusion Models

Diffusion models are inspired by non-equilibrium thermodynamics. The core idea consists in gradually adding Gaussian noise to the data and learning to remove it.

.center[

]
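---

# Diffusion Models: forward process sketch

A minimal sketch of the noising (forward) process, using the closed form of $q(x\_t | x\_0)$ with a linear schedule; the schedule values and the number of steps are illustrative, not those of DALL-E 2.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

x0 = torch.randn(1, 3, 32, 32)  # stand-in for a training image
t = 500                         # an intermediate time step
noise = torch.randn_like(x0)

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
x_t = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise
```

The reverse (denoising) model, typically a U-Net, is trained to predict `noise` from $(x\_t, t)$, usually with a simple MSE objective; sampling then runs this model step by step starting from pure noise.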
---

# Pros & Cons

They are able to fit most distributions (they are very flexible).

--

They are very costly to train and to sample from.

---

# Back to DALL-E 2

Their solution uses a diffusion model as a decoder: a prior first maps the CLIP text embedding to a CLIP image embedding, and the decoder then generates an image conditioned on that embedding.

.center[

]
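---

# Back to DALL-E 2: pipeline sketch

A very schematic view of that pipeline; all modules below are untrained placeholders, purely to show the data flow (the real prior and decoder are diffusion models, and the CLIP encoders are the ones seen earlier).

```python
import torch
import torch.nn as nn

d_txt, d_emb = 300, 512

clip_text_encoder = nn.Linear(d_txt, d_emb)  # placeholder for the CLIP text encoder
prior = nn.Linear(d_emb, d_emb)              # placeholder for the prior
decoder = nn.Linear(d_emb, 3 * 64 * 64)      # placeholder for the diffusion decoder

caption = torch.randn(1, d_txt)              # dummy encoded caption

text_emb = clip_text_encoder(caption)          # CLIP text embedding
image_emb = prior(text_emb)                    # predicted CLIP image embedding
image = decoder(image_emb).view(1, 3, 64, 64)  # generated image (here: just noise)
```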
---

class: middle, center

# See you in 15 minutes for the lab session