class: center, middle

# DALL-E 2

Edouard Yvinec

.affiliations[
  ![Sorbonne](images/logo_sorbonne.png)
  ![Isir](images/logo_isir.png)
  ![Datakalab](images/logo_datakalab.png)
  ![Epita](images/Epita.png)
]

---

To understand DALL-E 2, we need to understand a few concepts:

- CLIP
- diffusion models

---

# CLIP

[CLIP](https://arxiv.org/pdf/2103.00020.pdf) is a very efficient encoder for image-caption related tasks.

--

The base idea is fairly simple: use predicting which image goes with which caption as the pre-training task.

.center[
]

---

# Models

In CLIP, we use two models: an encoder for the image and an encoder for the caption.

--

For the image encoder, there are two options:

1. a highly modified ResNet-50
2. a ViT

--

For the text encoder, they use a Transformer.

---

# Training Dataset

They build their own dataset consisting of 400M image-caption pairs. For the sake of clarity and perspective:

| Dataset | Size | Task | Supervision |
| --- | --- | --- | --- |
| CityScapes | 5K | semantic segmentation | labelled |
| MSCOCO | 100K | object detection | labelled |
| ImageNet | 1.3M | classification | labelled |
| ImageNet-21K | 21M | classification | labelled |
| Instagram | 1B | - | unlabelled |
| JFT-3B | 3B | - | unlabelled |
| CLIP dataset | 400M | - | labelled |

---

# Training task

We want the embeddings (or representations, or codes) of a pair to be similar. If we denote by $I\_f$ the image features and by $T\_f$ the text features, we want a high similarity for positive pairs and a low similarity for negative pairs.

Let $l$ be the logits, defined as the dot product of the normalized text and image features. Then the loss function is

$$L = CE(l, y, \text{axis} = 0) + CE(l, y, \text{axis} = 1)$$

where $y$ are the labels (simply the identity matrix).

---

# Training process

Training takes from 12 to 18 days, using between 256 and 592 V100 GPUs depending on the encoder (a V100 is roughly equivalent to your RTX 3090).

.center[

]
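---

# Training task: a minimal sketch

Below is a minimal PyTorch sketch of this objective (not the original CLIP code): the encoders are replaced by random linear projections, and the batch size, dimensions and temperature are arbitrary.

```python
import torch
import torch.nn.functional as F

n, d_img, d_txt, d_emb = 8, 2048, 512, 256

image_encoder = torch.nn.Linear(d_img, d_emb)  # stands in for the ResNet-50 / ViT
text_encoder = torch.nn.Linear(d_txt, d_emb)   # stands in for the Transformer

images = torch.randn(n, d_img)  # dummy batch of images (as flat features)
texts = torch.randn(n, d_txt)   # dummy batch of captions (as flat features)

# Embed and L2-normalize both modalities
I_f = F.normalize(image_encoder(images), dim=-1)
T_f = F.normalize(text_encoder(texts), dim=-1)

# Pairwise similarities: logits[i, j] = <image i, caption j>
temperature = 0.07
logits = I_f @ T_f.t() / temperature

# Positive pairs sit on the diagonal
y = torch.arange(n)

# Symmetric cross-entropy over images (rows) and captions (columns);
# the CLIP paper averages the two terms rather than summing them.
loss = F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)
```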
---

# Zero-shot transfer

.center[

]
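---

# Zero-shot transfer: a sketch

As an illustration, here is one way to run zero-shot classification with a pretrained CLIP, using the Hugging Face `transformers` wrappers (any CLIP implementation would do); the image path and the class names are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]  # simple prompt engineering
image = Image.open("example.jpg")                        # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the image embedding and each prompt embedding,
# turned into a distribution over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The class whose prompt gets the highest similarity is the prediction: no fine-tuning on the target dataset is needed.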
---

# Diffusion Models

We saw several generative models:

--

- VAE
- GAN

--

- Flow-based models
- Diffusion models

This is an opportunity to complete the previous class.

---

# Flow-Based Models

The core idea of normalizing flow models, introduced [here](https://arxiv.org/pdf/1505.05770.pdf), is to transform a simple distribution into a more complex one via a series of invertible transformations.

.center[

]
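---

# Flow-Based Models: change of variables

A minimal sketch of the mechanism, assuming a toy element-wise affine transform (not GLOW or RealNVP): push samples from a simple base distribution through an invertible map, and track the log-determinant of the Jacobian to evaluate the density of the transformed samples.

```python
import torch

base = torch.distributions.Normal(0.0, 1.0)  # the simple distribution

log_scale = torch.tensor([0.5, -0.3])  # s, chosen arbitrarily
shift = torch.tensor([1.0, 2.0])       # t, chosen arbitrarily

x = base.sample((4, 2))               # samples from the base distribution
y = x * torch.exp(log_scale) + shift  # invertible transform y = e^s * x + t

# Change of variables: log p_Y(y) = log p_X(x) - log|det Jacobian|,
# and for an element-wise affine map the log-determinant is simply sum(s).
log_prob_y = base.log_prob(x).sum(dim=-1) - log_scale.sum()
```

Stacking several such invertible transforms (and summing their log-determinants) yields an expressive yet tractable density.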
---

# Invertible Transformations

Here are some examples of invertible transformations:

- affine coupling layer: take two inputs $x\_1$ and $x\_2$, and output $y\_1 = x\_1$ and $y\_2 = x\_2 \times e^{s(x\_1)} + t(x\_1)$, where $s$ and $t$ are learned functions
- GLOW: normalization + $1\times1$ convolution + affine coupling
- auto-regressive models: MLP-like

---

# Diffusion Models

Diffusion models are inspired by non-equilibrium thermodynamics. The core idea consists in gradually adding Gaussian noise to the data and learning to remove it.

.center[

]
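---

# Diffusion Models: forward process sketch

A minimal sketch of the noising (forward) process, using the closed form of $q(x\_t | x\_0)$ with a linear schedule; the schedule values and the number of steps are illustrative, not those of DALL-E 2.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

x0 = torch.randn(1, 3, 32, 32)  # stand-in for a training image
t = 500                         # an intermediate time step
noise = torch.randn_like(x0)

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
x_t = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise
```

The reverse (denoising) model, typically a U-Net, is trained to predict `noise` from $(x\_t, t)$, usually with a simple MSE objective; sampling then runs this model step by step starting from pure noise.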
---

# Pros & Cons

They are able to fit most distributions (they are very flexible).

--

They are very costly to train and to sample from.

---

# Back to DALL-E 2

Their solution uses a diffusion model as a decoder: a prior first maps the CLIP text embedding to a CLIP image embedding, and the decoder then generates an image conditioned on that embedding.

.center[

]
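---

# Back to DALL-E 2: pipeline sketch

A very schematic view of that pipeline; all modules below are untrained placeholders, purely to show the data flow (the real prior and decoder are diffusion models, and the CLIP encoders are the ones seen earlier).

```python
import torch
import torch.nn as nn

d_txt, d_emb = 300, 512

clip_text_encoder = nn.Linear(d_txt, d_emb)  # placeholder for the CLIP text encoder
prior = nn.Linear(d_emb, d_emb)              # placeholder for the prior
decoder = nn.Linear(d_emb, 3 * 64 * 64)      # placeholder for the diffusion decoder

caption = torch.randn(1, d_txt)              # dummy encoded caption

text_emb = clip_text_encoder(caption)          # CLIP text embedding
image_emb = prior(text_emb)                    # predicted CLIP image embedding
image = decoder(image_emb).view(1, 3, 64, 64)  # generated image (here: just noise)
```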
---

class: middle, center

# See you in 15 minutes for the lab session