class: center, middle

# New Architectures

Edouard Yvinec

.affiliations[
  ![Sorbonne](images/logo_sorbonne.png)
  ![Isir](images/logo_isir.png)
  ![Datakalab](images/logo_datakalab.png)
  ![Epita](images/Epita.png)
]

---

## Outline
- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training

---

# NASNet

.center[
]

--

- search for the best sub-structures for a neural network
- NASNet uses a reinforcement-learning controller to propose candidate structures

--

Each candidate model is tested (here a test includes a full training). NAS is usually very computationally intensive, and thus it is mostly large private labs that work on it.

[source article](https://arxiv.org/pdf/1707.07012.pdf)

---

# RandWire

.center[
]

We can use a random search or reinforcement learning.

[source article](https://openaccess.thecvf.com/content_ICCV_2019/papers/Xie_Exploring_Randomly_Wired_Neural_Networks_for_Image_Recognition_ICCV_2019_paper.pdf)

---

# Budgeted Networks

.center[
]

Even better, we can use a soft relaxation of the problem: far from the optimal solution, but much faster.

[source article](https://openaccess.thecvf.com/content_cvpr_2018/papers/Veniat_Learning_TimeMemory-Efficient_Deep_CVPR_2018_paper.pdf)

---

# EfficientNet

.center[
]

One of the best ConvNets in terms of speed/accuracy trade-off.

[source article](https://arxiv.org/pdf/1905.11946.pdf)

---

# EfficientNet V2

.center[
]

Marginal improvements (most changes are in the training process).

[source article](https://arxiv.org/pdf/2104.00298.pdf)

---

# MobileNet V3

.center[
]

Adds squeeze-and-excite blocks to the standard MobileNet V2 model.

[source article](https://arxiv.org/pdf/1905.02244.pdf)

---

## Outline

- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training

---

# NLP side step

.center[
]

Transformers are based on self-attention modules.

---

# NLP side step

.center[
]

Transformers are based on self-attention modules.

---

# Mathematical derivations

Let $X_1,...,X_N$ be a set of inputs (on images, these will be patches of the image). We need to compute the queries $Q_1,...,Q_N$, keys $K_1,...,K_N$ and values $V_1,...,V_N$:

$$ \forall n \in \{1,...,N\}, \quad K_n = M_K X_n, \quad Q_n = M_Q X_n \quad \text{and} \quad V_n = M_V X_n $$

Then, we compute the self-attention scores $S_{n,m}$:

$$ S_{n,m} = \langle Q_n; K_m \rangle $$

The first coordinate ($n$) corresponds to the query (this is important).

---

The scores of each query are turned into attention weights with a softmax:

$$ A\_{n,m} = \text{softmax}\_m(S\_{n,m}) $$

Then, the output for query $n$ is the weighted sum of the values:

$$ \text{Output}\_n = A\_{n,1} \times V\_1 + ... + A\_{n,N} \times V\_N $$

- the self-attention scores are cross-correlations between the queries and the keys
- the softmaxed self-attentions are the weights applied to the values
- for each query, the attention weights form a probability distribution
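---

# Self-attention in code

A minimal single-head self-attention sketch in NumPy, following the derivation above (the $1/\sqrt{d_k}$ scaling is the extra detail from the original paper; shapes and variable names are illustrative, not from the slides):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, M_Q, M_K, M_V):
    """X: (N, d) row-vector patches; M_Q / M_K / M_V: (d, d_k) projections."""
    Q, K, V = X @ M_Q, X @ M_K, X @ M_V        # queries, keys, values
    S = Q @ K.T / np.sqrt(K.shape[-1])         # S[n, m] = <Q_n; K_m>, scaled
    A = softmax(S, axis=-1)                    # attention weights, each row sums to 1
    return A @ V                               # values weighted by the attentions

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))                  # 16 patches of dimension 64
M_Q, M_K, M_V = (rng.normal(size=(64, 64)) for _ in range(3))
print(self_attention(X, M_Q, M_K, M_V).shape)  # (16, 64)
```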
---

# Attention is all you need

.center[
]

[source article](https://arxiv.org/abs/1706.03762)

---

# Visual Image Transformers

.center[
]

[source article](https://arxiv.org/pdf/2010.11929.pdf)

---

# Normalization

Most DNNs (especially convnets) use Batch-Normalization layers ([ref](https://arxiv.org/pdf/1502.03167.pdf)), which apply a per-channel normalization followed by a de-normalization as follows:

$$ \text{BN}(X) = \gamma \frac{X - \mu}{\sigma} + \beta $$

where $\gamma$ and $\beta$ are learned and $\mu$ and $\sigma$ are empirically estimated during training and stored for inference.

Transformers use layer-normalization, which applies a per-tensor normalization followed by a de-normalization:

$$ \text{LN}(X) = \gamma \frac{X - \mu}{\sigma} $$

where $\gamma$ is learned and $\mu$ and $\sigma$ are always empirically estimated, including at inference.
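---

# Normalization in code

A small sketch of the difference in the normalization axes (the $\varepsilon$ term and the learned $\gamma$, $\beta$ are omitted; tensor shapes are illustrative, not from the slides):

```python
import torch

# Batch-Norm (convnets): statistics per channel, over batch and spatial dims
x = torch.randn(8, 32, 14, 14)                 # (batch, channels, H, W)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
sigma = x.std(dim=(0, 2, 3), keepdim=True)
bn = (x - mu) / sigma                          # then scaled by gamma, shifted by beta
# at inference, mu and sigma are replaced by the stored running estimates

# Layer-Norm (transformers): statistics per token, over the feature dimension
tokens = torch.randn(8, 196, 768)              # (batch, patches, embedding dim)
mu = tokens.mean(dim=-1, keepdim=True)
sigma = tokens.std(dim=-1, keepdim=True)
ln = (tokens - mu) / sigma                     # then scaled by gamma
# no running statistics: mu and sigma are recomputed at inference as well
```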
---

# DeiT

.center[
]

We lost the bias (see previous slide): transformers have to learn from data the inductive biases that convnets get for free, and thus need far more data. DeiT ([ref](https://arxiv.org/pdf/2012.12877.pdf)) tackles this issue with heavy data augmentation and regularization.

---

# Swin Transformers

.center[
]

Previous transformers cannot be used as backbones for detection, segmentation, etc., whereas Swin can ([ref](https://arxiv.org/pdf/2103.14030.pdf)).

---

## Outline

- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training

---

# Depthwise Separable Convolutions

---

# MLP-Mixer

.center[
]

- Similarly to separable convolutions, apply one MLP along the channel dimension and one MLP along the patch dimension (see the sketch on the next slide)
- Multiple other papers had the same idea at the same time (including ResMLP)
- Just a different weight sharing
- They are even more data-greedy
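---

# MLP-Mixer in code

A minimal Mixer-style block sketch in PyTorch (layer sizes are illustrative; the token-mixing / channel-mixing structure with LayerNorm and skip connections follows the paper):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(        # mixes information across patches
            nn.Linear(n_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(      # mixes information across channels
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                      # x: (batch, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

print(MixerBlock()(torch.randn(2, 196, 512)).shape)  # torch.Size([2, 196, 512])
```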
---

# Performances

.center[
]

---

## Outline

- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training
---

# Hyper-parameters

Modern methods have an increasing number of hyper-parameters.

- it is important to validate the best value for each hyper-parameter on a validation set
- if it is important to have the same distribution in train and test, it is even more important to have the same distribution in validation and test

---

# Active Learning

.center[
]

- more data is almost always better
- unless it is poorly annotated
- active learning searches for the most informative samples of the dataset to annotate (see the sketch on the next slide)
- these methods are not fully functional yet
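---

# Active Learning in code

A toy uncertainty-sampling sketch, one common selection criterion for active learning (the entropy criterion, the pool size and the budget are illustrative, not from the slides):

```python
import numpy as np

def entropy(probs):
    # predictive entropy per sample: a high value means an uncertain model
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(10), size=1000)       # model predictions on the unlabelled pool
budget = 50                                              # annotation budget
to_annotate = np.argsort(-entropy(pool_probs))[:budget]  # most uncertain samples first
print(to_annotate[:5])
```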
---

# Early Stopping

.center[
]

- prevents overfitting

---

# Scheduler

- A high learning rate at the beginning of training may help
- Acts as a regularization by skipping over poor local minima
- Then decrease the learning rate gradually to go deeper into a local minimum towards the end of training
- Either decrease the learning rate (usually divided by 10) at particular epochs
- Or decrease it when a validation metric (loss, accuracy, etc.) does not improve
- Some schedules, such as cosine annealing with restarts, repeatedly decrease then increase (a little less each time) the learning rate

---

# Smooth Labeling

$$ Y^{SL} = Y \times (1 - \alpha) + \frac{\alpha}{K} $$

usually, $\alpha = 0.1$ (see the sketch on the next slide).

- Avoids overconfidence in the model, when confidence is always close to 0.999…, and thus reduces overfitting
- Avoids miscalibration, where model confidence is not correlated with model accuracy
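---

# Smooth Labeling in code

A minimal sketch of the formula above with $\alpha = 0.1$ and $K$ the number of classes (the example target and $K = 5$ are illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    K = y_onehot.shape[-1]                     # number of classes
    return y_onehot * (1.0 - alpha) + alpha / K

y = np.eye(5)[[2]]                             # one-hot target for class 2, K = 5
print(smooth_labels(y))                        # [[0.02 0.02 0.92 0.02 0.02]]
```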
---

# Data-Augmentations

.center[
]

- Acts as regularization to reduce overfitting
- A LOT of alternatives to MixUp exist (CutOut, CutMix, FixMatch, MixMo, PuzzleMix, etc.)

---

# Knowledge Distillation

- Train one super-mega-large model (called the teacher)
- Distill the knowledge of the teacher onto a smaller model (called the student)
- The loss between the two is the KL divergence (as in VAEs...)
- Often add a temperature $T$ to the logits such that we get (see the sketch on the next slide)

$$ \text{output}_i = \frac{e^{\frac{z_i}{T}}}{\sum_j e^{\frac{z_j}{T}}} $$

--

- reduces the sharpness of the probabilities, leading to more useful information for the student
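---

# Knowledge Distillation in code

A distillation-loss sketch in PyTorch: temperature-scaled teacher and student probabilities compared with a KL divergence (the hard-label cross-entropy term and the loss weighting are omitted; $T = 4$ is an illustrative value):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T ** 2

student_logits = torch.randn(8, 1000, requires_grad=True)   # student outputs
teacher_logits = torch.randn(8, 1000)                        # frozen teacher outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```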
---

# Stochastic Depth

.center[
]

Randomly drop entire residual blocks during training (the identity skip connection is kept).

---

class: middle, center

# For next class, think of a topic that might interest you!