class: center, middle

# New Architectures

Edouard Yvinec

.affiliations[
  ![Sorbonne](images/logo_sorbonne.png)
  ![Isir](images/logo_isir.png)
  ![Datakalab](images/logo_datakalab.png)
  ![Epita](images/Epita.png)
]

---

## Outline
- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training

---

# NASNet

.center[
]

--

- search for the best sub-structures for a neural network
- NASNet uses a reinforcement-learning controller to propose candidate structures

--

Each candidate model is tested (here a test includes a full training). NAS is usually very computationally intensive, and thus it is mostly large private labs that work on it.

[source article](https://arxiv.org/pdf/1707.07012.pdf)

---

# RandWire

.center[
]

We can use a random search or reinforcement learning.

[source article](https://openaccess.thecvf.com/content_ICCV_2019/papers/Xie_Exploring_Randomly_Wired_Neural_Networks_for_Image_Recognition_ICCV_2019_paper.pdf)

---

# Budgeted Networks

.center[
]

Even better, we can use a soft relaxation of the problem: far from the optimal solution, but much faster.

[source article](https://openaccess.thecvf.com/content_cvpr_2018/papers/Veniat_Learning_TimeMemory-Efficient_Deep_CVPR_2018_paper.pdf)

---

# EfficientNet

.center[
]

One of the best ConvNets in terms of speed/accuracy trade-off.

[source article](https://arxiv.org/pdf/1905.11946.pdf)

---

# EfficientNet V2

.center[
]

Marginal improvements (most changes are in the training process).

[source article](https://arxiv.org/pdf/2104.00298.pdf)

---

# MobileNet V3

.center[
]

Adds squeeze-and-excite blocks to the standard MobileNet V2 model.

[source article](https://arxiv.org/pdf/1905.02244.pdf)

---

## Outline

- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training

---

# NLP side step

.center[
]

Transformers are based on self-attention modules.

---

# NLP side step

.center[
]

Transformers are based on self-attention modules.

---

# Mathematical derivations

Let $X_1,...,X_N$ be a set of inputs (on images, these will be patches of the image). We need to compute the queries $Q_1,...,Q_N$, keys $K_1,...,K_N$ and values $V_1,...,V_N$:

$$ \forall n \in \{1,...,N\}, \quad K_n = M_K X_n, \quad Q_n = M_Q X_n \quad \text{and} \quad V_n = M_V X_n $$

Then, we compute the self-attention scores $S_{n,m}$:

$$ S_{n,m} = \langle Q_n; K_m \rangle $$

The first coordinate ($n$) corresponds to the query (this is important).

---

The scores of each query are turned into attention weights with a softmax:

$$ A\_{n,m} = \text{softmax}\_m(S\_{n,m}) $$

Then, the output for query $n$ is the weighted sum of the values:

$$ \text{Output}\_n = A\_{n,1} \times V\_1 + ... + A\_{n,N} \times V\_N $$

- the self-attention scores are cross-correlations between the queries and the keys
- the softmaxed self-attentions are the weights applied to the values
- for each query, the attention weights form a probability distribution
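---

# Self-attention in code

A minimal single-head self-attention sketch in NumPy, following the derivation above (the $1/\sqrt{d_k}$ scaling is the extra detail from the original paper; shapes and variable names are illustrative, not from the slides):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, M_Q, M_K, M_V):
    """X: (N, d) row-vector patches; M_Q / M_K / M_V: (d, d_k) projections."""
    Q, K, V = X @ M_Q, X @ M_K, X @ M_V        # queries, keys, values
    S = Q @ K.T / np.sqrt(K.shape[-1])         # S[n, m] = <Q_n; K_m>, scaled
    A = softmax(S, axis=-1)                    # attention weights, each row sums to 1
    return A @ V                               # values weighted by the attentions

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))                  # 16 patches of dimension 64
M_Q, M_K, M_V = (rng.normal(size=(64, 64)) for _ in range(3))
print(self_attention(X, M_Q, M_K, M_V).shape)  # (16, 64)
```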
---

# Attention is all you need

.center[
]

[source article](https://arxiv.org/abs/1706.03762)

---

# Visual Image Transformers

.center[
]

[source article](https://arxiv.org/pdf/2010.11929.pdf)

---

# Normalization

Most DNNs (especially convnets) use Batch-Normalization layers ([ref](https://arxiv.org/pdf/1502.03167.pdf)), which apply a per-channel normalization followed by a de-normalization as follows:

$$ \text{BN}(X) = \gamma \frac{X - \mu}{\sigma} + \beta $$

where $\gamma$ and $\beta$ are learned and $\mu$ and $\sigma$ are empirically estimated during training and stored for inference.

Transformers use layer-normalization, which applies a per-tensor normalization followed by a de-normalization:

$$ \text{LN}(X) = \gamma \frac{X - \mu}{\sigma} $$

where $\gamma$ is learned and $\mu$ and $\sigma$ are always empirically estimated, including at inference.
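---

# Normalization in code

A small sketch of the difference in the normalization axes (the $\varepsilon$ term and the learned $\gamma$, $\beta$ are omitted; tensor shapes are illustrative, not from the slides):

```python
import torch

# Batch-Norm (convnets): statistics per channel, over batch and spatial dims
x = torch.randn(8, 32, 14, 14)                 # (batch, channels, H, W)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
sigma = x.std(dim=(0, 2, 3), keepdim=True)
bn = (x - mu) / sigma                          # then scaled by gamma, shifted by beta
# at inference, mu and sigma are replaced by the stored running estimates

# Layer-Norm (transformers): statistics per token, over the feature dimension
tokens = torch.randn(8, 196, 768)              # (batch, patches, embedding dim)
mu = tokens.mean(dim=-1, keepdim=True)
sigma = tokens.std(dim=-1, keepdim=True)
ln = (tokens - mu) / sigma                     # then scaled by gamma
# no running statistics: mu and sigma are recomputed at inference as well
```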
---

# DeiT

.center[
]

We lost the bias (see previous slide): transformers have to learn from data the inductive biases that convnets get for free, and thus need far more data. DeiT ([ref](https://arxiv.org/pdf/2012.12877.pdf)) tackles this issue with heavy data augmentation and regularization.

---

# Swin Transformers

.center[
]

Previous transformers cannot be used as backbones for detection, segmentation, etc., whereas Swin can ([ref](https://arxiv.org/pdf/2103.14030.pdf)).

---

## Outline

- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training

---

# Depthwise Separable Convolutions

---

# MLP-Mixer

.center[
]

- Similarly to separable convolutions, apply one MLP along the channel dimension and one MLP along the patch dimension (see the sketch on the next slide)
- Multiple other papers had the same idea at the same time (including ResMLP)
- Just a different weight sharing
- They are even more data-greedy
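---

# MLP-Mixer in code

A minimal Mixer-style block sketch in PyTorch (layer sizes are illustrative; the token-mixing / channel-mixing structure with LayerNorm and skip connections follows the paper):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_patches=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(        # mixes information across patches
            nn.Linear(n_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(      # mixes information across channels
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                      # x: (batch, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

print(MixerBlock()(torch.randn(2, 196, 512)).shape)  # torch.Size([2, 196, 512])
```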
---

# Performances

.center[
]

---

## Outline

- NAS (Neural Architecture Search)
- Transformers
- MLP-Mixer
- Modern Training
---

# Hyper-parameters

Modern methods have an increasing number of hyper-parameters.

- it is important to validate the best value for each hyper-parameter on a validation set
- if it is important to have the same distribution in train and test, it is even more important to have the same distribution in validation and test

---

# Active Learning

.center[
]

- more data is almost always better
- unless it is poorly annotated
- active learning searches for the most informative samples of the dataset to annotate (see the sketch on the next slide)
- these methods are not fully functional yet
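---

# Active Learning in code

A toy uncertainty-sampling sketch, one common selection criterion for active learning (the entropy criterion, the pool size and the budget are illustrative, not from the slides):

```python
import numpy as np

def entropy(probs):
    # predictive entropy per sample: a high value means an uncertain model
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(10), size=1000)       # model predictions on the unlabelled pool
budget = 50                                              # annotation budget
to_annotate = np.argsort(-entropy(pool_probs))[:budget]  # most uncertain samples first
print(to_annotate[:5])
```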
---

# Early Stopping

.center[
]

- prevents overfitting

---

# Scheduler

- A high learning rate at the beginning of training may help
- Acts as a regularization by skipping over poor local minima
- Then decrease the learning rate gradually to go deeper into a local minimum towards the end of training
- Either decrease the learning rate (usually divided by 10) at particular epochs
- Or decrease it when a validation metric (loss, accuracy, etc.) does not improve
- Some schedules, such as cosine annealing with restarts, repeatedly decrease then increase (a little less each time) the learning rate

---

# Smooth Labeling

$$ Y^{SL} = Y \times (1 - \alpha) + \frac{\alpha}{K} $$

usually, $\alpha = 0.1$ (see the sketch on the next slide).

- Avoids overconfidence in the model, when confidence is always close to 0.999…, and thus reduces overfitting
- Avoids miscalibration, where model confidence is not correlated with model accuracy
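---

# Smooth Labeling in code

A minimal sketch of the formula above with $\alpha = 0.1$ and $K$ the number of classes (the example target and $K = 5$ are illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    K = y_onehot.shape[-1]                     # number of classes
    return y_onehot * (1.0 - alpha) + alpha / K

y = np.eye(5)[[2]]                             # one-hot target for class 2, K = 5
print(smooth_labels(y))                        # [[0.02 0.02 0.92 0.02 0.02]]
```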
---

# Data-Augmentations

.center[
]

- Acts as regularization to reduce overfitting
- A LOT of alternatives to MixUp exist (CutOut, CutMix, FixMatch, MixMo, PuzzleMix, etc.)

---

# Knowledge Distillation

- Train one super-mega-large model (called the teacher)
- Distill the knowledge of the teacher onto a smaller model (called the student)
- The loss between the two is the KL divergence (as in VAEs...)
- Often add a temperature $T$ to the logits such that we get (see the sketch on the next slide)

$$ \text{output}_i = \frac{e^{\frac{z_i}{T}}}{\sum_j e^{\frac{z_j}{T}}} $$

--

- reduces the sharpness of the probabilities, leading to more useful information for the student
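---

# Knowledge Distillation in code

A distillation-loss sketch in PyTorch: temperature-scaled teacher and student probabilities compared with a KL divergence (the hard-label cross-entropy term and the loss weighting are omitted; $T = 4$ is an illustrative value):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T ** 2

student_logits = torch.randn(8, 1000, requires_grad=True)   # student outputs
teacher_logits = torch.randn(8, 1000)                        # frozen teacher outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```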
---

# Stochastic Depth

.center[
]

Randomly drop entire residual blocks during training (the identity skip connection is kept).

---

class: middle, center

# For next class, think of a topic that might interest you!