class: center, middle

# From Images to Video

Edouard Yvinec

.affiliations[
  ![Sorbonne](images/logo_sorbonne.png)
  ![Isir](images/logo_isir.png)
  ![Datakalab](images/logo_datakalab.png)
  ![Epita](images/Epita.png)
]

---

This class is mostly based on articles from NeurIPS 2022. But before we get into the details, we have to introduce the base tools for video processing:

- pre-processing
- 3D convolutions
- Conv-LSTM
- 3D self-attention

---

# Pre-Processing

The pre-processing is usually the same as for per-image predictions:

- use the image as such

--

- use the image scaled to $[0;1]$ or $[-1;1]$

--

- use the image normalized with a per-channel mean and variance

---

# 2D convolutions

We studied 2D convolutions, which take an image (a 2D object) and apply a kernel to it (a short code sketch is given in the appendix at the end of the deck).

.center[
]

---

# 3D convolutions

We can use 3D convolutions, which are applied to volumes such as videos.

.center[
]

Consider a video sequence of 20 square images of 224 pixels in height. The output will be a

--

video sequence as well. Assume the output video sequence contains 15 images. Considering a $3\times 3$ convolution, how many parameters does this layer have?

--

$$ (20 \times 3) \times (3\times 3) \times (15 \times 3) = 24300 $$

Here the 20 input frames $\times$ 3 color channels act as the input channels and the 15 output frames $\times$ 3 channels as the output channels, without a bias term (a quick check in code is given in the appendix at the end of the deck).

---

# LSTM

We recall what an LSTM cell is:

--

We need to define the gates: $i$ (input gate), $f$ (forget gate) and $o$ (output gate). We denote by $L$ a linear transformation, by $c$ the cell state and by $h$ the hidden state.

$$ i\_t = \text{sigmoid}(L(x\_t, h\_{t-1}, c\_{t-1})) $$

$$ f\_t = \text{sigmoid}(L(x\_t, h\_{t-1}, c\_{t-1})) $$

$$ c\_t = f\_t c\_{t-1} + i\_t \text{tanh}(L(x\_t, h\_{t-1})) $$

$$ o\_t = \text{sigmoid}(L(x\_t, h\_{t-1}, c\_{t})) $$

$$ h\_t = o\_t \text{tanh}(c\_t) $$

---

# Conv-LSTM

To switch from a regular LSTM to a ConvLSTM, we replace the matrix multiplications of the LSTM with convolutions. This was introduced for precipitation forecasting (a minimal ConvLSTM cell is sketched in the appendix).

According to [[1]](https://arxiv.org/pdf/1506.04214v2.pdf):

- ConvLSTM is better than FC-LSTM in handling spatiotemporal correlations.
- Making the size of the state-to-state convolutional kernel bigger than 1 is essential for capturing spatiotemporal motion patterns.
- Deeper models can produce better results with fewer parameters.
- ConvLSTM performs better than ROVER for precipitation nowcasting.

---

# 3D self-attention

As we saw previously in class, self-attention modules have been used to improve the performance of neural networks (transformers). How do we adapt self-attention modules to handle videos?

--

The standard approach simply consists in using patches extracted from each image of the sequence (a sketch is given in the appendix).

---

# Embracing Consistency

.center[
]

Visual grounding is an essential multi-modal task that aims to localize the object of interest in an image/video based on a text description ([source](https://openreview.net/pdf?id=NzFtM5Pzvm)).

---

# Segmenting Moving Objects

.center[
]

The model is a U-Net architecture with a Transformer bottleneck ([source](https://openreview.net/pdf?id=tUH1Or4xblM)).

---

# Optical Flow

.center[
]

Given a video sequence, we measure the pixel displacements from frame to frame (a minimal example using OpenCV is given in the appendix).

---

# ST-Adapter

This [article](https://openreview.net/pdf?id=uRTW_PgXvc7) raises a very good question: what is the minimum that we need in order to go from per-image inference to actual video inference?

--

.center[
]

The answer is: not much... (a rough sketch of an adapter block is given in the appendix)

---

# VOT

Some problems are by nature video-related, for instance visual object tracking (VOT).

.center[
]

---

What is visual object tracking?

--

.center[
]

Given a single cropped image of the target, we want to follow it in a video.

---

# Siamese Networks

Siamese networks are the standard architectures for VOT.

.center[
]

--

Such networks are efficient at learning to measure similarities between objects and can be exploited for other tasks, such as face recognition (a minimal cross-correlation sketch is given in the appendix).

---

class: middle, center

# No upcoming practical sessions
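---

# Appendix: pre-processing and 2D convolution

A minimal sketch of the pre-processing and 2D convolution slides, assuming PyTorch (the normalization statistics below are example values, not prescribed ones):

```python
import torch
import torch.nn as nn

# a batch of RGB images already scaled to [0, 1]: (batch, channels, height, width)
images = torch.rand(8, 3, 224, 224)

# per-channel normalization with a mean and standard deviation per channel
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
images = (images - mean) / std

# a single 3x3 2D convolution: 3 input channels -> 16 output channels
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = conv2d(images)
print(features.shape)  # torch.Size([8, 16, 224, 224])
```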
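---

# Appendix: 3D convolution parameter count

A quick check of the count from the 3D convolution slide, assuming PyTorch and reading the frames as extra channels; `nn.Conv3d` is also shown as the generic volume operator:

```python
import torch
import torch.nn as nn

# one reading of the exercise: 20 RGB input frames stacked as 60 input channels,
# 15 RGB output frames as 45 output channels, 3x3 kernel, no bias
conv = nn.Conv2d(in_channels=20 * 3, out_channels=15 * 3, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # (20*3) * (3*3) * (15*3) = 24300

# a genuine 3D convolution slides a (t, h, w) kernel over the whole volume
video = torch.rand(1, 3, 20, 224, 224)  # (batch, channels, frames, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=3, kernel_size=(6, 3, 3))
print(conv3d(video).shape)  # torch.Size([1, 3, 15, 222, 222])
```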
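---

# Appendix: ConvLSTM cell

A minimal ConvLSTM cell, assuming PyTorch: the gate equations from the LSTM slide with the linear maps $L$ replaced by a single convolution (the peephole terms on the cell state are dropped for brevity):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # one convolution produces the pre-activations of the four gates at once
        self.conv = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x, h, c):
        i, f, g, o = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# process a video frame by frame: (batch, time, channels, height, width)
video = torch.rand(2, 10, 3, 64, 64)
cell = ConvLSTMCell(in_channels=3, hidden_channels=16)
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros(2, 16, 64, 64)
for t in range(video.shape[1]):
    h, c = cell(video[:, t], h, c)
print(h.shape)  # torch.Size([2, 16, 64, 64])
```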
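---

# Appendix: self-attention over per-frame patches

A minimal sketch of the patch-based approach from the 3D self-attention slide, assuming PyTorch (positional and temporal encodings are omitted):

```python
import torch
import torch.nn as nn

video = torch.rand(2, 8, 3, 224, 224)  # (batch, time, channels, height, width)
B, T, C, H, W = video.shape

# split every frame into 16x16 patches and embed each patch as a 256-d token
to_tokens = nn.Conv2d(C, 256, kernel_size=16, stride=16)
tokens = to_tokens(video.flatten(0, 1))               # (B*T, 256, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)            # (B*T, 196, 256)
tokens = tokens.reshape(B, T * tokens.shape[1], 256)  # one sequence of 8*196 tokens

# standard multi-head self-attention over all spatio-temporal tokens
attention = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out, _ = attention(tokens, tokens, tokens)
print(out.shape)  # torch.Size([2, 1568, 256])
```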
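---

# Appendix: dense optical flow

The per-pixel displacement between two consecutive frames can be estimated, for instance, with the classical Farnebäck algorithm from OpenCV (used here purely as an illustration, on random frames):

```python
import cv2
import numpy as np

# two consecutive grayscale frames of a video (random here, for illustration)
previous_frame = np.random.randint(0, 255, (224, 224), dtype=np.uint8)
next_frame = np.random.randint(0, 255, (224, 224), dtype=np.uint8)

# dense optical flow: one (dx, dy) displacement vector per pixel
flow = cv2.calcOpticalFlowFarneback(previous_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (224, 224, 2)
```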
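---

# Appendix: adapter block for a frozen image model

The ST-Adapter paper adapts a frozen image backbone to video with small adapter modules. The sketch below is only illustrative (the sizes, activation and placement are assumptions, not the paper's exact design): a channel bottleneck around a depthwise 3D convolution, added residually to the per-frame features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # depthwise 3D convolution mixing information across frames and space
        self.dwconv = nn.Conv3d(bottleneck, bottleneck, kernel_size=3,
                                padding=1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # x: per-frame tokens arranged as (batch, time, height, width, dim)
        z = self.down(x).permute(0, 4, 1, 2, 3)    # (B, bottleneck, T, H, W)
        z = self.dwconv(z).permute(0, 2, 3, 4, 1)  # back to (B, T, H, W, bottleneck)
        return x + self.up(F.gelu(z))

tokens = torch.rand(2, 8, 14, 14, 768)  # 8 frames of 14x14 ViT-like tokens
print(SpatioTemporalAdapter()(tokens).shape)  # torch.Size([2, 8, 14, 14, 768])
```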
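---

# Appendix: Siamese similarity by cross-correlation

A minimal, SiamFC-style sketch of the similarity computation, assuming PyTorch (the backbone and the sizes are arbitrary): the features of the target crop are used as a convolution kernel over the features of the search image, yielding a similarity map whose peak locates the object:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# a small shared backbone applied to both the target crop and the search image
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

target = torch.rand(1, 3, 64, 64)     # single cropped image of the target
search = torch.rand(1, 3, 256, 256)   # current frame (search region)

target_features = backbone(target)    # (1, 64, 16, 16)
search_features = backbone(search)    # (1, 64, 64, 64)

# cross-correlation: the target features act as the convolution kernel
score_map = F.conv2d(search_features, target_features)
print(score_map.shape)                # torch.Size([1, 1, 49, 49])
```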