class: center, middle

# From Images to Video

Edouard Yvinec

.affiliations[
  ![Sorbonne](images/logo_sorbonne.png)
  ![Isir](images/logo_isir.png)
  ![Datakalab](images/logo_datakalab.png)
  ![Epita](images/Epita.png)
]

---

This class is mostly based on articles from NeurIPS 2022. But before we get into the details, we have to introduce the base tools for video processing:

- pre-processing
- 3D convolutions
- Conv-LSTM
- 3D self-attention

---

# Pre-Processing

The pre-processing is usually the same as for per-image predictions:

- use the image as such

--

- use the image scaled to $[0;1]$ or $[-1;1]$

--

- use the image normalized with a per-channel mean and variance

---

# 2D convolutions

We studied 2D convolutions, which take an image (a 2D object) and apply a kernel to it (a short code sketch is given in the appendix at the end of the deck).

.center[
]

---

# 3D convolutions

We can use 3D convolutions, which are applied to volumes such as videos.

.center[
]

Consider a video sequence of 20 square images of 224 pixels in height. The output will be a

--

video sequence as well. Assume the output video sequence contains 15 images. Considering a $3\times 3$ convolution, how many parameters does this layer have?

--

$$ (20 \times 3) \times (3\times 3) \times (15 \times 3) = 24300 $$

Here the 20 input frames $\times$ 3 color channels act as the input channels and the 15 output frames $\times$ 3 channels as the output channels, without a bias term (a quick check in code is given in the appendix at the end of the deck).

---

# LSTM

We recall what an LSTM cell is:

--

We need to define the gates: $i$ (input gate), $f$ (forget gate) and $o$ (output gate). We denote by $L$ a linear transformation, by $c$ the cell state and by $h$ the hidden state.

$$ i\_t = \text{sigmoid}(L(x\_t, h\_{t-1}, c\_{t-1})) $$

$$ f\_t = \text{sigmoid}(L(x\_t, h\_{t-1}, c\_{t-1})) $$

$$ c\_t = f\_t c\_{t-1} + i\_t \text{tanh}(L(x\_t, h\_{t-1})) $$

$$ o\_t = \text{sigmoid}(L(x\_t, h\_{t-1}, c\_{t})) $$

$$ h\_t = o\_t \text{tanh}(c\_t) $$

---

# Conv-LSTM

To switch from a regular LSTM to a ConvLSTM, we replace the matrix multiplications of the LSTM with convolutions. This was introduced for precipitation forecasting (a minimal ConvLSTM cell is sketched in the appendix).

According to [[1]](https://arxiv.org/pdf/1506.04214v2.pdf):

- ConvLSTM is better than FC-LSTM in handling spatiotemporal correlations.
- Making the size of the state-to-state convolutional kernel bigger than 1 is essential for capturing spatiotemporal motion patterns.
- Deeper models can produce better results with fewer parameters.
- ConvLSTM performs better than ROVER for precipitation nowcasting.

---

# 3D self-attention

As we saw previously in class, self-attention modules have been used to improve the performance of neural networks (transformers). How do we adapt self-attention modules to handle videos?

--

The standard approach simply consists in using patches extracted from each image of the sequence (a sketch is given in the appendix).

---

# Embracing Consistency

.center[
]

Visual grounding is an essential multi-modal task that aims to localize the object of interest in an image/video based on a text description ([source](https://openreview.net/pdf?id=NzFtM5Pzvm)).

---

# Segmenting Moving Objects

.center[
]

The model is a U-Net architecture with a Transformer bottleneck ([source](https://openreview.net/pdf?id=tUH1Or4xblM)).

---

# Optical Flow

.center[
]

Given a video sequence, we measure the pixel displacements from frame to frame (a minimal example using OpenCV is given in the appendix).

---

# ST-Adapter

This [article](https://openreview.net/pdf?id=uRTW_PgXvc7) raises a very good question: what is the minimum that we need in order to go from per-image inference to actual video inference?

--

.center[
]

The answer is: not much... (a rough sketch of an adapter block is given in the appendix)

---

# VOT

Some problems are by nature video-related, for instance visual object tracking (VOT).

.center[
]

---

What is visual object tracking?

--

.center[
]

Given a single cropped image of the target, we want to follow it in a video.

---

# Siamese Networks

Siamese networks are the standard architectures for VOT.

.center[
]

--

Such networks are efficient at learning to measure similarities between objects and can be exploited for other tasks, such as face recognition (a minimal cross-correlation sketch is given in the appendix).

---

class: middle, center

# No upcoming practical sessions
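---

# Appendix: pre-processing and 2D convolution

A minimal sketch of the pre-processing and 2D convolution slides, assuming PyTorch (the normalization statistics below are example values, not prescribed ones):

```python
import torch
import torch.nn as nn

# a batch of RGB images already scaled to [0, 1]: (batch, channels, height, width)
images = torch.rand(8, 3, 224, 224)

# per-channel normalization with a mean and standard deviation per channel
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
images = (images - mean) / std

# a single 3x3 2D convolution: 3 input channels -> 16 output channels
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = conv2d(images)
print(features.shape)  # torch.Size([8, 16, 224, 224])
```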
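---

# Appendix: 3D convolution parameter count

A quick check of the count from the 3D convolution slide, assuming PyTorch and reading the frames as extra channels; `nn.Conv3d` is also shown as the generic volume operator:

```python
import torch
import torch.nn as nn

# one reading of the exercise: 20 RGB input frames stacked as 60 input channels,
# 15 RGB output frames as 45 output channels, 3x3 kernel, no bias
conv = nn.Conv2d(in_channels=20 * 3, out_channels=15 * 3, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # (20*3) * (3*3) * (15*3) = 24300

# a genuine 3D convolution slides a (t, h, w) kernel over the whole volume
video = torch.rand(1, 3, 20, 224, 224)  # (batch, channels, frames, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=3, kernel_size=(6, 3, 3))
print(conv3d(video).shape)  # torch.Size([1, 3, 15, 222, 222])
```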
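---

# Appendix: ConvLSTM cell

A minimal ConvLSTM cell, assuming PyTorch: the gate equations from the LSTM slide with the linear maps $L$ replaced by a single convolution (the peephole terms on the cell state are dropped for brevity):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # one convolution produces the pre-activations of the four gates at once
        self.conv = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x, h, c):
        i, f, g, o = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# process a video frame by frame: (batch, time, channels, height, width)
video = torch.rand(2, 10, 3, 64, 64)
cell = ConvLSTMCell(in_channels=3, hidden_channels=16)
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros(2, 16, 64, 64)
for t in range(video.shape[1]):
    h, c = cell(video[:, t], h, c)
print(h.shape)  # torch.Size([2, 16, 64, 64])
```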
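---

# Appendix: self-attention over per-frame patches

A minimal sketch of the patch-based approach from the 3D self-attention slide, assuming PyTorch (positional and temporal encodings are omitted):

```python
import torch
import torch.nn as nn

video = torch.rand(2, 8, 3, 224, 224)  # (batch, time, channels, height, width)
B, T, C, H, W = video.shape

# split every frame into 16x16 patches and embed each patch as a 256-d token
to_tokens = nn.Conv2d(C, 256, kernel_size=16, stride=16)
tokens = to_tokens(video.flatten(0, 1))               # (B*T, 256, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)            # (B*T, 196, 256)
tokens = tokens.reshape(B, T * tokens.shape[1], 256)  # one sequence of 8*196 tokens

# standard multi-head self-attention over all spatio-temporal tokens
attention = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out, _ = attention(tokens, tokens, tokens)
print(out.shape)  # torch.Size([2, 1568, 256])
```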
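---

# Appendix: dense optical flow

The per-pixel displacement between two consecutive frames can be estimated, for instance, with the classical Farnebäck algorithm from OpenCV (used here purely as an illustration, on random frames):

```python
import cv2
import numpy as np

# two consecutive grayscale frames of a video (random here, for illustration)
previous_frame = np.random.randint(0, 255, (224, 224), dtype=np.uint8)
next_frame = np.random.randint(0, 255, (224, 224), dtype=np.uint8)

# dense optical flow: one (dx, dy) displacement vector per pixel
flow = cv2.calcOpticalFlowFarneback(previous_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (224, 224, 2)
```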
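---

# Appendix: adapter block for a frozen image model

The ST-Adapter paper adapts a frozen image backbone to video with small adapter modules. The sketch below is only illustrative (the sizes, activation and placement are assumptions, not the paper's exact design): a channel bottleneck around a depthwise 3D convolution, added residually to the per-frame features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # depthwise 3D convolution mixing information across frames and space
        self.dwconv = nn.Conv3d(bottleneck, bottleneck, kernel_size=3,
                                padding=1, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # x: per-frame tokens arranged as (batch, time, height, width, dim)
        z = self.down(x).permute(0, 4, 1, 2, 3)    # (B, bottleneck, T, H, W)
        z = self.dwconv(z).permute(0, 2, 3, 4, 1)  # back to (B, T, H, W, bottleneck)
        return x + self.up(F.gelu(z))

tokens = torch.rand(2, 8, 14, 14, 768)  # 8 frames of 14x14 ViT-like tokens
print(SpatioTemporalAdapter()(tokens).shape)  # torch.Size([2, 8, 14, 14, 768])
```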
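---

# Appendix: Siamese similarity by cross-correlation

A minimal, SiamFC-style sketch of the similarity computation, assuming PyTorch (the backbone and the sizes are arbitrary): the features of the target crop are used as a convolution kernel over the features of the search image, yielding a similarity map whose peak locates the object:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# a small shared backbone applied to both the target crop and the search image
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

target = torch.rand(1, 3, 64, 64)     # single cropped image of the target
search = torch.rand(1, 3, 256, 256)   # current frame (search region)

target_features = backbone(target)    # (1, 64, 16, 16)
search_features = backbone(search)    # (1, 64, 64, 64)

# cross-correlation: the target features act as the convolution kernel
score_map = F.conv2d(search_features, target_features)
print(score_map.shape)                # torch.Size([1, 1, 49, 49])
```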