
From Images to Video

Edouard Yvinec

Sorbonne, ISIR, Datakalab, EPITA

This class is mostly based on articles from NeurIPS 2022. But before we get into the details, we have to introduce the basic tools for video processing:

  • pre-processing
  • 3D convolutions
  • Conv-LSTM
  • 3D self-attention

Pre-Processing

The pre-processing is usually the same as for per-image predictions (each option is sketched in code below):

  • use the image as such
  • use the image scaled to [0;1] or [-1;1]
  • use the image normalized with a per-channel mean and variance
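
As an illustration, here is a minimal sketch of these three options in PyTorch. The mean and standard-deviation values are the usual ImageNet statistics, given here as an assumption since the slides do not specify them.

```python
import torch

def preprocess(image: torch.Tensor, mode: str = "normalize") -> torch.Tensor:
    """image: uint8 tensor of shape (3, H, W)."""
    x = image.float()
    if mode == "raw":        # use the image as such
        return x
    if mode == "scale01":    # scale to [0; 1]
        return x / 255.0
    if mode == "scale11":    # scale to [-1; 1]
        return x / 127.5 - 1.0
    # normalize with a per-channel mean and standard deviation
    # (ImageNet statistics, assumed here for illustration)
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    return (x / 255.0 - mean) / std
```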

2D convolutions

We previously studied 2D convolutions, which take an image (a 2D object) and apply a kernel to it.

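As a quick reminder, in standard PyTorch (shapes in comments):

```python
import torch
import torch.nn as nn

# A 2D convolution slides a 3x3 kernel over the spatial dimensions of an image.
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
out = conv2d(image)                  # (1, 16, 224, 224)
```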

3D convolutions

We can instead use 3D convolutions, which are applied to volumes such as a video.

Given a video sequence of 20 square images of 224 pixels in height, the output will be a video sequence as well. Assume the output video sequence contains 15 images. Considering a 3×3 convolution, how many parameters does this layer have? Treating the 20 input frames (3 color channels each) as input channels and the 15 output frames (3 channels each) as output channels, we get (20×3)×(3×3)×(15×3) = 24300.
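
This count can be checked with a short sketch; it assumes, as in the computation above, that frames are stacked along the channel axis:

```python
import torch.nn as nn

# 20 input frames x 3 channels = 60 input channels,
# 15 output frames x 3 channels = 45 output channels, 3x3 kernel, no bias.
layer = nn.Conv2d(in_channels=60, out_channels=45, kernel_size=3, bias=False)
print(sum(p.numel() for p in layer.parameters()))  # 60 * 45 * 3 * 3 = 24300
```

Note that an nn.Conv3d that also slides its kernel along the time axis would share weights across frames and therefore have far fewer parameters; the count above follows the slide's stacked-frames reading.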

LSTM

We recall what an LSTM cell is.

We need to define the gates: i the input gate, f the forget gate, and o the output gate. We denote by L a linear transformation, c the cell state, and h the hidden state:

i_t = sigmoid(L(x_t, h_{t-1}, c_{t-1}))
f_t = sigmoid(L(x_t, h_{t-1}, c_{t-1}))
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(L(x_t, h_{t-1}))
o_t = sigmoid(L(x_t, h_{t-1}, c_t))
h_t = o_t ⊙ tanh(c_t)
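
A minimal sketch of one cell step implementing exactly these equations (note that c enters the gate inputs, i.e. peephole connections); each nn.Linear plays the role of one instance of L:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    """One step of the LSTM cell defined by the equations above."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.L_i = nn.Linear(d_in + 2 * d_hidden, d_hidden)  # input gate
        self.L_f = nn.Linear(d_in + 2 * d_hidden, d_hidden)  # forget gate
        self.L_c = nn.Linear(d_in + d_hidden, d_hidden)      # cell update
        self.L_o = nn.Linear(d_in + 2 * d_hidden, d_hidden)  # output gate

    def forward(self, x_t, h_prev, c_prev):
        i_t = torch.sigmoid(self.L_i(torch.cat([x_t, h_prev, c_prev], -1)))
        f_t = torch.sigmoid(self.L_f(torch.cat([x_t, h_prev, c_prev], -1)))
        c_t = f_t * c_prev + i_t * torch.tanh(self.L_c(torch.cat([x_t, h_prev], -1)))
        o_t = torch.sigmoid(self.L_o(torch.cat([x_t, h_prev, c_t], -1)))
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t
```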

Conv-LSTM

To switch from a regular LSTM to a ConvLSTM, we replace the matrix multiplications in the LSTM with convolutions (see the sketch after this list). This was introduced for precipitation forecasting. According to [1]:

  • ConvLSTM is better than FC-LSTM in handling spatiotemporal correlations.
  • Making the size of the state-to-state convolutional kernel bigger than 1 is essential for capturing spatiotemporal motion patterns.
  • Deeper models can produce better results with fewer parameters.
  • ConvLSTM performs better than ROVER for precipitation nowcasting.
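
A minimal sketch of a ConvLSTM step: the same gates as above, but every linear map is replaced by a convolution over feature maps. Computing all four gates with one shared convolution is a common implementation shortcut, and the peephole terms are omitted for brevity; kernel size 3 follows the advice of state-to-state kernels bigger than 1.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: LSTM gates with convolutions in place of matrix
    multiplications (simplified: no peephole terms)."""
    def __init__(self, c_in: int, c_hidden: int, k: int = 3):
        super().__init__()
        # A single convolution produces all four gates at once.
        self.conv = nn.Conv2d(c_in + c_hidden, 4 * c_hidden, k, padding=k // 2)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)
        c_t = torch.sigmoid(f_t) * c_prev + torch.sigmoid(i_t) * torch.tanh(g_t)
        h_t = torch.sigmoid(o_t) * torch.tanh(c_t)
        return h_t, c_t
```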

3D self-attention

As we saw previously in class, self-attention modules have been used to improve the performance of neural networks (transformers).

How do we adapt self-attention modules to handle videos?

The standard approach consists in simply using patches extracted from every image of the sequence, so that self-attention operates over all the resulting spatio-temporal tokens.
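
A minimal sketch of this idea, assuming non-overlapping 16×16 patches and a single off-the-shelf attention layer (a real video transformer adds positional embeddings and stacks many such layers):

```python
import torch
import torch.nn as nn

T, C, H, W, P, D = 8, 3, 224, 224, 16, 256  # frames, channels, size, patch, embed dim

video = torch.randn(1, T, C, H, W)
# Cut every frame into 16x16 patches and flatten them into one token sequence.
patches = video.unfold(3, P, P).unfold(4, P, P)           # (1, T, C, H/P, W/P, P, P)
tokens = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(1, -1, C * P * P)
tokens = nn.Linear(C * P * P, D)(tokens)                  # (1, T * 14 * 14, D)

# Self-attention over all spatio-temporal tokens at once.
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)                     # (1, T * 196, D)
```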

Embracing Consistency

Visual grounding is an essential multi-modal task that aims to localize the object of interest in an image/video based on a text description (source).


Segmenting Moving Objects

The model is a U-Net architecture with a Transformer bottleneck (source).


Optical Flow

Given a video sequence, we measure the pixel displacement from frame to frame.

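A minimal sketch of dense optical flow between two consecutive frames, using OpenCV's Farnebäck estimator as one possible choice (the random frames below are stand-ins for real video):

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (random stand-ins for real video frames).
prev = np.random.randint(0, 255, (224, 224), dtype=np.uint8)
curr = np.random.randint(0, 255, (224, 224), dtype=np.uint8)

# flow[y, x] = (dx, dy): the displacement of each pixel from `prev` to `curr`.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
print(flow.shape)  # (224, 224, 2)
```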

ST-Adapter

This article raises a very good question: what is the minimum we need to go from per-image inference to actual video inference?

The answer is: not much...
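
As a rough illustration of the adapter idea: a small bottleneck module with a depthwise 3D convolution, inserted into a frozen per-image backbone so that only the adapter learns temporal structure. This sketch is an assumption-laden rendition of that idea (module name, sizes and placement are mine), not the authors' exact code.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Hypothetical sketch: bottleneck + depthwise 3D conv over frame tokens."""
    def __init__(self, d: int, d_bottleneck: int = 64, k: int = 3):
        super().__init__()
        self.down = nn.Linear(d, d_bottleneck)
        self.conv = nn.Conv3d(d_bottleneck, d_bottleneck, k,
                              padding=k // 2, groups=d_bottleneck)  # depthwise
        self.up = nn.Linear(d_bottleneck, d)

    def forward(self, x):
        # x: (batch, frames, height, width, channels) feature grid.
        z = self.down(x)
        z = z.permute(0, 4, 1, 2, 3)              # (B, C, T, H, W) for Conv3d
        z = self.conv(z).permute(0, 2, 3, 4, 1)   # back to (B, T, H, W, C)
        return x + self.up(z)                     # residual connection
```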

VOT

Some problems are by nature video-related, for instance visual object tracking (VOT).

What is visual object tracking?

Given a single cropped image of the target, we want to follow it through a video.

Siamese Networks

Siamese networks are the standard architectures for VOT.

Such networks are efficient at learning to measure similarities between objects, and can be exploited for other tasks such as face recognition.
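
A minimal sketch in the style of SiamFC: a shared backbone embeds both the target crop and the search region, and cross-correlating the two feature maps yields a response map whose peak locates the target. The tiny backbone here is a placeholder, not a real tracking network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseTracker(nn.Module):
    """Shared backbone + cross-correlation, in the style of SiamFC."""
    def __init__(self):
        super().__init__()
        # Placeholder backbone; real trackers use much deeper networks.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
        )

    def forward(self, target, search):
        z = self.backbone(target)  # (1, 64, h, w): template features
        x = self.backbone(search)  # (1, 64, H, W): search-region features
        # Cross-correlate: the template features act as a convolution kernel.
        return F.conv2d(x, z)      # (1, 1, H - h + 1, W - w + 1) response map

target = torch.randn(1, 3, 127, 127)  # cropped image of the target
search = torch.randn(1, 3, 255, 255)  # larger search region in a later frame
response = SiameseTracker()(target, search)  # the peak gives the target position
```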

No upcoming practical sessions

