An overview of self-supervised learning for videos, covering methods that deal with temporal redundancy and information leakage and lead to more generalized models with significantly lower compute requirements for training video-based models.
Self-supervised learning has emerged as a compelling alternative to supervised learning in recent years. It has been shown to beat supervised learning on image classification benchmarks and is a great option when annotating or labeling data is too expensive. However, its impact and performance on videos still need to be investigated, since videos are inherently multidimensional and complex, with both spatial and temporal dimensions.
In this article, we will briefly overview Masked Autoencoders as applied to images in a self-supervised setting, discuss why videos need special attention, and review the VideoMAE architecture and its follow-up work.
The ImageMAE architecture introduced by He et al. in Masked Autoencoders Are Scalable Vision Learners, 2022 was inspired by the success of Masked Modelling in NLP and was based on a straightforward idea:
An image is converted into a set of non-overlapping patches, which are then masked randomly. The visible subset of patches is fed into an encoder, which projects them into a latent representation space. A lightweight decoder then operates on these latent representations and the masked tokens to reconstruct the original image.
Two key design principles of this approach are a simple reconstruction loss and a high masking ratio:
Reconstruction Loss: The decoder is tasked with reconstructing the input image and therefore predicts pixel values for each masked patch. A natural loss formulation emerges: the Mean Squared Error (MSE) between the reconstructed and original images in pixel space, computed only on the masked patches. The authors also report improved performance when using per-patch normalized pixel values as the reconstruction target.
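As a rough sketch of this loss (assuming the image has already been split into flattened pixel patches; `mae_reconstruction_loss` and the tensor shapes are illustrative, not taken from the paper's code):

```python
import torch

def mae_reconstruction_loss(pred, target, mask, eps=1e-6):
    """MSE between predicted and original patches, computed only on masked patches.

    pred, target: (batch, num_patches, patch_dim) flattened pixel patches
    mask: (batch, num_patches), 1 for masked patches, 0 for visible ones
    """
    # Per-patch normalization of the reconstruction target, reported to improve results.
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + eps).sqrt()

    loss = (pred - target) ** 2              # (B, N, D)
    loss = loss.mean(dim=-1)                 # per-patch MSE -> (B, N)
    return (loss * mask).sum() / mask.sum()  # average over masked patches only
```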
High Masking Ratio: Based on extensive studies, the authors find that this method works well even with high masking proportions (>75%), which improves efficiency when training high-capacity models that generalize well. The authors use Vision Transformers (ViTs) as encoders. Because the encoder operates only on the visible patches (~25% of the original image), they can train very large encoders with only a fraction of the compute and memory.
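A minimal sketch of the random masking step, assuming the patches have already been embedded as tokens (function and variable names are illustrative):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of tokens; only these visible tokens go through the encoder.

    tokens: (batch, num_patches, dim)
    Returns the visible tokens and a binary mask (1 = masked) used by the loss.
    """
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patches
    ids_keep = ids_shuffle[:, :len_keep]             # indices of visible patches

    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                  # 0 = visible, 1 = masked
    return visible, mask

# With mask_ratio=0.75 the encoder processes only ~25% of the patches, which is
# what makes training large ViT encoders comparatively cheap.
```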
Videos are often densely captured at a high frame rate, so their semantics vary slowly over time. This phenomenon, termed temporal redundancy, leads to issues when applying masked modeling to videos.
Videos can be seen as the evolution of a scene over time with correspondence between the consecutive frames. This correlation leads to information leakage during the reconstruction process.
Thus, for a given masked part of the video (termed a cube), it becomes easy to find an unmasked, highly correlated copy in adjacent frames. This might lead to the model learning “shortcut” features that don’t generalize to new scenes.
To overcome these challenges in applying masked modeling to videos, Tong et al. introduced a new method in VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, 2022. VideoMAE is a simple strategy that not only effectively improves pre-training performance but also greatly reduces computational cost thanks to its asymmetric encoder-decoder architecture. Models pre-trained with VideoMAE significantly outperform those trained from scratch or pre-trained with contrastive learning methods.
VideoMAE is a simple extension of ImageMAE with two key properties: an extremely high masking ratio (90%–95%) and a tube masking strategy, which masks the same spatial positions across all frames so that a masked cube cannot simply be copied from a neighboring frame.
The following Gradio Space visualizes the masking process of VideoMAE.
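As a minimal illustration of the tube masking idea (a sketch under the assumption that the clip is tokenized into a T×H×W grid of cubes; shapes and names are illustrative):

```python
import torch

def tube_masking(num_frames, h, w, mask_ratio=0.9):
    """Sample one spatial mask and repeat it across all frames (a 'tube').

    Returns a boolean mask of shape (num_frames, h, w); True = masked.
    Masking the same spatial positions in every frame prevents a masked cube
    from being trivially recovered from a correlated copy in a neighboring frame.
    """
    num_patches = h * w
    num_masked = int(num_patches * mask_ratio)

    spatial_mask = torch.zeros(num_patches, dtype=torch.bool)
    spatial_mask[torch.randperm(num_patches)[:num_masked]] = True

    return spatial_mask.view(1, h, w).expand(num_frames, h, w)

mask = tube_masking(num_frames=8, h=14, w=14, mask_ratio=0.9)
```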
Instead of applying attention only within the spatial domain of each frame, the authors use a Joint Space-Time Attention policy to learn representations across all tokens in a clip and thus capture temporal dependencies across frames. The downside is that similarity is computed for every pair of tokens, which is computationally costly given the large number of patches in a video clip.
Several follow-up works have investigated the impact of space-time attention variants on video understanding tasks. Bertasius et al. in Is Space-Time Attention All You Need for Video Understanding?, 2021 proposed a more efficient architecture for spatiotemporal attention, termed Divided Space-Time Attention, in which temporal attention and spatial attention are applied separately, one after the other.
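The sketch below contrasts the two: joint space-time attention attends over all frames × spatial tokens at once (cost growing with the square of that product), while divided attention applies a temporal pass and a spatial pass in sequence. It is a simplified illustration, not the TimeSformer implementation; the class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention followed by spatial attention, instead of one joint
    attention over all (frames * spatial_tokens) tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, spatial_tokens, dim)
        B, T, S, D = x.shape

        # Temporal attention: each spatial location attends across frames (~T^2 per location).
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends across its own tokens (~S^2 per frame).
        xs = x.reshape(B * T, S, D)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(B, T, S, D)
```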
The authors conduct extensive experiments and report that the VideoMAE architecture is a data-efficient learner for self-supervised video pre-training. Notably, even with only 3.5k training clips, VideoMAE achieves satisfactory accuracy on the HMDB51 dataset, demonstrating its effectiveness with limited data.
In a follow-up work, Wang et al. propose a dual masking strategy for VideoMAE in VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking, 2023. They further increase the efficiency of the VideoMAE model by applying a masking map to the decoder as well, so the model learns to reconstruct only the subset of pixel cubes selected by running cell masking.
This enables large-scale VideoMAE pre-training under a limited computational budget: the decoder mask shortens the decoder's input sequence for higher efficiency while retaining information comparable to full reconstruction.
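A simplified sketch of the dual-masking idea: the encoder still sees only unmasked cubes, and the decoder reconstructs just a subset of the masked ones. For brevity this sketch picks the decoder subset at random, whereas the paper uses running cell masking; names are illustrative.

```python
import torch

def dual_masking(num_tokens, enc_mask_ratio=0.9, dec_keep_ratio=0.5):
    """Return index sets for the encoder input and the decoder reconstruction targets."""
    perm = torch.randperm(num_tokens)
    num_visible = int(num_tokens * (1 - enc_mask_ratio))

    visible_idx = perm[:num_visible]    # tokens the encoder actually processes
    masked_idx = perm[num_visible:]     # candidate cubes for reconstruction

    # Decoder mask: reconstruct only a fraction of the masked cubes
    # (simplified here as a random choice instead of running cell masking).
    num_dec = int(len(masked_idx) * dec_keep_ratio)
    dec_idx = masked_idx[torch.randperm(len(masked_idx))[:num_dec]]

    return visible_idx, dec_idx

# The reconstruction loss is then computed only on the cubes in dec_idx, which
# shortens the decoder's input sequence and further cuts pre-training cost.
```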
As opposed to VideoMAE, which requires pre-training an individual model for each dataset, the authors aim to learn a universal pre-trained model that can be transferred to different downstream tasks.
Huang et al. in MGMAE: Motion Guided Masking for Video Masked Autoencoding, 2023 introduced a new motion-guided masking strategy that explicitly incorporates motion information to build a temporally consistent masking volume. This is based on the insight that motion is a general and unique prior in video, which should be taken into account during masked pre-training.
The optical flow representation explicitly encodes the movement of each pixel from the current frame to the next one. This is then used to align masking maps between adjacent frames to build consistent masking volumes across time. In particular, the authors use an online and lightweight optical flow estimator to capture motion information.
First, a masking map is randomly generated at the base frame (by default, the middle frame). The estimated optical flow is then used to warp this initial masking map to adjacent frames. As a result of multiple warping operations, a temporally consistent masking volume is built for all frames in the video. Based on this masking volume, the set of tokens visible to the MAE encoder is sampled in a frame-wise manner using top-k selection.
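A rough sketch of these two steps, warping a soft masking map with estimated flow and then taking a frame-wise top-k, is shown below; the optical flow estimator itself is omitted and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def warp_mask_with_flow(mask, flow):
    """Warp a soft masking map to an adjacent frame using estimated optical flow.

    mask: (1, 1, H, W) soft mask values for the base frame
    flow: (1, 2, H, W) backward flow from the adjacent frame to the base frame
    """
    _, _, H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)  # (1, H, W, 2) pixel coords
    grid = grid + flow.permute(0, 2, 3, 1)                     # displace by the flow
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0          # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(mask, grid, align_corners=True)

def frame_wise_topk(mask_volume, mask_ratio=0.9):
    """Per frame, mark the top-k highest-scoring positions as masked."""
    T, H, W = mask_volume.shape
    k = int(H * W * mask_ratio)
    flat = mask_volume.view(T, -1)
    hard = torch.zeros_like(flat, dtype=torch.bool)
    hard.scatter_(1, flat.topk(k, dim=1).indices, True)
    return hard.view(T, H, W)  # True = masked; the complement is visible to the encoder
```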
With improved accuracy, MGMAE proves to be a more effective video representation learner. It benefits greatly from the harder task constructed with the motion-guided masking strategy.
Ren et al. in ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning, 2024 address the limitations of cube embeddings by proposing an autoregressive method termed ARVideo. Cube embeddings, such as those adopted by VideoMAE and MGMAE, often fail to encapsulate the rich semantics of a video: individual cubes, much like raw pixels, carry far less semantic content than words do in language.
To address these limitations, the authors propose a novel autoregressive paradigm with two key design elements: grouping video tokens into spatiotemporal clusters and randomizing the autoregressive prediction order.
They extend the Generative Pretrained Transformer (GPT) framework, which autoregressively predicts the next element given all preceding ones by minimizing the negative log-likelihood with respect to the model parameters. However, simply extending this framework to videos faces significant challenges, primarily due to the added temporal dimension. Moreover, pixels as autoregressive elements lack the semantic richness of words in language, further necessitating pixel-grouping strategies to enhance representation learning.
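Concretely, for a sequence of autoregressive elements x_1, …, x_N (pixels, patches, or clusters), the pre-training objective minimizes the negative log-likelihood:

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right)
```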
ARVideo strategically groups spatially neighboring tokens and temporally adjacent tokens into non-overlapping spatiotemporal clusters, which serve as the autoregressive prediction elements. The authors also apply a random rasterization approach that scrambles the prediction order of the clusters during autoregressive pretraining. Such flexibility in the autoregressive prediction order not only captures the inherent multidimensionality of video data more effectively but also fosters a richer, more comprehensive video representation.
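A minimal sketch of the clustering and the randomized prediction order, assuming tokens are laid out on a T×H×W grid (cluster sizes and names are illustrative):

```python
import torch

def spatiotemporal_clusters(tokens, t, h, w, ct=2, ch=2, cw=2):
    """Group video tokens into non-overlapping spatiotemporal clusters.

    tokens: (t*h*w, dim) tokens in raster order; ct/ch/cw are the cluster sizes
    along the temporal, height, and width axes.
    Returns: (num_clusters, ct*ch*cw, dim)
    """
    d = tokens.shape[-1]
    x = tokens.view(t, h, w, d)
    x = x.view(t // ct, ct, h // ch, ch, w // cw, cw, d)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)     # cluster-grid dims first, intra-cluster dims next
    return x.reshape(-1, ct * ch * cw, d)

def random_prediction_order(num_clusters):
    """Scramble the autoregressive prediction order of the clusters."""
    return torch.randperm(num_clusters)

# During pre-training, cluster order[i] is predicted from clusters order[:i].
```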
When trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 while demonstrating higher training efficiency. ARVideo trains 14% faster and requires 58% less GPU memory compared to VideoMAE.
In conclusion, self-supervised learning for video understanding has made significant strides in recent years, addressing the unique challenges posed by the multidimensional nature of video data. From the foundational work of VideoMAE to innovative approaches like VideoMAEv2, MGMAE, and ARVideo, researchers have tackled issues such as temporal redundancy, information leakage, and the need for more efficient and effective representation learning.
The methods presented in this post demonstrate how self-supervised learning can be adapted to the video domain by employing strategies that exploit the spatio-temporal structure of videos. This not only results in more generalized models but also reduces the compute requirements for training video-based models significantly.
Saurav,
Machine Learning Advocate Engineer
lightly.ai