DINOv2 is Meta AI’s 2023 self-supervised Vision Transformer model. It learns powerful visual features from 142M unlabeled images. These features work across many vision tasks—no fine-tuning needed.
TL;DR
What is DINOv2? DINOv2 is a self-supervised Vision Transformer model (developed by Meta AI in 2023) that learns robust visual features from unlabeled images. It’s essentially a foundation model for computer vision producing universal image features that can be used across many tasks without fine-tuning.
What does DINOv2 improve? DINOv2 builds on the original DINO method by scaling up training and architecture – it was trained on 142 million diverse images (vs. smaller datasets before) and includes new training innovations. Its features achieve state-of-the-art performance on various vision benchmarks without requiring labeled data or task-specific fine-tuning, surpassing even some supervised or text-supervised approaches.
How is DINOv2 trained? It uses a self-distillation approach: a large Vision Transformer (ViT) teacher model (with ~1 billion parameters) is first trained on the images, then a student ViT is trained to match the teacher’s outputs (knowledge distillation). Training leveraged massive scale (fully sharded training across many GPUs, FlashAttention for efficient ViT attention, and large batch sizes) and an automatic data pipeline to curate a balanced dataset from uncurated web images.
What tasks does DINOv2 excel at? Because it learns general-purpose visual features, DINOv2 works impressively well on a wide range of vision tasks: image classification (competitive ImageNet accuracy with just a linear classifier), object detection, semantic segmentation, monocular depth estimation, image/instance retrieval, and even transfer to video understanding tasks. Notably, it produces high-quality segmentation maps and depth predictions without any supervised training, often matching or beating task-specific models. In short, DINOv2’s pre-trained features can be plugged into numerous computer vision tasks and deliver top-tier results.
Foundation models have become a cornerstone of modern deep learning.
While pretraining language models for natural language processing has more or less become standardized, pretraining models on images to produce general-purpose visual features has historically proven challenging.
In this post we’ll cover:
The Original DINO and its impact: DINOv2 vs. DINO
What is DINOv2 and how it works
Training process and data pipeline
Why DINOv2 matters
LightlyTrain + DINOv2 in action
Challenges and Limitations of DINOv2
We will also look at how to pretrain models yourself and provide code samples.
Pro tip: If you're looking to experiment with self-supervised learning or pretrain your own foundation models, LightlyTrain makes it easy to get started—with support for methods like DINOv2 out of the box.
The Original DINO and Its Impact: DINOv2 vs. DINO
Vision Transformers (ViTs) are now the standard backbone for almost all vision tasks, but this was not always the case. Compared to traditional architectures such as convolutional neural networks, vision transformers are more computationally demanding and data hungry. Moreover, with classical methods from the self-supervised literature, vision transformers did not offer any significant improvements or unique properties over convolutional neural networks.
The original DINO (self-DIstillation with NO labels) paper studied the impact of self-supervised pretraining on ViT features and asked whether the muted success of transformers in vision could be explained by the use of supervision in their pretraining.
DINO proposed a simple training algorithm: in a knowledge-distillation setup, a student network learns to predict the outputs of a teacher network (built with a momentum encoder, as in MoCo) using a standard cross-entropy loss.
Figure 1: The original DINO.
Let us briefly review the technical contributions of the DINO framework:
DINO employs knowledge distillation as its base training paradigm. In knowledge distillation, a student network is trained to match the output of a given teacher network. The idea is that training a smaller model to mimic a larger one compresses it, because the teacher’s soft probabilities carry far more information than simple class labels.
These soft probabilities from the teacher network are then centered using a running mean and sharpened with a low softmax temperature.
Moreover, DINO employs a form of dynamic distillation in which the teacher is built on the fly during training. Only the parameters of the student model are learned; the teacher’s weights are an exponential moving average of past student iterates.
DINO is trained without any labels: the loss is a cross-entropy between the teacher’s and the student’s output distributions, computed across different augmented views (global and local crops) of the same image. A simplified sketch of this update is shown below.
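Here is a minimal sketch of a DINO-style update in PyTorch. The function names, temperatures, and momentum value are illustrative assumptions rather than the reference implementation, which also handles multi-crop batching and updates the center with a running mean of teacher outputs.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # Teacher probabilities: centered (to prevent collapse) and sharpened
    # with a low temperature; detached so no gradient flows to the teacher.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    # Student log-probabilities at a higher temperature.
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    # Cross-entropy between the teacher and student output distributions.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # The teacher's weights are an exponential moving average of the student's.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p.detach(), alpha=1 - momentum)
```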
What Is DINOv2 and How It Works
DINOv2 (Learning Robust Visual Features without Supervision) extends this line of work by asking whether self-supervised learning can produce general-purpose visual features when pretrained on a large quantity of curated data.
The authors combine the DINO framework with ideas from existing approaches such as iBOT and SwAV to design a more stable and faster discriminative training algorithm that excels when scaled in data and model size. They report 2x faster training and 3x lower memory consumption, which allows longer training runs with larger batch sizes to learn robust visual features.
Architecture
DINOv2 differs from the original in the following ways:
Patch-Level Objective: The direct translation of masked language modelling to vision is Masked Image Modelling (MIM), in which random patches of an image are masked and the model is trained to fill them in. DINOv2 employs this technique in a knowledge-distillation setting, inspired by iBOT: some input patches to the student network are randomly masked, while the teacher sees the unmasked image. Although some previous works report that sharing parameters between the DINO and iBOT heads improves performance, the authors observe that at scale the opposite is true, so they use separate DINO and iBOT heads.
Sinkhorn-Knopp Centering from SwAV: A recent ICLR paper by Ruan et al. empirically recommends replacing the teacher softmax-centering step of DINO and iBOT with the Sinkhorn-Knopp batch-normalization step from SwAV. Because DINO performs dynamic/online knowledge distillation, the standard Sinkhorn-Knopp algorithm from the optimal-transport literature cannot be applied directly, so the online variant proposed in SwAV is used instead. For a full overview of the algorithm, please refer to the original paper.
KoLeo Regularizer: The authors draw inspiration from the differential-entropy literature and use the KoLeo regularizer, which has also been used in the similarity-search literature to produce a uniformly distributed embedding space (for more details, see Spreading Vectors for Similarity Search). A simplified sketch of the regularizer follows this list.
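Below is a minimal PyTorch sketch of the KoLeo idea described above, written for illustration only; the exact batching and weighting used in DINOv2 may differ.

```python
import torch
import torch.nn.functional as F

def koleo_loss(features, eps=1e-8):
    # L2-normalize the features so the regularizer acts on their directions.
    x = F.normalize(features, dim=-1)
    # Pairwise distances within the batch; ignore self-distances on the diagonal.
    dists = torch.cdist(x, x)
    dists.fill_diagonal_(float("inf"))
    # Distance from each sample to its nearest neighbor in the batch.
    nearest = dists.min(dim=-1).values
    # Penalizing small nearest-neighbor distances spreads the embeddings uniformly.
    return -torch.log(nearest + eps).mean()
```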
The authors also note that the key to better performance on pixel-level downstream tasks such as segmentation and detection, where small objects tend to disappear at low resolution, is to use larger image resolutions.
However, instead of training at high resolution throughout, which would make runs longer and more memory-intensive, they increase the resolution only during a short period at the end of pretraining.
Training Process and Data Pipeline
Let’s take a closer look at how DINOv2 was trained—from assembling a massive, diverse dataset to leveraging a fully self-supervised training pipeline without relying on text labels.
Training Data Scale and Sources: The authors assembled their own diverse dataset, LVD-142M, by retrieving images from a large pool of uncurated data that are close to images in curated datasets such as ImageNet-22k and Google Landmarks. These uncurated sources are publicly available repositories of crawled web data; in particular, image URLs are extracted from HTML <img> tags.
Automatic Data Curation Pipeline: While curating the dataset, the authors discard image URLs that are unsafe or restricted by domain. They then pre-process the data with standard techniques such as PCA-hash deduplication, NSFW filtering, and blurring of identifiable faces, and they employ the copy-detection pipeline of SSCD (Self-Supervised Descriptor for Image Copy Detection) to remove near-duplicate images. A major challenge with images from the wild is rebalancing concepts so that the model does not overfit to a few dominant modes; a simplified illustration of the similarity-based retrieval step is shown below.
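As an illustration only, the snippet below sketches the retrieval idea behind the curation pipeline: embed curated and uncurated images and keep uncurated ones that are close to curated ones in embedding space. The threshold and function name are assumptions; the actual pipeline relies on large-scale nearest-neighbor search and clustering rather than a dense similarity matrix.

```python
import torch
import torch.nn.functional as F

def retrieve_similar(uncurated_emb, curated_emb, threshold=0.6):
    # Both inputs are (N, D) tensors of image embeddings.
    u = F.normalize(uncurated_emb, dim=-1)
    c = F.normalize(curated_emb, dim=-1)
    # Cosine similarity of every uncurated image to every curated image.
    sims = u @ c.T
    # Keep an uncurated image if its best match exceeds the (illustrative) threshold.
    return sims.max(dim=-1).values > threshold
```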
Self-Supervised Training (No Labels): Most traditional methods for pretraining vision foundation models use some form of text-guided pretraining, i.e. textual supervision in which captions guide the training of image features. This limits the information that can be retained, since captions only coarsely approximate the content of an image, and pixel-level information goes unused. Such methods also need a text encoder and aligned text-image pairs. Self-supervised training, in contrast, relies on similarities between images themselves rather than on external metadata such as captions. One can therefore avoid manual annotation altogether and learn from raw images alone, which offers more flexibility than text-guided counterparts.
Comparison: Self-Supervised vs. Weakly-Supervised (CLIP) Features: The authors also compared DINOv2 against state-of-the-art weakly supervised models. They report that DINOv2 surpasses OpenCLIP and EVA-CLIP in linear-evaluation performance with large ViT models, and that it scores higher on held-out test sets, suggesting better generalization through robust visual features.
Figure 2: Weakly supervised vs. Self-supervised.
Frozen Features and Versatile Outputs: Pretraining of this form excels at learning transferable “frozen features”: once pretrained, the backbone can be reused across tasks without modification. One can simply put a linear probe on top of these frozen features and apply them to a wide range of tasks, as sketched below.
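As a sketch of how frozen features are typically used, the snippet below loads a pretrained DINOv2 backbone through torch.hub from the official facebookresearch/dinov2 repository, freezes it, and attaches a linear probe. The number of classes and the dummy batch are placeholders; a real setup would train only the linear layer on your labeled data.

```python
import torch
import torch.nn as nn

# Load a pretrained DINOv2 ViT-S/14 backbone (weights download on first use).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # keep the backbone frozen

num_classes = 10  # placeholder for your dataset
linear_probe = nn.Linear(backbone.embed_dim, num_classes)

images = torch.randn(8, 3, 224, 224)  # dummy batch; use your own dataloader
with torch.no_grad():
    features = backbone(images)  # (8, embed_dim) CLS-token features
logits = linear_probe(features)  # train only this layer with a standard classification loss
```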
How does DINOv2 stack up against other prominent foundation models in computer vision? Here we compare it with a few key models: CLIP (OpenAI’s image-text model), MAE (Masked Autoencoder), and iBOT (a previous self-supervised ViT method), among others. Each of these approaches has a different training strategy and thus different strengths. We summarize the comparisons in the table below and discuss the highlights:
Table 1: A comparison between DINOv2 and other foundation models.

MAE (Masked Autoencoder):
- Fine-tuning required to fully leverage its features (weak linear-probe performance)
- ~87% ImageNet accuracy when fine-tuned (ViT-L); excellent after fine-tuning
- Lower linear-probe accuracy (~67%); features are not directly ready for downstream tasks
- Great at low-level reconstruction, less semantic out of the box

iBOT:
- Fine-tuning helps, but also good linear performance for its time
- ~84% ImageNet linear accuracy (ViT-L pretrained on ImageNet-22k)
- Introduced masked patch prediction with a DINO-style loss
- DINOv2 builds on iBOT, adding data and training tricks to surpass it
Why DINOv2 Matters
General-Purpose “Out of the Box” Model: DINOv2 is important because it is a general-purpose model whose “frozen features” can be used for any number of tasks, such as image retrieval or depth estimation.
Strong Transfer to Downstream Tasks: The authors report strong linear-probe performance and good domain generalization. Compared to other self-supervised learning methods, DINOv2 performs strongly on semantic segmentation, depth estimation, and out-of-domain generalization.
No Labels = More Flexibility: Since DINOv2 was trained on raw images with no labels or captions, its use is not limited to image classification. The same backbone’s features can be used for classification with custom label sets or for video recognition tasks.
Robust Semantic and Visual Features: Compared to fine-tuned architectures, DINOv2 performs on par on semantic segmentation while greatly simplifying the model setup. Moreover, the authors report that DINOv2’s frozen features outperform state-of-the-art self-supervised and weakly supervised features, suggesting a robust visual understanding.
Reduction in Fine-Tuning Costs: DINOv2 combines many tricks learned from other SSL methods and is more stable at scale. This allows 2x faster training and 3x lower memory consumption, enabling longer training runs with larger batch sizes.
LightlyTrain + DINOv2 in Action
Sounds interesting? Want to pretrain a DINOv2 model on your own dataset?
Thanks to LightlyTrain you can pretrain a DINOv2 model in 4-5 lines of code, as in the sketch below!
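Here is a minimal sketch based on the LightlyTrain documentation. The model and method identifiers are assumptions; check the Supported Models page for the exact strings your LightlyTrain version accepts.

```python
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/dinov2_pretraining",  # output directory for checkpoints and logs
        data="my_image_dir",           # folder of unlabeled images
        model="dinov2/vitb14",         # assumed identifier for a DINOv2 ViT-B/14
        method="dinov2",               # assumed method name for DINOv2 pretraining
    )
```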
LightlyTrain supports a large number of models from various libraries as well. For a full list please refer to the Supported Models section of the documentation. The DINOv2 method in LightlyTrain applies a set of standard image augmentations, which can be found here.
LightlyTrain also supports the classical DINO paradigm, which you can use by specifying the method as “dino” (see the snippet below). Want to try another classical self-supervised paradigm? LightlyTrain also supports SimCLR. For a detailed guide on which method might suit you best, please refer to the following guide.
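For example, switching to the classical DINO objective only requires changing the method argument (the backbone identifier below is again an assumption):

```python
import lightly_train

# Same call as above, but with the classical DINO objective.
lightly_train.train(
    out="out/dino_pretraining",
    data="my_image_dir",
    model="torchvision/resnet50",  # assumed identifier; any supported backbone works
    method="dino",                 # or "simclr" for the SimCLR objective
)
```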
Figure 4: LightlyTrain loss.
Another feature of LightlyTrain worth considering is its built-in support for logging platforms such as TensorBoard and Weights & Biases. LightlyTrain supports the following loggers:
jsonl
Tensorboard
Weights & Biases
Let’s look into using Weights & Biases as the logging platform of choice. You can enable it using the “loggers” argument, as in the sketch below.
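A minimal sketch of enabling the Weights & Biases logger is given below. The nested option names (project, name) are assumptions modeled on typical W&B configuration, so double-check them against the loggers section of the LightlyTrain docs.

```python
import lightly_train

lightly_train.train(
    out="out/dinov2_pretraining",
    data="my_image_dir",
    model="dinov2/vitb14",  # assumed identifier, as above
    method="dinov2",
    loggers={
        "wandb": {  # enable the Weights & Biases logger
            "project": "dinov2-pretraining",  # assumed option names
            "name": "vitb14-run-1",
        },
    },
)
```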
For more details please refer to the loggers section of the LightlyTrain docs.
You can also find a video guide on using LightlyTrain to pretrain a model with DINOv2 here.
Challenges and Limitations
While DINOv2 is a major leap forward, it’s not without limitations and challenges. It’s important to recognize these, both to set proper expectations and to guide future improvements:
When evaluated for geographical fairness on the Dollar Street dataset, the authors report a 25.7% drop in performance for images from regions in Africa compared to Europe, suggesting that the model is still biased towards Western countries.
DINOv2 also performs better on images from high-income households than from low-income households, with a 31.7% gap on the Dollar Street dataset.
Conclusion
In this article we went over the DINOv2 framework as introduced in the paper Learning Robust Visual Features without Supervision by Oquab et al. Building on their previous work DINO (Emerging Properties in Self-Supervised Vision Transformers), the authors combine contributions from different techniques such as iBOT and SwAV to accelerate and stabilize the training of self-supervised vision foundation models. They also propose an automatic data gathering and curation pipeline to build a diverse, curated image dataset.
In contrast to the text-guided pretraining pipelines of prior work, which relied on external text metadata such as labels or captions to approximate the information in an image, the authors develop a discriminative self-supervised technique that works with raw images alone. This removes the need for a text encoder and aligned text-image pairs, and reduces memory consumption.
The authors focus on enabling stable training at scale and report that the DINOv2 family of models drastically improves over previous state-of-the-art self-supervised and weakly supervised models. For faster training, they implemented their own version of FlashAttention to improve memory usage and speed in the self-attention layers, and they parameterise the models slightly differently to optimise performance on their hardware. They also implemented an improved version of stochastic depth that skips the computation of dropped residual blocks rather than masking their output. These efficient implementations further improve memory utilisation.
DINOv2 thus presents itself as a self-supervised model for generating all-purpose visual features, with remarkable properties such as an understanding of object parts and scene geometry. Because it provides high-quality frozen features that work with classifiers as simple as linear layers, it can readily be used for any number of downstream computer vision tasks. This shows that, with enough curated data and a well-designed distillation pipeline, existing pretraining methods can be scaled to train larger models whose features, without any fine-tuning, reach performance comparable to large ViT models trained with traditional supervised methods. Compared to similar foundation models, these features are also better in both quality and resulting performance.
Get Started with Lightly
Talk to Lightly’s computer vision team about your use case.