DINOv2 is a self-supervised Vision Transformer model (developed by Meta AI in 2023) that learns robust visual features from unlabeled images. It’s essentially a foundation model for computer vision producing universal image features that can be used across many tasks without fine-tuning.
DINOv2 builds on the original DINO method by scaling up training and architecture – it was trained on 142 million diverse images (vs. smaller datasets before) and includes new training innovations. Its features achieve state-of-the-art performance on various vision benchmarks without requiring labeled data or task-specific fine-tuning, surpassing even some supervised or text-supervised approaches.
It uses a self-distillation approach: a large Vision Transformer (ViT) teacher model (with ~1 billion parameters) is first trained on the images, then smaller student ViTs are trained to match the teacher’s outputs (knowledge distillation); a small sketch of this step follows the overview below. Training leveraged massive scale (fully sharded training across many GPUs, FlashAttention for efficient ViT attention, large batch sizes) and an automatic data pipeline to curate a balanced dataset from uncurated web images.
Because it learns general-purpose visual features, DINOv2 works impressively well on a wide range of vision tasks: image classification (competitive ImageNet accuracy with just a linear classifier), object detection, semantic segmentation, monocular depth estimation, image/instance retrieval, and even transfer to video understanding tasks. Notably, it produces high-quality segmentation maps and depth predictions without any supervised training, often matching or beating task-specific models. In short, DINOv2’s pre-trained features can be plugged into numerous computer vision tasks and deliver top-tier results.
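To make the distillation step mentioned above concrete, here is a minimal PyTorch sketch of training a smaller student to match a frozen teacher’s output distribution with a cross-entropy loss. The single-view setup, temperature, and function names are illustrative assumptions, not the exact DINOv2 recipe.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, optimizer, temperature=0.1):
    # Illustrative sketch, not the exact DINOv2 recipe: the student is trained to
    # match the (softened) output distribution of a frozen teacher on unlabeled images.
    with torch.no_grad():
        targets = F.softmax(teacher(images) / temperature, dim=-1)   # teacher "soft labels"
    log_probs = F.log_softmax(student(images) / temperature, dim=-1)
    loss = -(targets * log_probs).sum(dim=-1).mean()                 # cross-entropy vs. teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()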
Foundation models have become a cornerstone of modern deep learning.
While pretraining language models for natural language processing has more or less become standardized, pretraining models on images to produce general-purpose visual features has proven challenging in the past.
In this post we’ll cover:
We’ll also look into pretraining models yourself and provide code samples.
Pro tip: If you're looking to experiment with self-supervised learning or pretrain your own foundation models, LightlyTrain makes it easy to get started—with support for methods like DINOv2 out of the box.
Vision Transformers (ViTs) are now the standard backbone for almost all vision tasks, but this was not always the case. Compared to traditional approaches such as convolutional neural networks, vision transformers are more computationally demanding and data-hungry. Moreover, with classical methods from the self-supervised literature, using vision transformers instead of convolutional neural networks didn’t offer any significant improvements or unique properties.
The original DINO (self-DIstillation with NO labels) paper studied the impact of self-supervised pretraining on ViT features and questioned whether the muted success of transformers in vision can be explained by the use of supervision in their pretraining.
DINO proposes a simple training algorithm: in a knowledge-distillation setup, a student network learns to predict the outputs of a teacher network (built as a momentum encoder, as in MoCo) using a standard cross-entropy loss.
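As a rough sketch of this idea (simplified to a single pair of views, not the reference implementation), the student is optimized with a cross-entropy loss against the centered and sharpened outputs of a momentum (EMA) teacher:

import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    # Sharpen the teacher output with a low temperature and subtract a running
    # center to avoid collapse, then use it as the target distribution.
    targets = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    # The teacher is a momentum encoder: an exponential moving average of the student.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)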
Let us briefly review the technical contributions of the DINO framework:
DINOv2 (Learning Robust Visual Features without Supervision) extends this study by asking if Self-Supervised Learning has the potential to learn general purpose visual features if pretrained on a large quantity of curated data.
They combine the DINO framework with existing approaches such as iBOT and SwAV to design a more stable and faster discriminative training algorithm that excels when scaled up in data and model size. They report 2x faster training and 3x lower memory consumption, allowing them to run longer training with larger batch sizes and learn more robust visual features.
Architecture
DINOv2 differs from the original in the following ways:
The authors also note that the key to increasing performance on pixel-level downstream tasks such as segmentation or detection, where small objects tend to disappear at lower resolutions, is to use larger image resolutions.
However, instead of using larger images throughout training, which would result in longer and more memory-intensive runs, they increase the resolution only during a short period at the end of training.
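A hedged sketch of what such a schedule could look like in a training loop; the resolutions and the length of the high-resolution phase are illustrative placeholders, not the paper’s exact values:

def image_resolution(step, total_steps, base_res=224, high_res=518, high_res_fraction=0.05):
    # Train at the base resolution for most of the run and switch to a higher
    # resolution only for the last few percent of steps (placeholder values).
    if step >= int(total_steps * (1 - high_res_fraction)):
        return high_res
    return base_res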
Let’s take a closer look at how DINOv2 was trained—from assembling a massive, diverse dataset to leveraging a fully self-supervised training pipeline without relying on text labels.
Frozen Features and Versatile Outputs: pretraining of this form is great at learning transferable “frozen features,” i.e., once pretrained, the features can be reused across different tasks since the model acts as a reliable backbone. One can simply put a linear probe on top of these frozen features and use them for various tasks.
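For example, a linear probe on frozen DINOv2 features could look like the sketch below. It assumes the pretrained backbones that Meta AI publishes via torch.hub and a 384-dimensional embedding for ViT-S/14; double-check the entry-point names and dimensions against the official facebookresearch/dinov2 repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen DINOv2 backbone (assumes the official torch.hub entry point).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

probe = nn.Linear(384, 1000)  # 384-dim ViT-S/14 features -> e.g. 1000 classes
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(images, labels):
    with torch.no_grad():
        feats = backbone(images)          # frozen global image features
    loss = F.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()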
How does DINOv2 stack up against other prominent foundation models in computer vision? Here we compare it with a few key models: CLIP (OpenAI’s image-text model), MAE (Masked Autoencoder), and iBOT (a previous self-supervised ViT method), among others. Each of these approaches has a different training strategy and thus different strengths. We summarize the comparisons in the table below and discuss the highlights:
Why DINOv2 Matters
Sounds interesting? Want to pre-train a DINOv2 model on your own dataset?
Thanks to LightlyTrain you can pre-train a DINOv2 model in 4-5 lines of code!
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="output",                   # directory for checkpoints and logs
        data="path_to_my_images",       # folder with your unlabeled images
        model="torchvision/resnet18",   # student model to pretrain
        method="distillation",          # knowledge-distillation pretraining method
    )
LightlyTrain supports a large number of models from various libraries as well. For a full list, please refer to the Supported Models section of the documentation. DINOv2 in LightlyTrain uses a set of standard image augmentations, which can be found here.
LightlyTrain also supports the classical DINO paradigm, which you can use by specifying the method as “dino”. Want to try another classical self-supervised paradigm? LightlyTrain also supports SimCLR. For a detailed guide on which method might be best for you, please refer to the following guide.
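For example, switching to the classical DINO method only requires changing the method argument (paths are placeholders):

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="output_dino",              # separate output directory for this run
        data="path_to_my_images",
        model="torchvision/resnet18",
        method="dino",                  # classical DINO instead of distillation
    )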
Another feature of LightlyTrain definitely worth considering is the built-in support for logging platforms such as TensorBoard and Weights & Biases. LightlyTrain supports the following loggers:
Let’s look at using Weights & Biases as the logging platform of choice. You can specify the logger using the “loggers” argument.
loggers={
    "wandb": {
        "project": "lightly-train",
        "log_model": False,
    },
}
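Put together with the earlier training call, this could look as follows (project name and paths are placeholders):

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="output",
        data="path_to_my_images",
        model="torchvision/resnet18",
        method="distillation",
        loggers={
            "wandb": {
                "project": "lightly-train",   # W&B project to log runs under
                "log_model": False,           # don't upload model checkpoints to W&B
            },
        },
    )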
For more details please refer to the loggers section of the LightlyTrain docs.
You can also find a video guide on using LightlyTrain to pretrain a model with DINOv2 here.
While DINOv2 is a major leap forward, it’s not without limitations and challenges. It’s important to recognize these, both to set proper expectations and to guide future improvements:
In this article we went over the DINOv2 framework as introduced in the paper Learning Robust Visual Features without Supervision by Oquab et al. Building on their previous work on DINO (Emerging Properties in Self-Supervised Vision Transformers), the authors combine contributions from different techniques such as iBOT and SwAV to accelerate and stabilize the training of self-supervised foundation models for vision. They also propose an automatic pipeline for data gathering and curation to build a diverse, curated image dataset.
As opposed to the text-guided pre-training pipeline followed by prior works, which relied on external text metadata such as labels or captions to approximate the information in the image, the authors develop a discriminative self-supervised technique that works on raw image samples alone. This removes the need for the text encoders traditionally used to align text-image pairs, thereby reducing memory consumption.
The authors focus on enabling stable training at scale and report how the DINOv2 family of models drastically improves over previous state-of-the-art self-supervised and weakly supervised models. For faster training, they implemented their own version of FlashAttention to improve memory usage and speed in the self-attention layers. Moreover, they parameterise the models slightly differently to optimise performance on their hardware. They also implemented an improved version of stochastic depth that skips the computation of dropped residuals rather than masking them out. These efficient implementations allow them to further improve memory utilisation.
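As a rough illustration of that stochastic-depth trick (skipping the residual computation instead of masking it out), a simplified version might look like the sketch below; the function name and shapes are assumptions, not the authors’ implementation.

import torch

def drop_add_residual_stochastic_depth(x, residual_fn, drop_prob=0.2):
    # x: (B, N, D) token embeddings; residual_fn: a block such as attention or MLP.
    # Instead of computing the residual branch for every sample and masking the
    # dropped ones, only run the branch on the subset of samples that keep it.
    b = x.shape[0]
    keep = max(int(b * (1 - drop_prob)), 1)
    idx = torch.randperm(b, device=x.device)[:keep]
    branch = residual_fn(x[idx])                 # computed only for kept samples
    out = x.clone()
    out[idx] = out[idx] + branch * (b / keep)    # rescale to preserve the expected residual
    return out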
DINOv2 thus presents itself as a self-supervised model for generating all-purpose features with remarkable properties, such as an understanding of object parts and scene geometry. Moreover, since DINOv2 offers high-quality frozen features that can be used with classifiers as simple as linear layers, it readily lends itself to any number of downstream computer vision tasks. This further shows that, with enough curated data and a functional distillation pipeline, existing pretraining methods can be improved to train larger models that require no fine-tuning yet produce features whose performance is comparable to that of large models trained with traditional supervised methods. Compared to similar foundation models, these features are better in both quality and resulting performance.