Understanding DINOv2: Engineer's Deep Dive
Foundation models have become a cornerstone of modern deep learning.
While pretraining language models for natural language processing has more or less become standardized, pretraining models on images to produce general-purpose visual features has historically proven challenging.
In this post we’ll cover:
- The Original DINO and its impact: DINOv2 vs. DINO
- What is DINOv2 and how it works
- Training process and data pipeline
- Why DINOv2 matters
- LightlyTrain + DINOv2 in action
- Challenges and Limitations of DINOv2
We'll also look at model pretraining in practice and provide code samples.
Pro tip: If you're looking to experiment with self-supervised learning or pretrain your own foundation models, LightlyTrain makes it easy to get started—with support for methods like DINOv2 out of the box.
The Original DINO and Its Impact: DINOv2 vs. DINO
Vision Transformers (ViTs) are now the standard backbone for almost all vision tasks, but this was not always the case. Compared to traditional convolutional neural networks, vision transformers are more computationally demanding and data hungry. Moreover, with classical methods from the self-supervised literature, using vision transformers instead of convolutional neural networks did not offer any significant improvements or unique properties.
The original DINO (self-DIstillation with NO labels) paper studied the impact of self-supervised pretraining on ViT features and asked whether the muted success of transformers in vision could be explained by the use of supervision in their pretraining.
DINO proposed a simple training algorithm in which, within a knowledge-distillation setup, a student network is trained to predict the output of a teacher network (built with a momentum encoder, as in MoCo) using a standard cross-entropy loss.

Let us briefly review the technical contributions of the DINO framework:
- DINO employs knowledge distillation as its base training paradigm. In knowledge distillation, a student network is trained to match the output of a given teacher network; the classic motivation is model compression, since the teacher's soft probabilities carry far more information than simple class labels.
- These soft probabilities from the teacher network are then centered (using a running mean) and sharpened via temperature scaling.
- Moreover, DINO employs a form of dynamic distillation wherein the teacher is built on the fly during training: only the parameters of the student are learned, while those of the teacher are an exponential moving average over past student iterates.
- The training objective is a cross-entropy loss that aligns the teacher and student output distributions computed on different augmented views of the same image; no labelled classification loss is involved. A minimal sketch of this setup follows this list.
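Below is a minimal PyTorch sketch of this self-distillation setup. The function names, temperatures, and momentum value are illustrative assumptions, not the exact recipe from the paper.

```python
# Minimal sketch of DINO-style self-distillation (PyTorch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    # Teacher parameters are an exponential moving average of the student's.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    # Teacher outputs are centered and sharpened with a low temperature,
    # student outputs use a higher temperature; cross-entropy aligns the two.
    # `center` is maintained as an EMA of teacher outputs (update not shown).
    teacher_probs = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```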
What Is DINOv2 and How It Works
DINOv2 (Learning Robust Visual Features without Supervision) extends this line of work by asking whether self-supervised learning can produce general-purpose visual features when pretrained on a large quantity of curated data.
They combine the DINO framework with existing approaches such as iBOT and SwAV to design a more stable and faster discriminative training algorithm that excels when scaled in data and model size. They report 2x faster training and 3x lower memory consumption, which allows longer training runs with larger batch sizes to learn robust visual features.
Architecture
DINOv2 differs from the original in the following ways:
- Patch-Level Objective: A direct translation of masked language modelling to vision is masked image modelling (MIM), wherein random patches of an image are masked and the model is trained to reconstruct them. DINOv2 employs this technique in a knowledge-distillation setting inspired by iBOT: some input patches given to the student network are randomly masked, while the teacher receives the unmasked image. While some previous works reported that sharing parameters between the DINO and iBOT heads leads to better performance, the authors note that at scale the opposite is true, so they use separate DINO and iBOT heads.
- Sinkhorn-Knopp Centering from SwAV: In a recent ICLR paper, Ruan et al. empirically recommend replacing the teacher softmax + centering step of DINO and iBOT with the Sinkhorn-Knopp batch-normalization technique from SwAV. Since DINO employs a form of dynamic/online knowledge distillation, the standard Sinkhorn-Knopp algorithm from the optimal-transport literature does not directly apply, so the online variant proposed in SwAV is used. For a full overview of the algorithm, please refer to the original paper.
- KoLeo Regularizer: The authors draw inspiration from the differential-entropy literature and use the KoLeo regularizer, which has also been used in the similarity-search literature to encourage a uniformly spread embedding space (for more details, see Spreading Vectors for Similarity Search). A minimal sketch of the Sinkhorn-Knopp centering and the KoLeo regularizer follows this list.
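The following sketch shows, under stated assumptions, what these two components can look like in PyTorch: a SwAV-style Sinkhorn-Knopp normalization of the teacher outputs and a KoLeo-style nearest-neighbor spreading term. The epsilon values, iteration count, and exact formulation are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn_knopp(teacher_logits, eps=0.05, n_iters=3):
    # Batch-level normalization of teacher outputs (SwAV-style), used in place
    # of the softmax + centering step.
    Q = torch.exp(teacher_logits / eps).t()   # (prototypes, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)       # normalize rows (prototypes)
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)       # normalize columns (samples)
        Q /= B
    return (Q * B).t()                        # each sample's row sums to 1

def koleo_loss(features, eps=1e-8):
    # Kozachenko-Leonenko style regularizer: spread features out by
    # maximizing the log distance to each sample's nearest neighbor.
    x = F.normalize(features, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-2.0)                  # exclude self-matches
    nn_sim, _ = sim.max(dim=1)                # nearest neighbor by cosine sim
    nn_dist = torch.sqrt(torch.clamp(2.0 - 2.0 * nn_sim, min=eps))
    return -torch.log(nn_dist + eps).mean()
```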
The authors also note that larger image resolutions are key to strong performance on pixel-level downstream tasks such as segmentation and detection, where small objects tend to disappear at lower resolutions.
However, instead of using larger images throughout training, which would mean longer and more memory-intensive runs, they increase the resolution only during a short period at the end of training, as sketched below.
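A simple way to picture this schedule is to train mostly at a lower resolution and switch the augmentation pipeline to a higher resolution for the last few epochs. The resolutions and epoch split below are assumptions for illustration, not the paper's exact recipe.

```python
from torchvision import transforms

def make_augmentation(image_size):
    # Standard crop-based augmentation at the given resolution.
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

total_epochs, high_res_epochs = 100, 5  # assumed values
for epoch in range(total_epochs):
    # Low resolution for most of training, high resolution for a short final phase.
    image_size = 224 if epoch < total_epochs - high_res_epochs else 416
    augmentation = make_augmentation(image_size)
    # ... rebuild the dataloader with `augmentation` and train one epoch ...
```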
Training Process and Data Pipeline
Let’s take a closer look at how DINOv2 was trained—from assembling a massive, diverse dataset to leveraging a fully self-supervised training pipeline without relying on text labels.
- Training Data Scale and Sources: The authors assembled their own diverse dataset, LVD-142M, by retrieving images from a “large pool of uncurated data” that are close to those in curated datasets such as ImageNet-22k and Google Landmarks. These uncurated sources are publicly available repositories of crawled web data; in particular, image URLs are extracted from HTML <img> tags.
- Automatic Data Curation Pipeline: While curating the dataset, the authors discard image URLs that are unsafe or restricted by their domains, and post-process the downloaded images with standard steps such as PCA-hash deduplication, NSFW filtering, and blurring of identifiable faces. They also employ the copy-detection pipeline of SSCD (Self-Supervised Descriptor for Image Copy Detection) to remove near-duplicate images. A major challenge when dealing with images from the wild is rebalancing concepts to avoid overfitting on a few dominant modes (a simplified sketch of the retrieval-and-deduplication idea follows this list).
- Self-Supervised Training (No Labels): Most traditional methods for pretraining vision foundation models use a form of text-guided pretraining, i.e. textual supervision from captions to guide the training of image features. This limits the information that can be retained, since captions only weakly approximate the information present in an image, and pixel-level information is not used at all. These methods also require image encoders that align text with images. Self-supervised training instead relies on similarities between the images themselves rather than external metadata, so manual annotation can be avoided altogether and raw data used directly, offering more flexibility than text-guided counterparts.
- Comparison: Self-Supervised vs. Weakly-Supervised (CLIP) Features: The authors also compared DINOv2 against state-of-the-art weakly supervised models. They report that DINOv2 surpasses OpenCLIP and EVA-CLIP in linear-evaluation performance with large ViT models, and that it scores higher on alternative test sets, suggesting better generalization through robust visual features.
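To make the curation idea concrete, here is a heavily simplified sketch of embedding-based retrieval plus near-duplicate removal. The plain cosine-similarity thresholds and greedy deduplication are assumptions for illustration; the actual pipeline is more involved (e.g. large-scale indexing and clustering).

```python
import torch
import torch.nn.functional as F

def curate(uncurated_emb, curated_emb, retrieval_thresh=0.5, dup_thresh=0.95):
    # Embeddings are assumed to come from a pretrained image encoder.
    u = F.normalize(uncurated_emb, dim=-1)
    c = F.normalize(curated_emb, dim=-1)

    # Retrieval: keep uncurated images similar to at least one curated image.
    sim_to_curated = (u @ c.t()).max(dim=1).values
    kept = u[sim_to_curated > retrieval_thresh]

    # Deduplication: greedily drop images that are near-duplicates of an
    # already selected image.
    selected = []
    for emb in kept:
        if all((emb @ s) < dup_thresh for s in selected):
            selected.append(emb)
    return torch.stack(selected) if selected else kept[:0]
```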

Frozen Features and Versatile Outputs: Pretraining of this form excels at learning transferable “frozen features”: once pretrained, the backbone can be reused across different tasks without fine-tuning. One can simply place a linear probe on top of these frozen features and apply them to various tasks, as in the example below.
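As a concrete illustration, here is a minimal linear-probe example using the publicly released DINOv2 ViT-S/14 backbone from torch.hub. The probe size, class count, batch, and labels are placeholder assumptions for the sketch.

```python
import torch
import torch.nn as nn

# Load the frozen DINOv2 backbone from the official hub entry point.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

probe = nn.Linear(384, 1000)  # 384 = ViT-S/14 embedding dim; 1000 classes assumed
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)    # placeholder batch
labels = torch.randint(0, 1000, (8,))   # placeholder labels

with torch.no_grad():
    features = backbone(images)         # frozen CLS-token features, shape (8, 384)

loss = nn.functional.cross_entropy(probe(features), labels)
loss.backward()                         # only the linear probe receives gradients
optimizer.step()
```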