Knowledge distillation compresses large models into smaller ones by training a student to match a teacher’s outputs. It enables fast, lightweight deployment on limited hardware with minimal accuracy loss and supports vision, language, and multi-teacher use cases.
Here’s a quick overview of the key facts about knowledge distillation and why it matters.
Knowledge distillation is a model compression technique in which a large teacher model transfers its knowledge to a smaller student model. The student is trained to mimic the teacher’s outputs (often the teacher’s probability distribution across all classes, known as soft targets) so that the smaller model achieves performance similar to the larger one’s.
The teacher-student framework involves first training a complex teacher network (e.g. a deep neural network with many parameters). Then the student network is trained on the same task, but instead of learning only from ground-truth labels, it learns to match the teacher model’s predictions (soft probabilities over classes). A special distillation loss (e.g. Kullback–Leibler divergence between the teacher and student outputs) is used alongside the regular loss to align the student’s outputs with the teacher’s.
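To make that loss concrete, here is a minimal sketch of a response-based distillation loss in PyTorch. It assumes both networks output raw logits over the same set of classes; the function name and temperature value are illustrative, not part of any particular library.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)  # student: log-probabilities
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)          # teacher: probabilities
    # 'batchmean' averages the KL over the batch; the T**2 factor keeps the
    # gradient magnitude comparable to a standard cross-entropy term.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)
```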
Knowledge distillation enables taking a large, accurate model and producing a smaller model that is faster, lighter, and suitable for deployment on resource-constrained devices (like mobile phones or embedded systems) – all while retaining close to the original accuracy. It is a form of knowledge transfer or knowledge compression that allows deep learning models to be used in real-time applications and embedded AI without heavy compute costs.
Originally applied in computer vision (e.g. image classification, recognition) and later in natural language processing, knowledge distillation is now common whenever we need to compress large and complex models. For example, big vision models or large language models can be distilled into smaller ones for faster inference. It’s useful in scenarios like deploying AI on smartphones, IoT devices, or leveraging an ensemble of models as a single compact model.
Knowledge distillation can be done in various ways: offline vs. online distillation (whether the teacher is fixed or co-trained with the student), self-distillation (the model teaches itself via its own layers), multi-teacher distillation (an ensemble of teachers), and even cross-modal distillation (teacher and student operate on different data modalities). There are also different forms of “knowledge” that can be distilled – e.g. final outputs (response-based), intermediate feature maps (feature-based), or relationships between examples (relation-based). We’ll explore all of these in detail below.
Modern deep learning models often achieve state-of-the-art performance at the cost of massive computational resources, making them impractical for real-time or edge deployment.
Knowledge distillation addresses this by transferring the capabilities of large models into smaller, efficient ones without significant loss in accuracy.
In this article, we’ll explore what knowledge distillation is, how the teacher-student training process works, the types of knowledge that can be distilled, the main distillation schemes, its applications in computer vision, and its benefits, limitations, and best practices.
And if you're curious about hands-on experimentation with knowledge distillation in computer vision, consider exploring LightlyTrain.
It enables you to pretrain vision models like DINOv2 on your own unlabeled data using self-supervised learning.
Knowledge distillation is a model compression and knowledge transfer mechanism in which a large model (the teacher) transfers the “knowledge” it has learned to a smaller model (the student). The core idea, popularized by Hinton et al. (2015), is that a compact model can be trained to reproduce the behavior and performance of a much more complex one.
In essence: a large, high-capacity teacher model provides the targets, and a smaller, faster student model is trained to reproduce them.
Key characteristics and goals:
Preserving Performance: The goal is usually to have the student model’s accuracy approach (or sometimes even exceed) the teacher’s accuracy. A well-executed distillation can yield a model that’s far more efficient in terms of speed and memory, with only a minor drop in accuracy from the teacher model – if any.
At a high level, the knowledge distillation training process involves two stages: (1) Training the teacher (if not already trained), and (2) Training the student with the teacher’s guidance. Let’s break down the typical workflow and core components:
1. Train the Teacher Model: First, you need a trained teacher model. This is often a large deep neural network (or an ensemble of models) that has high accuracy on the task. In many cases, the teacher is a pretrained model (possibly even a proprietary model or an ensemble used as a reference).
2. Obtain the Teacher’s Predictions (Soft Targets): For each example in the training set, you run the input through the teacher model to obtain its output probabilities. These probabilities become the soft targets – the teacher’s probability distribution over the classes. For example, if the task is image recognition with classes {cat, dog, rabbit}, the teacher might output [0.85, 0.10, 0.05] for a cat image.
3. Train the Student Model: The student model is then trained using a combination of:
3.1 Original Training Data & Labels: The student still learns from the true labels (hard targets) via a normal supervised loss (e.g. cross-entropy).
3.2 Teacher’s Soft Targets: In addition, the student learns from the teacher’s output distribution. We add a distillation loss term that measures how well the student’s output matches the teacher’s output for each training example. This loss encourages the student to reproduce the teacher’s probability distribution, not just get the correct output.
3.3 Temperature Scaling: A detail introduced by Hinton et al. (2015) is the use of a temperature (T) term to soften the probability distributions. Dividing the logits by a higher temperature before applying the softmax smooths the distribution (the highest probabilities are lowered and the smaller probabilities are raised), which emphasizes the relative probabilities of all classes. The student is trained to match these softened distributions: a higher temperature puts more weight on matching the teacher’s full distribution, while T = 1 recovers the normal softmax probabilities.
3.4 Loss Function: Typically, the total loss for training the student is a weighted sum of the distillation loss and the standard task loss (with hard labels). For example:
Loss = α · Distillation_Loss_KL + (1 − α) · Hard_Label_Loss
where α is a weighting factor (and the distillation loss term is often multiplied by T² to compensate for the 1/T² scaling of its gradients, depending on the formulation). This way, the student learns to balance mimicking the teacher with actually fitting the true labels; a code sketch of this combined loss follows step 4 below.
4. Result – The Distilled Student: After training, we get the distilled student model. It should ideally perform nearly as well as the teacher on the validation/test set. The student model is much smaller in size (fewer parameters, lighter architecture) and yields faster inference.
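Putting steps 2–4 together, the sketch below shows one possible student training step in PyTorch, with the teacher kept frozen (offline distillation, discussed later). The model, optimizer, and hyperparameter names (`T`, `alpha`) are placeholders, not a prescribed API.

```python
import torch
import torch.nn.functional as F

def train_student_step(student, teacher, batch, optimizer, T=4.0, alpha=0.5):
    """One distillation step: weighted sum of the soft-target (teacher) loss
    and the hard-label (ground-truth) loss, as in the formula above."""
    inputs, labels = batch
    teacher.eval()
    with torch.no_grad():                         # the teacher is fixed; no gradients needed
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Soft-target term: KL between temperature-softened distributions, scaled by T^2.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard-label term: ordinary cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```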
Intuitively, the teacher’s soft targets provide richer information compared to just labels. The student doesn’t just learn what the correct answer is, but also how the teacher would "grade" each possible answer.
This helps the student model in two ways.
First, it provides an additional training signal, especially on samples where the teacher is very confident or where classes are easily confused. Second, it acts as a form of regularization: by mimicking a stronger model, the student’s learning is guided toward a solution that generalizes similarly to the teacher’s.
In practice, knowledge distillation has been shown to improve student model generalization and even achieve higher accuracy than training the small model on data alone. This distillation process is especially powerful when you have a very large teacher model and a significantly smaller student model.
Not all of the teacher’s knowledge has to come from its final output probabilities. There are three primary ways of transferring knowledge:
This technique focuses on the final output layer of the teacher model – essentially the teacher’s predicted probabilities. The student is trained to match these output responses of the teacher. This approach treats the teacher as a “black box” that provides answers, and the student learns to imitate those answers.
Instead of only looking at outputs, feature-based methods transfer knowledge from the intermediate layers (hidden layers) of the teacher to the student. The idea is that the teacher’s layers learn rich feature representations of the data. We can guide the student to learn similar feature maps in its own layers.
Typically, you choose certain layers of the teacher and student (e.g., each block or at corresponding depths) and add a loss term that penalizes the difference between the student’s feature activations and the teacher’s for the same input. This way the student not only matches outputs but also internal representations, which can lead to better learning of the task-specific features.
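As a rough illustration, the sketch below penalizes the distance between one pair of intermediate feature maps, in the spirit of hint-based (FitNets-style) distillation. Because the student’s channel count usually differs from the teacher’s, a learned 1×1 convolution (an assumption here, not a fixed recipe) projects the student features into the teacher’s space.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillationLoss(nn.Module):
    """MSE between a student feature map and the matching teacher feature map."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv maps the student's channels onto the teacher's channel count.
        self.project = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Both tensors are (batch, channels, height, width) with matching spatial size.
        return F.mse_loss(self.project(student_feat), teacher_feat.detach())
```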
Relation-based methods go one step further by distilling the relationships between multiple activations or data samples from teacher to student. Instead of individual outputs or features, the knowledge here could be, for instance, the pairwise distances between examples in the teacher’s feature space, or the teacher’s attention maps capturing interactions between parts of an input.
Other forms include capturing structural knowledge: e.g., in a teacher Transformer model, which tokens attend to which (the attention matrix) can be seen as relational knowledge; a student Transformer can be trained to have similar attention patterns (this has been explored in NLP distillation).
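One simple relation-based variant matches the pairwise distance structure of a batch of embeddings, roughly following the distance term of relational knowledge distillation; the normalization and loss choice below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def relation_distillation_loss(student_emb, teacher_emb):
    """Match pairwise distances between samples in a batch.
    Embedding dimensions may differ, since only relations between samples are compared."""
    def normalized_distances(x):
        d = torch.cdist(x, x, p=2)                        # (B, B) Euclidean distance matrix
        off_diag = ~torch.eye(len(x), dtype=torch.bool, device=x.device)
        return d / d[off_diag].mean().clamp(min=1e-8)     # scale by mean off-diagonal distance
    return F.smooth_l1_loss(normalized_distances(student_emb),
                            normalized_distances(teacher_emb.detach()))
```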
There are several ways to orchestrate the training of teacher and student models. The original approach by Hinton is often termed offline distillation, but other schemes like online and self-distillation have emerged.
Offline distillation means the teacher model is already trained on a task and then kept fixed (its model weights are frozen during student training). The student is trained using the static teacher’s knowledge. This requires having access to a good teacher model beforehand.
In some cases, you might not have a pretrained teacher, or you want the teacher to adapt during training. Online distillation involves training the teacher and student simultaneously in a coupled manner. Sometimes two models (one large, one small) are trained together on the same task. The large one acts as a teacher for the small one on the fly — at each iteration or epoch, the big model’s current predictions teach the small student model.
Online distillation is useful if a strong teacher is not readily available or to avoid having to train a giant model first. For example, in scenarios of continual learning or streaming data, an online approach might continuously transfer knowledge from a slowly improving teacher into a small student for real-time use.
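As a sketch of one online scheme (in the spirit of deep mutual learning), two models can be trained together, each using the other’s current softened predictions as extra targets; the names and the equal weighting below are illustrative.

```python
import torch.nn.functional as F

def mutual_distillation_step(model_a, model_b, batch, opt_a, opt_b, T=2.0, alpha=0.5):
    """One online step: both models learn from the labels and from each other."""
    inputs, labels = batch
    logits_a, logits_b = model_a(inputs), model_b(inputs)

    def kd(student_logits, teacher_logits):
        # The teacher side is detached so each model only trains on its own branch.
        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits.detach() / T, dim=-1),
                        reduction="batchmean") * (T ** 2)

    loss_a = (1 - alpha) * F.cross_entropy(logits_a, labels) + alpha * kd(logits_a, logits_b)
    loss_b = (1 - alpha) * F.cross_entropy(logits_b, labels) + alpha * kd(logits_b, logits_a)

    opt_a.zero_grad(); opt_b.zero_grad()
    loss_a.backward()
    loss_b.backward()
    opt_a.step(); opt_b.step()
    return loss_a.item(), loss_b.item()
```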
In computer vision, state-of-the-art machine learning models for tasks like image classification, object detection, and semantic segmentation are often extremely large.
Distillation allows us to compress these large vision models while keeping high performance. Some common scenarios:
Cross-Modal and Multi-Task Learning: As mentioned, KD is used in multi-modal systems (vision → text, audio → video, etc.) and also for multi-task scenarios (a teacher performing multiple tasks can teach a student to do the same, transferring holistic knowledge).
Like any technique, knowledge distillation comes with its set of advantages, challenges, and tricks to get the best results. Here we outline some key points:
Here are some key benefits: a much smaller memory and compute footprint, faster inference that suits real-time and edge deployment, accuracy close to (and occasionally exceeding) the teacher’s, and a regularization effect that can improve the student’s generalization compared with training on labels alone.
Finally, here are some limitations of knowledge distillation.
Below are some guidelines that will help you run distillation in the most effective way possible:
Knowledge distillation has become a cornerstone technique in modern deep learning engineering, enabling the deployment of large and complex models in scenarios with limited resources by transferring knowledge to compact models. We’ve seen how the teacher-student framework compresses models, the various types of knowledge that can be distilled (outputs, features, relationships), and the many schemes and extensions that make KD a versatile tool.
Looking forward, research in knowledge distillation continues to expand:
In summary, knowledge distillation stands as a powerful knowledge transfer mechanism in machine learning – one that continues to evolve. By applying the techniques and best practices outlined in this article, you can leverage KD to build smaller, efficient deep learning models that maintain the prowess of their larger counterparts.