Learn about the advantages and disadvantages of Active Learning methods and which of them solves your problem.
Why is Active Learning relevant?
With the advent of deep learning, machine learning models are now trained on huge amounts of data, allowing us to solve completely new use cases. However, obtaining large labeled datasets comes at a cost. Take the famous image classification dataset ImageNet with 14 million images as an example: Given that classification annotations cost around $25 per 1000 images, the cost of labeling such a dataset is already $350,000. On top of that come the expenses for storing and handling large amounts of images and for training on so much data. Furthermore, labeling large amounts of data can take months, slowing you down.
This problem has led to the research field of Active Learning. Active Learning chooses a subset of your unlabeled data. Only that data will later be labeled and used for training your model. By choosing this subset well, your model can have similar or even better performance than if trained on the whole dataset. This leads us to a very broad definition of Active Learning:
Active Learning is the choice of a subset of unlabeled data to be labeled such that a machine learning model trained on this data performs best.
This broad idea has led to many different approaches to choosing this subset. Researchers have then built upon the ideas of other researchers by refining their methods and combining them with other methods to do active learning even better. To understand papers in this area and how they can help you with your use case, it helps to group them so that you can spot related ideas quickly.
The graph makes three kinds of distinctions:
- Do you use data about the samples that is specific to your task and classes? Or do you use general data about the images? While task-dependent active learning lets you squeeze out the last bits of performance, it is harder to implement and can struggle, e.g. in low-data settings.
- Which kind of data do you use as criteria for your selection?
- What do you do with the data? Some ideas here are based on intuition and heuristics, some are even mathematically proven to increase your performance.
Next, we'll go through all these different methods to understand why they were developed and which advantages and drawbacks come with them.
Task-dependent active learning
Task-dependent active learning optimizes for your specific computer vision task, e.g. classification, object detection or segmentation. It starts with a machine learning model that is at least similar to the model you will end up with. Most often, this is your machine learning model trained on a small starting set of your data. E.g. you can first choose 1% of your data randomly, label it and then train your machine learning model on this starting set. Alternatively, you can also use a pretrained model, e.g. a general-purpose object detector if your task is about object detection for specific cases.
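To make the random starting set concrete, here is a minimal sketch using only the Python standard library. The function name `choose_starting_set`, the file names and the fixed seed are assumptions for illustration, not part of any particular framework.

```python
import random


def choose_starting_set(sample_ids, fraction=0.01, seed=0):
    """Randomly pick a small fraction of the unlabeled pool to label first."""
    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    n_start = max(1, round(len(sample_ids) * fraction))
    return rng.sample(list(sample_ids), n_start)


# Hypothetical pool of 1000 image identifiers.
pool = [f"img_{i:04d}.jpg" for i in range(1000)]
starting_set = choose_starting_set(pool)  # 1% of 1000 images -> 10 images
```

The images in `starting_set` would then be labeled and used to train the first version of the model.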
Using prediction probabilities
Given a model for your task, you let it run inference on all your unlabeled data to get predictions for all of them. E.g. for classification, object detection and both instance and semantic segmentation, you probably get a prediction probability vector over your classes that tells you something like “The model is 80% sure this is class A, 15% sure it is class B and 5% sure it is class C”. This probability vector is the basis for the most popular active learning methods: uncertainty-based active learning. From the probability vector, you compute how unsure your model is about the prediction.
Classic Uncertainty selection
Then, in the selection step of active learning, you just choose those samples whose uncertainty is highest. The idea is intuitive: Those are the hard samples for your model, thus adding them to your training set helps your model most. As they are close to the decision border of your model, they are well-suited to refine it.
The classical uncertainty selection methods are Entropy Sampling, Margin Sampling and Least Confidence Sampling. You have probably already read about them and they are covered extensively in existing literature, e.g. in the Active Learning Literature Survey (2010).
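The three classic scores can be sketched in a few lines of plain Python. The function names, the sample names and the example probability vectors are assumptions for illustration. All three scores are written so that a higher value means a more uncertain prediction.

```python
import math


def least_confidence(probs):
    """1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_score(probs):
    """Negative gap between the two most likely classes (small gap = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return -(top - second)


def entropy(probs):
    """Shannon entropy of the prediction vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, score_fn, budget):
    """Return the `budget` sample names with the highest uncertainty score."""
    ranked = sorted(predictions, key=lambda name: score_fn(predictions[name]), reverse=True)
    return ranked[:budget]


# Hypothetical predictions for three unlabeled images over classes A, B, C.
predictions = {
    "img_a": [0.80, 0.15, 0.05],
    "img_b": [0.40, 0.35, 0.25],
    "img_c": [0.98, 0.01, 0.01],
}
chosen = select_most_uncertain(predictions, entropy, budget=1)
```

In this toy example all three scores agree and pick `img_b`, the sample whose probabilities are spread most evenly over the classes.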
Classic uncertainty selection has many drawbacks:
First, selecting the “hard” samples close to the decision border can even hurt performance, as it causes the model to overfit on fine nuances or even label errors while failing to get the bigger picture and the high-signal features. This problem was shown in the recent paper Uniform versus uncertainty sampling: When being active is less efficient than staying passive (2022).
As an example, take the binary classification problem in 2D space solved by a 10-nearest-neighbour classifier shown in the next two illustrations. The optimal decision border would be a vertical line through the middle. Uncertainty sampling will thus choose samples close to it.
A second problem of uncertainty sampling is that deep learning models can be very bad at estimating their own uncertainty.
Both of these problems are addressed by Informativeness Estimation methods, which compute the uncertainty of predictions in a better way.
Better Informativeness Estimation
There are several methods that try to estimate the uncertainty of a model on a sample better than just using the plain prediction vector:
Methods like the one in the Learning Loss for Active Learning (2019) paper try to predict the loss of a prediction, forcing the model to accurately estimate its own uncertainty.
Bayesian methods see the prediction of a machine learning model as the realisation of a stochastic process. Then they estimate the variance of this stochastic process and use it as a proxy for the uncertainty of a sample.
This stochastic process can come in many different forms: E.g. it can be the discrete choice between different machine learning models or ensembles trained on the same set. These different models then all predict on each sample, and the disagreement between the models is the uncertainty of this sample. This approach is also called Query-by-Committee (1992) and is a very classic approach.
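One common way to turn committee disagreement into a number is vote entropy: the entropy of the hard class votes of the committee members. The sketch below assumes you already have one predicted class per model and sample. The names `vote_entropy` and `committee_votes` and the example votes are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's class votes for a single sample."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical votes of a three-model committee on three unlabeled images.
committee_votes = {
    "img_a": ["cat", "cat", "cat"],   # full agreement
    "img_b": ["cat", "dog", "bird"],  # full disagreement
    "img_c": ["cat", "cat", "dog"],   # partial disagreement
}
ranking = sorted(committee_votes, key=lambda s: vote_entropy(committee_votes[s]), reverse=True)
```

Here `ranking` puts `img_b` first, because every committee member predicts a different class for it.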
For deep learning models, using ensembles of different models can be quite costly. One of the most popular recent active learning papers, Deep Bayesian Active Learning with Image Data (2017), solved this by approximating ensembles by keeping dropout active during prediction as well. The randomness of dropout leads to different predictions, and the variance of these predictions can then be used to choose the best samples to add. The acquisition function used for this is called Bayesian Active Learning by Disagreement (BALD).
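The BALD score is commonly written as a mutual information: the entropy of the averaged prediction minus the average entropy of the individual stochastic predictions. The sketch below assumes the stochastic forward passes (e.g. with dropout left on) have already been run and only shows the scoring step. The function names and the example numbers are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def bald_score(stochastic_probs):
    """Entropy of the mean prediction minus the mean entropy of each pass.

    `stochastic_probs` holds one probability vector per stochastic forward pass.
    """
    n_passes = len(stochastic_probs)
    n_classes = len(stochastic_probs[0])
    mean_probs = [sum(p[c] for p in stochastic_probs) / n_passes for c in range(n_classes)]
    expected_entropy = sum(entropy(p) for p in stochastic_probs) / n_passes
    return entropy(mean_probs) - expected_entropy


# Hypothetical outputs of four stochastic passes for two samples.
passes_disagree = [[0.95, 0.05], [0.05, 0.95], [0.95, 0.05], [0.05, 0.95]]
passes_agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

score_disagree = bald_score(passes_disagree)
score_agree = bald_score(passes_agree)
```

The first sample gets the higher score: each pass is confident, but the passes contradict each other. The second sample is ambiguous in every pass, so the passes agree and the score is close to zero, even though its plain entropy is high.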
However, even a method predicting its own uncertainty perfectly cannot overcome two severe drawbacks of uncertainty selection:
Choosing too similar samples
When choosing a batch of many samples at once, two samples that are very similar and both uncertain can be chosen. However, adding one of them would be enough.
Let’s assume that all of the “8” digits have the highest uncertainty score according to BALD and are chosen. However, this ignores the fact that all these digits look similar. Instead, a batch-aware variant would choose different uncertain images. Thus, the paper BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning (2019) extended BALD by additionally choosing diverse samples. This is the key idea followed by Diversity Selection, which we will cover in a later section.
The second drawback: uncertainty sampling might choose outliers that are not representative of the overall distribution. Representative Selection overcomes this problem and is also covered in a later section.
While most academic datasets are balanced w.r.t. their classes, real-world datasets can be strongly imbalanced. This can pose problems for many active learning algorithms that optimize for the common cases and classes, but do not take per-class performance into account. Thus, balancing active learning methods have emerged that try to keep a uniform distribution over the different classes by preferably choosing samples from underrepresented classes and/or not choosing samples from overrepresented classes. In contrast to uncertainty sampling, balancing thus sees the value of a sample as depending not only on the sample itself, but also on the other samples selected so far.
While balancing has no serious drawbacks and can improve performance significantly for underrepresented classes, it does not provide much value if your dataset is already balanced or you have many samples per class anyway. Thus, it is usually combined with uncertainty sampling, which can yield big performance gains, especially on imbalanced datasets. See the papers Class-Balanced Active Learning for Image Classification (2021) and Minority Class Oriented Active Learning for Imbalanced Datasets (2022).
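A very simple form of balancing can be sketched as a greedy loop: always pick the next sample from the class that has the fewest samples so far. This is not the method of either paper cited above. It assumes that the model's predicted class is a usable stand-in for the unknown true class, and the names `balanced_selection`, `predicted_classes` and `labeled_counts` are hypothetical.

```python
from collections import Counter, defaultdict


def balanced_selection(predicted_classes, labeled_counts, budget):
    """Greedily pick samples from the currently rarest class."""
    # Group the unlabeled samples by their predicted class.
    pool = defaultdict(list)
    for sample_id, cls in predicted_classes.items():
        pool[cls].append(sample_id)

    counts = Counter(labeled_counts)  # samples per class in the labeled set
    selected = []
    while len(selected) < budget:
        candidates = [cls for cls in pool if pool[cls]]
        if not candidates:
            break  # the unlabeled pool is exhausted
        rarest = min(candidates, key=lambda cls: counts[cls])
        selected.append(pool[rarest].pop(0))
        counts[rarest] += 1
    return selected


# Hypothetical pool: the labeled set already has 5 plastic and 1 aluminium sample.
predicted_classes = {
    "a1": "plastic", "a2": "plastic", "a3": "plastic",
    "b1": "aluminium", "b2": "aluminium",
}
selected = balanced_selection(predicted_classes, {"plastic": 5, "aluminium": 1}, budget=3)
```

In this example both aluminium samples are chosen before any plastic sample. To combine balancing with uncertainty, the samples within each class could be sorted by an uncertainty score before the loop starts.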
Task-independent active learning
Task-independent active learning chooses samples that are good in general, no matter whether you want to train a classification, object detection or segmentation model on them or which classes you have. These methods also work if you don’t know exactly what to do with your dataset and which kind of samples it contains. This makes them much more general and easy to use.
Metadata-based active learning
Metadata-based active learning is probably the most used active learning method of all, but it is so simple that it is usually not called active learning. It is based on additional data about your images. This data can be external, e.g. the time or the location your image was taken. It can also be computed from the image itself, e.g. its sharpness or luminance.
E.g. let’s assume that your model struggles with images from location A. Then you select and label more images from location A. And if your model struggles with images taken during rush hour, you select more images taken between 7 and 9 in the morning. If your camera is sometimes covered with water drops, making those images blurry, you can filter them out. The same goes for images from smartphones where the camera was accidentally covered by a finger.
Instead of making hard in-or-out decisions, you can also use metadata in a softer way: E.g., prefer samples with a high value of some metadata.
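Both the hard and the soft use of metadata fit into a few lines. The sketch below assumes each sample is a dictionary with hypothetical `sharpness` and `hour` fields. The threshold, the field names and the helper names are illustrative only.

```python
import random


def keep_sharp_images(samples, min_sharpness):
    """Hard filter: drop images whose sharpness is below the threshold."""
    return [s for s in samples if s["sharpness"] >= min_sharpness]


def prefer_high_values(samples, key, budget, seed=0):
    """Soft selection: draw without replacement, weighted by a metadata value.

    Assumes the metadata values are positive numbers.
    """
    rng = random.Random(seed)
    remaining = list(samples)
    chosen = []
    while remaining and len(chosen) < budget:
        weights = [s[key] for s in remaining]
        index = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(index))
    return chosen


samples = [
    {"id": "img_1", "sharpness": 0.9, "hour": 8},
    {"id": "img_2", "sharpness": 0.1, "hour": 8},   # blurry, e.g. water drops
    {"id": "img_3", "sharpness": 0.8, "hour": 14},
    {"id": "img_4", "sharpness": 0.7, "hour": 7},
]
sharp = keep_sharp_images(samples, min_sharpness=0.5)
rush_hour = [s for s in sharp if 7 <= s["hour"] <= 9]
soft_pick = prefer_high_values(sharp, key="sharpness", budget=2)
```

The hard filter removes the blurry `img_2` entirely, while the weighted draw only makes sharper images more likely to be picked.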
There are few research papers on this subject, because academic datasets usually have all this preprocessing already done. However, if you know that your dataset has problems with special kinds of images, it can help a lot to use simple metadata filtering or weighting.
As metadata selection uses metadata as a proxy for the informativeness of a sample, it also inherits the drawbacks of informativeness-based selection: It might choose too similar samples when choosing a batch of many samples at once, and it might choose outliers.
Embedding-based active learning
An embedding is a lower-dimensional representation of an image in a Euclidean space. It has the property that similar images have embeddings with a low distance to each other and very different images have embeddings that are far apart.
These embeddings can be obtained in different ways. E.g. you can use the activations of the last layer of a deep learning model trained on your data as embeddings, like in the Active Learning for Convolutional Neural Networks: A Core-Set Approach (2017) paper.
If the deep learning model trained on your data has not learned many features yet, these embeddings might not be good. In that case, using the gradients instead of the activations can help, as done in Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds (2020).
However, extracting these embeddings requires changes to your code of training your model, making it harder to use. Using embeddings that are independent of any supervised model can also be valuable and cause much less implementation effort. Self-supervised learning is useful to generate these embeddings without needing any labeled images. See Extending Contrastive Learning to Unsupervised Coreset Selection (2021) as an example.
As already discussed, similar samples contain very similar information, and adding one of those similar samples is usually sufficient. Diversity selection therefore chooses samples that are different from each other. This approach was made popular by the Active Learning for Convolutional Neural Networks: A Core-Set Approach (2017) paper. Furthermore, the paper showed mathematically that choosing the samples to label such that they cover the space of unlabeled samples best minimizes an upper bound on the performance difference between training on the whole set and the subset. When you think more about it, this is very intuitive: An unlabeled sample that is already covered by a labeled sample does not contain new information, thus you don’t need it.
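A common way to approximate this kind of coverage is a greedy farthest-point (k-center) heuristic: repeatedly pick the unlabeled sample that is farthest from everything chosen so far. The sketch below is a plain Python illustration of that heuristic, not the paper's implementation. The function name, the 2D toy embeddings and the sample names are assumptions.

```python
import math


def k_center_greedy(embeddings, labeled_ids, budget):
    """Greedily pick the sample farthest from all labeled or selected samples."""
    # Distance of every unlabeled sample to its nearest labeled sample.
    min_dist = {
        sid: min((math.dist(emb, embeddings[c]) for c in labeled_ids), default=math.inf)
        for sid, emb in embeddings.items()
        if sid not in labeled_ids
    }
    selected = []
    for _ in range(min(budget, len(min_dist))):
        farthest = max(min_dist, key=min_dist.get)
        selected.append(farthest)
        del min_dist[farthest]
        # The new center may now be the nearest one for some remaining samples.
        for sid in min_dist:
            min_dist[sid] = min(min_dist[sid], math.dist(embeddings[sid], embeddings[farthest]))
    return selected


# Hypothetical 2D embeddings: "a" and "b" are near-duplicates, so are "c" and "d".
embeddings = {"a": (0, 0), "b": (0.1, 0), "c": (5, 5), "d": (5.1, 5), "e": (0, 5)}
selected = k_center_greedy(embeddings, labeled_ids=["a"], budget=2)
```

With `a` already labeled, the sketch selects `d` and then `e`. The near-duplicates `b` and `c` are skipped because `a` and `d` already cover them.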
In many use cases, optimizing for the common case is enough. You don’t need your model to work well for edge cases and outliers, but it should work really well for the large majority of samples it sees. This can also be achieved by embedding-based active learning: Samples that are within a group of many similar unlabeled samples (i.e. samples with similar embeddings) can be preferred and samples whose embeddings are outliers can be removed. This idea was used in the Exploring Representativeness and Informativeness for Active Learning (2019) paper.
Representative selection performs very well when you want your model to perform really well for the common cases. An example business case is object detection for recycling: You want your model to perform very well for the majority of waste, because this is where you make money recycling it. Misclassifying an object now and then does not matter too much.
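One simple proxy for representativeness is local density: a sample whose nearest neighbours in embedding space are close lies in a dense region, while an outlier is far from all of them. The sketch below uses the mean distance to the k nearest neighbours. It is not the method of the cited paper, and the function name and toy embeddings are assumptions.

```python
import math


def knn_mean_distance(embeddings, k):
    """Mean distance to the k nearest neighbours (low = dense, high = outlier)."""
    scores = {}
    for sid, emb in embeddings.items():
        dists = sorted(
            math.dist(emb, other) for oid, other in embeddings.items() if oid != sid
        )
        nearest = dists[:k]
        scores[sid] = sum(nearest) / len(nearest)
    return scores


# Hypothetical 2D embeddings: four samples in a tight cluster and one far away.
embeddings = {
    "c1": (0, 0), "c2": (0.1, 0), "c3": (0, 0.1), "c4": (0.1, 0.1),
    "outlier": (5, 5),
}
scores = knn_mean_distance(embeddings, k=2)
by_representativeness = sorted(scores, key=scores.get)  # most representative first
```

Samples at the front of `by_representativeness` can be preferred, and the sample at the end (here `outlier`) can be removed if edge cases do not matter for your use case.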
However, sometimes you want your model to have acceptable performance especially for the outlier cases, e.g. if your model is used in autonomous driving or for safety and security applications. In that case, you should not use representative selection.
Combining Active Learning methods
You have probably guessed it already: As each of these methods makes sense on its own, but has some drawbacks covered by other methods, why not combine several of them? This idea was already picked up by many of the papers discussed so far. The dream team when it comes to combinations is Uncertainty Selection and Diversity Selection. While both can be very strong on their own, they cancel out each other's weaknesses, making them even stronger. The use of balanced, metadata and representative selection depends much more on your specific use case. However, knowing about them is very important: It forces you to think about your actual goal and what kind of performance actually matters.
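As an illustration of how uncertainty and diversity can be combined, the sketch below greedily picks the sample with the highest product of its uncertainty and its distance to the nearest already selected sample. This scoring rule is an assumption chosen for simplicity, not BatchBALD or any other published method, and all names and numbers are hypothetical.

```python
import math


def uncertain_and_diverse(embeddings, uncertainty, budget):
    """Greedy selection by uncertainty times distance to the selected set."""
    remaining = list(embeddings)
    selected = []

    def combined_score(sid):
        if not selected:
            return uncertainty[sid]  # first pick: pure uncertainty
        nearest = min(math.dist(embeddings[sid], embeddings[s]) for s in selected)
        return uncertainty[sid] * nearest

    while remaining and len(selected) < budget:
        best = max(remaining, key=combined_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near-duplicate, highly uncertain samples and one different, less uncertain one.
embeddings = {"dup_1": (0, 0), "dup_2": (0.05, 0), "other": (4, 0)}
uncertainty = {"dup_1": 0.90, "dup_2": 0.88, "other": 0.60}
selected = uncertain_and_diverse(embeddings, uncertainty, budget=2)
```

Pure uncertainty selection would pick both near-duplicates. The combined score picks `dup_1` and then `other`, because `dup_2` adds almost nothing once `dup_1` is in the batch.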
Which active learning methods should I use?
This question is one of the hardest to answer, as it depends a lot on your use case and on what you want to achieve. An answer like “I want to improve the accuracy on my test set” is a poor answer for two reasons:
First, accuracy is an arbitrary metric, and just choosing another metric like F1-score or recall does not make it better. Instead, you should think about it from a business perspective and about the gains and costs of each correct or incorrect prediction. An example from waste sorting: Classifying a sheet of aluminium correctly gives you the value of that sheet as a gain. As plastic is usually less valuable, your metric should reflect this discrepancy and measure your actual business value. Using the active learning method of balancing to choose more samples of aluminium than plastic can increase the business value of your model.
The second problem with the simple answer is the “test set” part. How did you choose that test set? What would be the perfect test set? Again, this depends a lot on where you want to apply your model. Ideally, you already select your test set using active learning to make sure it has the properties you want it to have. E.g. the part of the real world where you want to apply your model should be represented as well as possible.
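A business-oriented metric can be as simple as a lookup table that assigns a gain or cost to every combination of true and predicted class. The values below (in cents per object) are invented purely for illustration, as are the function and variable names.

```python
def business_value(true_labels, predicted_labels, value_table):
    """Sum the gain or cost of every single prediction."""
    return sum(value_table[(t, p)] for t, p in zip(true_labels, predicted_labels))


# Hypothetical values in cents, keyed by (true class, predicted class).
value_table = {
    ("aluminium", "aluminium"): 100,  # valuable material recovered
    ("plastic", "plastic"): 20,       # less valuable material recovered
    ("aluminium", "plastic"): 0,      # aluminium lost in the plastic stream
    ("plastic", "aluminium"): -10,    # contaminated aluminium stream
}

true_labels = ["aluminium", "aluminium", "plastic", "plastic"]
model_a = ["aluminium", "aluminium", "plastic", "aluminium"]
model_b = ["aluminium", "plastic", "plastic", "plastic"]

value_a = business_value(true_labels, model_a, value_table)
value_b = business_value(true_labels, model_b, value_table)
```

Both hypothetical models get three out of four predictions right, so their accuracy is identical. Under this value table, however, `model_a` is worth 210 and `model_b` only 140, because `model_b` misses a sheet of aluminium.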
How can I try it out?
Setting up a complete active learning framework can be quite time-consuming, especially if you want to use and combine multiple methods. At Lightly, we have built a framework that offers most of these methods and allows combining them. To get started, I can recommend the tutorial on combining many different Active Learning methods to improve the YOLOv8 object detection model.