Active Learning for Object Detection on Videos

Object detection in videos has become increasingly critical in diverse applications, ranging from autonomous vehicles to surveillance systems. However, annotating extensive video datasets manually presents a major challenge. To tackle this issue, active learning offers a solution to reduce labeling costs and enhance object detection model performance.

In this blog post, we explore active learning in the context of video object detection, comparing two approaches: metadata-based selection and object-diversity active learning. The former balances the selection across videos and thus also across environment, lighting, and weather conditions, while the latter focuses on selecting frames with informative objects. We show that object diversity selection leads to higher object detection performance, both per labeled frame and per labeled object.

BDD100k dataset

example image from the BDD100k dataset

We use the BDD100K dataset as a benchmark dataset. It is a large-scale and diverse driving video dataset for many machine-learning tasks. The dataset contains 100,000 videos of 40 seconds each, covering more than 1,000 hours of driving experience and more than 100 million frames. The dataset covers various geographic, environmental, and weather conditions. This helps models trained on the dataset to adapt to different scenarios. The dataset also provides annotations for 10 tasks, including object detection.

Of course, not all of the millions of frames are fully labeled, as that would be extremely expensive. Instead, we use a part of the MOT20 subset in this benchmark, consisting of 200 training videos and 200 test videos. The videos are sampled at 5 fps, so each 40-second video has about 200 frames on average. This gives about 40,000 training frames and 40,000 test frames, all with full object detection labels.

Active Learning

Active learning is the choice of a training subset to be labeled, such that a model trained on this subset performs as well as possible. There are many different approaches to selecting this subset, e.g. selecting it such that its samples are information-rich, diverse, and representative. Especially for video datasets, selecting frames with little visual redundancy between them helps a lot to increase performance, as already shown in this paper. For more information, see our overview of active learning methods.

Apart from increasing model performance given a fixed dataset size or labeling budget, active learning can also reduce the effort needed to get a target model performance. Here two kinds of effort have to be distinguished:

  • The labeling cost for object detection is usually proportional to the number of bounding boxes.
  • The costs of data handling are proportional to the number of frames. Model training duration, and thus training cost, also scales with the number of frames.

Thus, the total effort for developing object detection models scales with both the number of images and the number of bounding boxes in the dataset.
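The two cost components can be written as a simple linear model. The following is a minimal sketch, not a measured cost model: the function name and the unit costs are hypothetical and only illustrate how frames and bounding boxes both contribute to the total effort.

```python
def total_effort(num_frames, num_boxes, cost_per_frame, cost_per_box):
    """Linear effort model: data handling and training scale with the
    number of frames, labeling scales with the number of bounding boxes."""
    return num_frames * cost_per_frame + num_boxes * cost_per_box


# Hypothetical unit costs in arbitrary units, for illustration only.
COST_PER_FRAME = 2
COST_PER_BOX = 5

effort = total_effort(
    num_frames=800,
    num_boxes=8000,
    cost_per_frame=COST_PER_FRAME,
    cost_per_box=COST_PER_BOX,
)
print(effort)
```

With such a model, two selections with the same number of frames can still differ in cost if one of them contains more objects per frame.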

Benchmark setup

Benchmarking active learning methods requires 3 components:

  • The selection parameters: The input dataset and the number of samples that should be selected.
  • The active learning methods themselves, choosing a subset out of a larger dataset.
  • A machine learning model with its hyperparameters, which is trained on the subset of the training dataset selected by the active learning method. Then this model is evaluated on a test dataset.

Selection setup

The input dataset to select the training dataset from has 200 rather short videos. As frames from the same video are quite similar, it does not make sense to select many frames per video. Thus we do the benchmarks with 2 and 4 frames per video, which are 400 and 800 frames in total.

Active Learning methods

One very simple active learning method for videos is to balance the selection across videos, i.e. to select the same number of frames from each video. The videos from the BDD100k dataset show many different scenes (urban, country road, …) and they were taken at different times of the day and in different weather conditions. Thus balanced selection yields a diverse and representative selection with respect to this video-level metadata. Within each video, the frames are selected randomly.
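The balanced baseline is simple enough to sketch in a few lines. The function name, the seed handling, and the toy input below are assumptions for illustration; only the idea (the same number of random frames from every video) comes from the text.

```python
import random


def select_balanced_per_video(frames_by_video, frames_per_video, seed=0):
    """Pick the same number of random frames from every video."""
    rng = random.Random(seed)
    selection = {}
    for video_id, frames in frames_by_video.items():
        # Guard against videos that are shorter than the requested count.
        k = min(frames_per_video, len(frames))
        selection[video_id] = sorted(rng.sample(frames, k))
    return selection


# Toy input with the benchmark's sizes: 200 videos with 200 frame indices each.
frames_by_video = {f"video_{i:03d}": list(range(200)) for i in range(200)}
selected = select_balanced_per_video(frames_by_video, frames_per_video=4)
total = sum(len(frames) for frames in selected.values())
print(total)
```

Using a seeded `random.Random` instance keeps repetitions with different seeds reproducible, which matters because this strategy is not deterministic.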

We compare this simple baseline approach with an object diversity selection approach. The idea is to choose video frames such that the objects in them differ as much as possible. The diversity between objects is measured by first using a self-supervised embedding model to generate an embedding for each object and then measuring the distances in the embedding space. For details, see our docs on object diversity selection.

To find out which objects exist in a video, predictions for it are needed. To create these predictions, we first select 400 frames randomly, label them, and train the object detection model on them. This model is then used to create predictions for the remaining frames. Finally, object diversity selection picks 400 more samples on top of the 400 already randomly selected samples, for a total of 800 samples.
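To make the idea of diversity in embedding space more concrete, here is a minimal sketch using greedy farthest-point selection. This is an assumption chosen for illustration, not necessarily the algorithm used in the benchmark (see the docs for that). The function name and the toy 2D embeddings are hypothetical; real embeddings would come from a self-supervised model applied to the predicted objects.

```python
import math


def select_diverse_frames(embeddings_by_frame, preselected, num_to_add):
    """Greedily add frames whose objects are far from all objects
    in the frames selected so far (farthest-point selection)."""
    selected = list(preselected)
    # Embeddings of all objects in the already selected frames.
    covered = [e for f in selected for e in embeddings_by_frame[f]]
    # Flat list of (frame_id, embedding) for all candidate objects.
    candidates = [
        (f, e)
        for f, embs in embeddings_by_frame.items()
        if f not in preselected
        for e in embs
    ]
    # Distance of each candidate object to its nearest covered object.
    min_dist = [
        min((math.dist(e, c) for c in covered), default=math.inf)
        for _, e in candidates
    ]
    for _ in range(num_to_add):
        remaining = [i for i, (f, _) in enumerate(candidates) if f not in selected]
        if not remaining:
            break
        # The most novel object decides which frame is added next.
        best = max(remaining, key=lambda i: min_dist[i])
        frame = candidates[best][0]
        selected.append(frame)
        # All objects of the new frame now count as covered.
        for new_emb in embeddings_by_frame[frame]:
            for i, (_, e) in enumerate(candidates):
                min_dist[i] = min(min_dist[i], math.dist(e, new_emb))
    return selected[len(preselected):]


# Toy 2D embeddings, for illustration only.
embeddings_by_frame = {
    "a": [(0.0, 0.0)],
    "b": [(0.1, 0.0)],
    "c": [(5.0, 5.0), (0.0, 0.1)],
    "d": [(0.0, 3.0)],
}
added = select_diverse_frames(embeddings_by_frame, preselected=["a"], num_to_add=2)
print(added)
```

In this toy example, frame "b" is skipped because its only object is almost identical to the object in the preselected frame "a". Note that the sketch never uses class labels, and that frames without any predicted objects are never selected.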

object diversity selection setup

Object detection model and test dataset

We use the YOLOv7 large object detection model with the implementation by the mmyolo project and train it on the subsets selected by the active learning methods. The standard hyperparameters are used and the model is trained for 100 epochs. The evaluation is done on the validation set of the MOT20 subset of the BDD100k dataset, which has 200 videos with about 200 fully labeled frames each.

Qualitative Selection Analysis

To analyze the active learning methods qualitatively, we manually inspect their selected images and the distribution of the different classes.

Example images

first 24 images selected by the balanced per video selection. Images in the same box are from the same video.

The example images of the balanced per video selection show that exactly 4 frames per video have been selected. As some of the videos show little variation, some of the selected frames are quite similar.

first 24 images selected by the object-diversity selection. Images in the same box are from the same video.

The example images selected by object-diversity selection show that the number of frames selected per video varies a lot. The videos from which many frames are selected are usually urban scenes with pedestrians, e.g. at crosswalks.

Class distribution

Number of objects per class in input, balanced per video selection and object-diversity selection. Mean over 2 repetitions per selection.

The value "input / 50" is the expected number of objects when selecting 800 samples randomly out of 40,000 samples, since 800 / 40,000 = 1 / 50. The input is highly imbalanced, with most objects being cars. The balanced selection across videos keeps this imbalance and selects almost the same number of objects as expected from a random selection.
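The scaling behind this reference value can be sketched as follows. The per-class counts are hypothetical toy numbers, not the benchmark's actual class distribution, and the function name is an assumption.

```python
def expected_random_counts(input_counts, num_selected, num_total):
    """Expected number of objects per class when frames are sampled
    uniformly at random from the input."""
    return {
        cls: count * num_selected / num_total
        for cls, count in input_counts.items()
    }


# Hypothetical per-class object counts, for illustration only.
input_counts = {"car": 5000, "pedestrian": 500, "truck": 50}
expected = expected_random_counts(input_counts, num_selected=800, num_total=40_000)
print(expected)
```

Random selection scales every class by the same factor, so the imbalance of the input is preserved in expectation.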

Object-diversity selection, however, selects many more samples from all classes except cars. It does so even though the (predicted) class label is never used in the selection. The diversity across classes comes solely from the self-supervised embeddings, which allow us to measure the diversity of objects in the embedding space.

Benchmark Results

In the results, we compare three selection strategies:

  • balanced selection across videos, with 2 or 4 frames per video
  • random selection of 400 and 800 frames respectively
  • object diversity selection of 400 additional frames on top of 400 random frames

All selections and model training were repeated with two random seeds each.

In all cases, selecting 800 instead of 400 frames leads to a higher mAP. Random selection and balanced selection across videos both lead to an mAP increase of about 1–1.5%. The variance due to randomness in both selection strategies is too high to find a significant difference between them.

Object diversity selection, however, leads to a much higher mAP increase. As it is a deterministic selection strategy, the variance only comes from the model training. Thus the total variance is much lower.

mAP comparison of random, balanced across videos and object diversity selection. Error bars from repetitions with 2 random seeds each.

To compare the selections fairly, we not only look at the mAP for a fixed number of frames but also at the number of objects. Both random selection and balanced selection across videos select samples with about 9.6 objects per frame on average. Object diversity selection selects many more objects, namely 13.1 objects per frame on average. Even when taking this into account, object diversity selection still leads to a higher mAP gain per object.
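One possible way to normalize the mAP gain by the labeling effort is sketched below. The function name and the exact normalization are assumptions. The example only plugs in numbers stated in the text for random selection (400 additional frames, about 9.6 objects per frame, and the lower end of the reported 1–1.5% gain, read here as percentage points); the values for the other strategies would have to be taken from the measured results.

```python
def map_gain_per_object(map_gain, num_added_frames, objects_per_frame):
    """mAP gain divided by the number of additionally labeled objects."""
    num_added_objects = num_added_frames * objects_per_frame
    return map_gain / num_added_objects


# Random selection: 400 additional frames with about 9.6 objects per frame.
# The gain of 1.0 is the lower end of the reported 1-1.5% range.
random_gain_per_1000_objects = 1000 * map_gain_per_object(
    map_gain=1.0, num_added_frames=400, objects_per_frame=9.6
)
print(round(random_gain_per_1000_objects, 3))
```

A strategy that selects more objects per frame must achieve a proportionally higher mAP gain to come out ahead under this metric.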

performance comparison of all strategies. All numbers are the mean of repetitions with 2 random seeds each.


In conclusion, this benchmark has shown that object-diversity selection leads to much better object detection performance on the MOT20 subset of the BDD100k dataset using a YOLOv7 model. The higher performance can be attributed to selecting frames with many more objects of underrepresented classes.

Overall, the performance boost from object-diversity selection can make machine learning models quicker and cheaper to develop, or increase their performance without requiring large amounts of labeled data and compute.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Malte Ebner
Machine Learning Engineer
