3D Active Learning using Self-Supervised Representations


In the ever-evolving field of computer vision, accurate and efficient 3D object detection on point clouds remains a prominent challenge. Traditional approaches rely heavily on manually labeled datasets, which are time-consuming and resource-intensive to obtain. One promising solution is to combine active learning with representations learned through self-supervised learning.

Active learning lets models select the most informative data samples for annotation, significantly reducing the annotation effort while maintaining or even improving detection performance. Meanwhile, self-supervised learning techniques allow models to learn powerful representations from unlabeled data, enabling them to capture diverse and meaningful features from raw data. In recent years, with the rise of transformer architectures in computer vision, self-supervised learning on point clouds has gained popularity.

In this Medium blog post, we explore active learning for 3D bounding box detection on point clouds and show how it can leverage self-supervised representations to construct a diverse and informative dataset.


Example image from the Kitti cars dataset with 3D bounding boxes. Source: Official website.

For the purpose of this blog post, we’ll use the Kitti 3D Object Detection Evaluation dataset. The dataset consists of 7,481 training images and 7,518 test images, along with their corresponding point clouds, containing a total of 80,256 labeled objects. Precision-recall curves are computed for evaluation, and the methods are ranked based on average precision. The evaluation follows the PASCAL criteria used in 2D object detection. In this blog post we focus on 3D car detection.
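To make the ranking metric concrete, here is a minimal sketch of a PASCAL-style interpolated average precision. It assumes that detections have already been matched to ground-truth boxes at the chosen overlap threshold, so each detection is simply flagged as a true or false positive. The function name and the default of 11 recall levels are illustrative assumptions; the official Kitti development kit has its own matching and recall sampling rules, which this sketch does not reproduce.

```python
import numpy as np


def interpolated_average_precision(scores, is_true_positive, num_ground_truth,
                                   num_recall_points=11):
    """PASCAL-style interpolated AP from scored detections (illustrative sketch)."""
    # Rank detections from most to least confident.
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=bool)[order]

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(~tp)
    recall = cum_tp / max(num_ground_truth, 1)
    precision = cum_tp / (cum_tp + cum_fp)

    # For each recall level, take the best precision reached at that recall or higher.
    recall_levels = np.linspace(0.0, 1.0, num_recall_points)
    interpolated = [
        precision[recall >= level].max() if np.any(recall >= level) else 0.0
        for level in recall_levels
    ]
    return float(np.mean(interpolated))
```

Averaging the interpolated precision over fixed recall levels rewards detectors that stay precise as recall grows, which is why a single number can summarize the whole precision-recall curve.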

Example of a 3D point cloud in the Kitti dataset. Source: 3D-Detection-Tracking-Viewer.


3DSSD (Point-based 3D Single Stage Object Detector) is a deep learning model designed for 3D object detection from point clouds. As a single-stage detector, it predicts 3D bounding boxes directly from point features, without a separate refinement stage. The figure below shows an overview of the model.

3DSSD: Point-based 3D Single Stage Object Detector. The backbone network generates global features from raw point cloud data (x, y, z, r). Then, the next layer generates candidates for the prediction head. Source: Arxiv.

Although 3DSSD is a bit of an older model, we decided to use it because of its memory efficiency, ease of use, and its ability to work with raw point clouds. For all experiments we used mmdetection3d with default settings for Kitti.

3D Self-Supervised Learning (ReCon)

The ReCon paper addresses the characteristics and limitations of contrastive and generative modeling approaches in 3D representation learning. It aims to combine the strengths of both paradigms to improve the performance and scalability of 3D representations. The proposed method, called Contrast with Reconstruct (ReCon), uses ensemble distillation to learn from generative modeling teachers and single/cross-modal contrastive teachers. An encoder-decoder style ReCon-block is introduced to transfer knowledge through cross attention with stop-gradient, mitigating issues related to overfitting and pattern differences. ReCon reports state-of-the-art results in 3D representation learning, demonstrating high data efficiency and generalization in pretraining and downstream transfer tasks. An overview of the method and how it compares to traditional contrastive or generative masked modeling is shown in the figure below.

ReCon: Contrast with Reconstruct and how it compares to traditional contrastive or generative masked modeling. Source: Arxiv.

We leverage the strong generalization capabilities of ReCon and use the pre-trained backbone from the GitHub repository to generate embeddings of the whole Kitti dataset. Note that we could probably get even better results by finetuning the backbone on our dataset but this would go beyond the scope of this post.


To establish our baseline model, we randomly select 5% (185) of the point clouds from the training set. Alternatively, we could have used Lightly to preselect the initial 185 point clouds. However, our primary objective is to demonstrate how different selection strategies can enhance an existing dataset, so we keep the initial dataset as a random subset of 185 point clouds. We then add another 5% of the original training set and evaluate how the different selection strategies perform. In particular, we want to compare:

  • Random selection: Select samples uniformly at random.
  • Active learning: Use the objectness score of the prediction model to determine how certain or uncertain the neural network is about a given prediction. Add more samples where the model is uncertain.
  • Diversity-based selection: Here, we use Lightly’s diversity selection to get a diverse yet representative set of point clouds.
  • ProbCover: For completeness, we also compare against a strong method from the paper Active Learning Through a Covering Lens. It tries to achieve high coverage in the embedding space while ignoring outliers. We do a grid search over the parameter delta and only show the best result.
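The four strategies can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, the per-frame uncertainty (mean closeness of objectness scores to 0.5) is one possible aggregation that the post does not specify, the diversity function is a generic greedy farthest-point heuristic standing in for Lightly’s diversity selection, and the ProbCover function is a simplified greedy ball-coverage loop that builds a full pairwise distance matrix and therefore only suits small pools.

```python
import numpy as np


def select_random(num_pool, budget, seed=0):
    """Uniform random selection without replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_pool, size=budget, replace=False)


def select_uncertain(objectness_per_frame, budget):
    """Pick frames whose objectness scores are, on average, closest to 0.5.

    Assumption: frames without any detection get zero uncertainty.
    """
    def frame_uncertainty(scores):
        scores = np.asarray(scores, dtype=float)
        if scores.size == 0:
            return 0.0
        return float(np.mean(1.0 - 2.0 * np.abs(scores - 0.5)))

    uncertainty = np.array([frame_uncertainty(s) for s in objectness_per_frame])
    return np.argsort(-uncertainty)[:budget]


def select_diverse(embeddings, labeled_idx, budget):
    """Greedy farthest-point selection (a stand-in, not Lightly's algorithm)."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Distance from every sample to its closest already-chosen sample.
    min_dist = np.full(len(embeddings), np.inf)
    for idx in labeled_idx:
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    selected = []
    for _ in range(budget):
        candidate = int(np.argmax(min_dist))
        selected.append(candidate)
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[candidate], axis=1))
    return selected


def select_probcover(embeddings, labeled_idx, budget, delta):
    """Greedy coverage: repeatedly pick the sample whose delta-ball covers
    the most samples that are not covered yet."""
    embeddings = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    covers = dists <= delta  # covers[i, j]: the ball around i contains j
    covered = covers[np.asarray(list(labeled_idx), dtype=int)].any(axis=0)
    selected = []
    for _ in range(budget):
        gain = (covers & ~covered).sum(axis=1)
        # Never pick a sample that is already labeled or already selected.
        gain[np.asarray(list(labeled_idx) + selected, dtype=int)] = -1
        candidate = int(np.argmax(gain))
        selected.append(candidate)
        covered |= covers[candidate]
    return selected
```

The two embedding-based functions differ in how they treat outliers: farthest-point selection actively seeks the most distant samples, while the coverage objective favors dense regions, because a ball placed there covers more samples.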

For all subsequent experiments, we train the 3DSSD model with two different random seeds. Consequently, the presented plots also display the standard deviation. The Kitti 3D object detection evaluation defines three difficulty levels:

  • Easy: Only large bounding boxes and no occlusion.
  • Moderate: Only large and medium bounding boxes, some occlusion.
  • Hard: Includes small bounding boxes with a lot of occlusion.

As is custom in academia, we show mAP@70 at moderate difficulty.
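Since every configuration is run with two seeds, each point in the plots is a mean with a standard deviation. The sketch below shows that aggregation. The function name and the `ddof` default are assumptions, and the numbers in the usage lines are placeholders, not results from the experiments.

```python
import numpy as np


def summarize_runs(map_values, ddof=0):
    """Mean and standard deviation of the mAP values obtained with different seeds."""
    values = np.asarray(map_values, dtype=float)
    return float(values.mean()), float(values.std(ddof=ddof))


# Placeholder numbers for illustration only.
mean_map, std_map = summarize_runs([60.0, 64.0])
print(f"mAP: {mean_map:.1f} +/- {std_map:.1f}")
```

With only two runs, the choice of `ddof` matters: the sample standard deviation (`ddof=1`) is larger than the population value (`ddof=0`) by a factor of √2, so it is worth stating which one a plot shows.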

Performance of the 3DSSD model after finetuning on different subsets of the Kitti dataset. Lightly’s diversity algorithm and ProbCover outperform random selection. Active learning does not work on this data.

The presented plot depicts the outcomes of our experiments. As anticipated, augmenting the baseline with additional samples improves the performance of the model across all strategies. Nevertheless, notable differences arise when considering the quality of the added training samples:

  • Embedding-based methods surpass both random selection and active learning.
  • Lightly’s diversity approach yields the best results, closely followed by ProbCover.
  • Active learning underperforms when compared to random selection.

We hypothesize that the poor performance of active learning can be attributed to the high redundancy present in Kitti’s frames, causing active learning to select clusters of point clouds where the model exhibits uncertainty. To investigate this further, we generate a scatter plot contrasting the point cloud embeddings chosen by active learning and diversity selection. It is evident that in the case of active learning, the selected point clouds tend to be more clustered together.

Top: UMAP scatter plot of samples selected with active learning. Bottom: Scatter plot of samples selected by Lightly’s diversity selection. The top plot shows how active learning tends to select redundant samples.
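The visual impression of clustering can also be backed by a number. One simple option, which is an assumption on our side and not part of the original analysis, is the mean distance from each selected embedding to its nearest selected neighbor: a small value means that the selection contains many near-duplicates.

```python
import numpy as np


def mean_nearest_neighbor_distance(embeddings):
    """Average distance from each selected sample to its closest other selected sample."""
    embeddings = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore the distance of a sample to itself
    return float(dists.min(axis=1).mean())


# Synthetic toy data: a tight cluster versus a spread-out selection.
clustered = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0]]
spread = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_nearest_neighbor_distance(clustered) < mean_nearest_neighbor_distance(spread))
```

Computing this on the original embeddings rather than on the 2D UMAP projection avoids distortions introduced by the projection.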

Furthermore, we conduct an analysis of the labeled objects resulting from the different selection methods. This analysis provides valuable insights into the associated costs incurred alongside the improvement in accuracy.

The table below presents the counts of objects for the different selection strategies:

Left: Relative improvement of the different strategies over random selection. Diversity-based selection improves over random by almost 20%. Right: Number of objects that require annotation to improve the accuracy by 1% mAP. Lightly’s diversity selection requires the fewest annotated objects.

Notably, both random selection and diversity-based selection contribute a comparable number of objects to the dataset. However, because training on a diverse dataset yields a larger performance gain, the number of objects needed to improve the model by 1% mAP is significantly lower for diversity-based selection. ProbCover, on the other hand, tends to prioritize frames with a higher object density, which provides a robust training signal but increases annotation costs. The improved performance comes at a price! Lastly, we find an additional possible explanation for the subpar performance of active learning: it predominantly selects point clouds containing only a few objects.
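The two quantities in the figure can be written down as simple ratios. The definitions below are our reading of the captions and should be treated as assumptions: the annotation cost is the number of newly labeled objects divided by the mAP gain over the baseline (in percentage points), and the relative improvement compares the gain of a strategy with the gain of random selection. The function names are hypothetical and the numbers in the usage lines are placeholders, not measured values.

```python
def objects_per_map_point(num_added_objects, map_before, map_after):
    """Annotated objects needed per percentage point of mAP gained."""
    gain = map_after - map_before
    if gain <= 0:
        raise ValueError("no improvement, cost per mAP point is undefined")
    return num_added_objects / gain


def relative_improvement(gain_strategy, gain_random):
    """Gain of a strategy relative to the gain of random selection."""
    return (gain_strategy - gain_random) / gain_random


# Placeholder numbers for illustration only.
print(objects_per_map_point(1000, map_before=50.0, map_after=55.0))  # objects per mAP point
print(relative_improvement(gain_strategy=6.0, gain_random=5.0))      # fraction above random
```

Under these definitions, a strategy can look good on accuracy alone while being expensive per mAP point if it selects frames with many objects.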


In conclusion, our exploration of active learning combined with self-supervised learning representations for 3D object detection on point clouds has provided valuable insights. By using the rich representations from self-supervised learning to select training data, we improved detection performance over random selection. Embedding-based methods, particularly Lightly’s diversity approach and ProbCover, outperformed both random selection and uncertainty-based active learning. Active learning’s lower performance may be attributed to its tendency to select clusters of similar point clouds on which the model is uncertain, and to its preference for point clouds with few objects. The analysis of labeled objects revealed that a diverse dataset required fewer objects for a 1% improvement in mAP, while ProbCover prioritized high-density frames at increased annotation costs.

As next steps we want to investigate the effect of finetuning the self-supervised learning model on the target dataset, explore other measures for model uncertainty, and look into using more recent model architectures for 3D object detection. Feel free to reach out if you’re interested in the results, have feedback or ideas of your own!

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — 

Philipp Wirth
Machine Learning Engineer
