How to do active learning on 3D point clouds using self-supervised learning
In the ever-evolving field of computer vision, the quest for accurate and efficient 3D object detection on point clouds remains a prominent challenge. Traditional approaches rely heavily on manually labeled datasets, which are time-consuming and resource-intensive to obtain. However, a solution has emerged: active learning combined with representations learned through self-supervised learning.
Active learning empowers models to actively select the most informative data samples for annotation, significantly reducing the annotation effort while maintaining or even improving detection performance. Meanwhile, self-supervised learning allows models to learn powerful representations from unlabeled data, enabling them to capture diverse and meaningful features from raw data. In recent years, with the rise of transformer architectures in computer vision, self-supervised learning on point clouds has gained popularity.
In this Medium blog post, we explore active learning for 3D bounding box detection on point clouds and show how it can leverage self-supervised representations to construct a diverse and informative dataset.
For the purpose of this blog post, we'll use the KITTI 3D Object Detection Evaluation dataset. It consists of 7,481 training images and 7,518 test images, along with their corresponding point clouds, containing a total of 80,256 labeled objects. For evaluation, precision-recall curves are computed and methods are ranked by average precision, following the PASCAL criteria used in 2D object detection. In this blog post, we focus on 3D car detection.
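For intuition, the snippet below sketches PASCAL-style average precision computed from a precision-recall curve, with precision interpolated at fixed recall positions as KITTI does (the current benchmark uses 40 recall positions). The function and the toy inputs are our own illustration, not the official evaluation code.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt, num_recall_points=40):
    """PASCAL-style AP: interpolate precision at fixed recall positions.

    scores: detection confidences; is_true_positive: whether each detection
    matched a ground-truth box (e.g. 3D IoU >= 0.7 for KITTI cars).
    """
    order = np.argsort(-np.asarray(scores))  # sort detections by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    # Interpolated precision: max precision at any recall >= r.
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_recall_points):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / num_recall_points

# Toy example: 4 detections, 3 ground-truth boxes.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=3))
```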
3DSSD (3D Single Stage Object Detector) is a point-based deep learning model for 3D object detection that operates directly on raw point clouds. Unlike two-stage detectors, it drops the feature-upsampling and refinement stages and instead relies on a fusion sampling strategy that retains enough foreground points for accurate box prediction in a single stage. The figure below shows an overview of the model.
Although 3DSSD is a somewhat older model, we chose it for its memory efficiency, ease of use, and ability to work directly with raw point clouds. For all experiments we used mmdetection3d with the default settings for KITTI, as sketched below.
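In practice this boils down to pointing mmdetection3d's training script at its 3DSSD KITTI config. The exact config filename depends on your mmdetection3d version, so treat the path below as an assumption.

```python
import subprocess

# Launch 3DSSD training on KITTI (car class) from the mmdetection3d repo root.
# The config path is version-dependent; adjust it to match your checkout.
CONFIG = "configs/3dssd/3dssd_4x4_kitti-3d-car.py"

subprocess.run(["python", "tools/train.py", CONFIG], check=True)
```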
The ReCon paper addresses the characteristics and limitations of contrastive and generative modeling approaches in 3D representation learning. It aims to combine the strengths of both paradigms to improve the performance and scalability of 3D representations. The proposed method, called Contrast with Reconstruct (ReCon), utilizes ensemble distillation to learn from generative modeling teachers and single/cross-modal contrastive teachers. An encoder-decoder style ReCon-block is introduced to transfer knowledge through cross attention with stop-gradient, mitigating issues related to overfitting and pattern differences. ReCon achieves state-of-the-art results in 3D representation learning, demonstrating high data efficiency and strong generalization in pretraining and downstream transfer tasks. An overview of the method and how it compares to traditional contrastive or generative masked modeling is shown in the figure below.
We leverage the strong generalization capabilities of ReCon and use the pre-trained backbone from the GitHub repository to generate embeddings for the whole KITTI dataset. Note that we could probably get even better results by fine-tuning the backbone on our dataset, but this would go beyond the scope of this post.
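The embedding step can be sketched as follows. The encoder placeholder stands in for the pre-trained ReCon backbone (its loading code is repo-specific and omitted here), and the subsampling to 1,024 points per cloud is our assumption.

```python
from pathlib import Path

import numpy as np
import torch

def load_kitti_scan(bin_path, num_points=1024):
    """Read a KITTI velodyne .bin file (x, y, z, reflectance) and subsample it."""
    points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)[:, :3]
    idx = np.random.choice(len(points), num_points, replace=len(points) < num_points)
    return torch.from_numpy(points[idx])

# Placeholder: swap in the pre-trained ReCon backbone from its repository.
# The real encoder maps a (B, N, 3) point cloud to a (B, D) global embedding.
encoder = torch.nn.Identity()
encoder.eval()

embeddings = []
with torch.no_grad():
    for bin_path in sorted(Path("data/kitti/training/velodyne").glob("*.bin")):
        cloud = load_kitti_scan(bin_path).unsqueeze(0)  # shape (1, N, 3)
        embeddings.append(encoder(cloud).squeeze(0).cpu().numpy())

np.save("kitti_embeddings.npy", np.stack(embeddings))
```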
To establish our baseline model, we randomly select 5% (185) of the point clouds from the training set. Alternatively, we could have used LightlyOne to preselect the initial 185 point clouds. However, our primary objective is to demonstrate how different selection strategies can enhance an existing dataset, so we keep the initial dataset as a random subset of 185 point clouds. We then add another 5% of the original training set and compare the following selection strategies:
- Random selection: uniformly sample additional point clouds as a baseline.
- Active learning: select the point clouds on which the current model is most uncertain.
- Diversity selection: use LightlyOne to pick point clouds whose ReCon embeddings are maximally diverse (a sketch of this idea follows the list).
- ProbCover: an embedding-based strategy that greedily maximizes coverage of the embedding space.
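To make the diversity strategy concrete, here is a minimal sketch of greedy farthest-point (k-center) selection on the embeddings. It captures the spirit of an embedding-based diversity strategy, though LightlyOne's production implementation may differ in its details.

```python
import numpy as np

def select_diverse(embeddings, already_selected, num_to_select):
    """Greedy k-center selection: repeatedly pick the sample farthest from
    the current selection so the chosen samples spread over embedding space."""
    dists = np.full(len(embeddings), np.inf)
    for idx in already_selected:
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    new_indices = []
    for _ in range(num_to_select):
        best = int(np.argmax(dists))  # farthest from everything selected so far
        new_indices.append(best)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[best], axis=1))
    return new_indices

# Example: grow the random 185-cloud baseline by another 185 diverse clouds,
# assuming (num_clouds, dim) embeddings from the embedding step above.
embeddings = np.load("kitti_embeddings.npy")
baseline = np.random.default_rng(0).choice(len(embeddings), 185, replace=False)
extra = select_diverse(embeddings, baseline, 185)
```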
For all subsequent experiments, we train two distinct seeds of the 3DSSD model; consequently, the presented plots also display the standard deviation for additional insight. The KITTI 3D object detection benchmark defines three difficulty levels:
- Easy: fully visible objects, minimum bounding box height of 40 px, at most 15% truncation.
- Moderate: partly occluded objects, minimum bounding box height of 25 px, at most 30% truncation.
- Hard: heavily occluded objects, minimum bounding box height of 25 px, at most 50% truncation.
Following common practice in academia, we report mAP at an IoU threshold of 0.7 (mAP@70) on the moderate difficulty level.
The plot below depicts the outcomes of our experiments. As anticipated, augmenting the baseline with additional samples improves the performance of the model across all strategies. Nevertheless, notable differences arise when considering the quality of the added training samples:
- The embedding-based strategies, diversity selection and ProbCover, clearly outperform random selection.
- Active learning performs surprisingly poorly and trails the other strategies.
We hypothesize that the poor performance of active learning can be attributed to the high redundancy among KITTI's frames, which causes active learning to select clusters of similar point clouds on which the model is uncertain. To investigate this further, we generate a scatter plot contrasting the point cloud embeddings chosen by active learning with those chosen by diversity selection. It is evident that the point clouds selected by active learning tend to be more clustered together.
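For intuition, one common way to turn a detector's box confidences into a frame-level uncertainty score is the mean binary entropy of the predicted scores. The sketch below is illustrative, not the exact scoring used in our experiments.

```python
import numpy as np

def frame_uncertainty(box_scores, eps=1e-7):
    """Mean binary entropy of the predicted box confidences in one frame.

    Scores near 0.5 (the model is unsure) yield high entropy, so frames full
    of ambiguous detections receive a high uncertainty score."""
    p = np.clip(np.asarray(box_scores, dtype=float), eps, 1 - eps)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return float(entropy.mean()) if len(p) else 0.0

# Rank unlabeled frames by uncertainty and annotate the top ones.
predictions = {"frame_000123": [0.55, 0.48, 0.91], "frame_000456": [0.98, 0.97]}
ranked = sorted(predictions, key=lambda f: frame_uncertainty(predictions[f]), reverse=True)
print(ranked)  # most uncertain frames first
```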
Furthermore, we analyze the labeled objects resulting from the different selection methods. This analysis provides valuable insight into the annotation costs incurred alongside the improvement in accuracy.
The table below presents the counts of objects for the different selection strategies:
Notably, both random selection and diversity-based selection contribute a comparable number of objects to the dataset. However, because training on a diverse dataset yields better performance, the number of objects needed to improve the model by 1% mAP is significantly lower. ProbCover, on the other hand, tends to prioritize frames with a higher object density, providing a strong training signal but at the expense of increased annotation costs. The improved performance comes at a price! Lastly, we find an alternative explanation for the subpar performance of active learning: it predominantly selects point clouds containing few objects.
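For reference, ProbCover's selection rule can be sketched as a greedy maximum-coverage problem: a sample covers every sample within a radius r in embedding space, and each step picks the sample covering the most still-uncovered samples. The radius below is an assumed value; in practice r must be tuned for the dataset.

```python
import numpy as np

def probcover_select(embeddings, num_to_select, radius):
    """Greedy max-coverage over r-balls in embedding space (ProbCover-style)."""
    # cover[i, j] is True if sample j lies within `radius` of sample i.
    sq = (embeddings ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    cover = d2 < radius ** 2
    uncovered = np.ones(len(embeddings), dtype=bool)
    selected = []
    for _ in range(num_to_select):
        # Pick the sample that covers the most still-uncovered samples.
        gains = (cover & uncovered).sum(axis=1)
        best = int(np.argmax(gains))
        selected.append(best)
        uncovered &= ~cover[best]
    return selected

# Example with an assumed radius, reusing the embeddings computed earlier.
picked = probcover_select(np.load("kitti_embeddings.npy"), 185, radius=0.3)
```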
In conclusion, our exploration of active learning combined with self-supervised learning representations for 3D object detection on point clouds has provided valuable insights. By leveraging self-supervised learning's rich representations and intelligent data selection through active learning, we improved detection performance. Embedding-based methods, particularly LightlyOne's diversity approach and ProbCover, outperformed random selection and active learning. Active learning's lower performance may be attributed to selecting clusters of point clouds with uncertainty. The analysis of labeled objects revealed that a diverse dataset required fewer objects for a 1% improvement in mAP, while ProbCover prioritized high-density frames at increased annotation costs.
As next steps, we want to investigate the effect of fine-tuning the self-supervised learning model on the target dataset, explore other measures for model uncertainty, and look into using more recent model architectures for 3D object detection. Feel free to reach out if you're interested in the results, have feedback or ideas of your own!
Philipp Wirth
Machine Learning Engineer
lightly.ai