Customer Success Stories

How Lightly Helped AI Retailer Systems Achieve 90% of Peak mAP While Slashing Labeling Costs

Lightly helped AIRS eliminate redundancy in video frame datasets, enabling them to reach 90% of their top mAP score using just 20% of the training data.

Alejandro Garcia

CEO

Overview

Industry

Retail

Location

Bern, Switzerland

Employees

>100

Products

LightlyOne

Results

80%

Reduction in Annotation Costs

Use Case

Data curation for object detection in retail

About

Many interesting deep learning applications rely on complex architectures fueled by large datasets. With growing storage capacities and easier data collection processes [1], it takes little effort to build large datasets. However, doing so surfaces a new challenge: data redundancy. Many of these redundancies are introduced systematically by the data collection process, for instance as consecutive frames extracted from a video or as very similar images collected from the web. This blog post presents the results of a benchmark study showing the benefits of filtering redundant data with Lightly. The data was collected by AI Retailer Systems (AIRS), an innovative start-up developing a checkout-free solution for retailers. The study considers an object detection task: an intelligent vision system recognizes products on a shelf or in a customer’s hand.

Redundancies can take multiple forms, the simplest being exact image duplicates. Another form is near-duplicates, i.e., images shifted by a few pixels in some direction or images with slight lighting changes. Redundancies have also been observed in well-known academic datasets: CIFAR-10, CIFAR-100, and ImageNet. They not only bias estimates of a model’s performance, be it accuracy or mean average precision (mAP), but also lead to high annotation costs.

Figure: Short video sample extracted from AIRS video

The dataset provided by AIRS consists of images extracted from short videos capturing a customer grabbing different products. Two different cameras recorded videos of the shelf, each from a different angle, and 12 different kinds of products, i.e. 12 classes, were present.

Problem

The dataset was manually annotated using the open-source annotation tool Vatic. The annotation rate, i.e., the number of frames labeled per unit of time, was 2.3 ± 0.8 frames per minute, or roughly 26 seconds per frame. Given that each image contains 51 objects on average, this corresponds to about 0.51 seconds per bounding box.
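The per-box figure follows directly from the two numbers quoted above. A short check, using only the mean annotation rate and ignoring the ± 0.8 spread:

```python
# Figures quoted in the text above.
frames_per_minute = 2.3   # mean annotation rate
objects_per_frame = 51    # average number of bounding boxes per image

# Convert the rate into a time per frame, then spread it over the boxes.
seconds_per_frame = 60.0 / frames_per_minute
seconds_per_box = seconds_per_frame / objects_per_frame

print(f"{seconds_per_frame:.1f} s per frame")        # 26.1 s per frame
print(f"{seconds_per_box:.2f} s per bounding box")   # 0.51 s per bounding box
```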

The annotated dataset has 7909 images. The training dataset has 2899 images, of which 80% are from camera 2 and 20% from camera 1. The test dataset has 5010 images, all from camera 1.

This design of the train and test datasets follows from two goals. First, the training dataset is deliberately imbalanced, with a high fraction of images coming from one camera. Second, the object detection task is made hard for the model, since all test images come from the camera that is under-represented in training. With this train-test setting, we can calculate the fraction of images from camera 1 in the filtered data and thereby observe whether the different filtering methods introduce any re-balancing. The following section presents the methods used in this case study.

Testimonials

"I was truly amazed once we received the results of Lightly. We knew we had a lot of similar images due to our video feed but the results showed us how we can work more efficiently by selecting the right data"

Alejandro Garcia

CEO

Scalable and Efficient Data Curation using Lightly

Active learning and Sampling methods

To probe the effects of filtering the dataset, we borrowed ideas from the field of active learning.

Active learning aims at finding a subset of the training data that achieves the highest possible performance. In this study, we used a pool-based active learning loop, which works as follows. A small fraction of the training dataset, called the labeled pool, is the starting point, and the model is trained on it. Next, new data points to be labeled are selected using the model together with a filtering method. The newly selected samples are added to the labeled pool, and the model is trained from scratch on the updated pool. After each cycle, the model’s performance on the test dataset is reported for each filtering method. In our case, 5% of the training data was used as the initial labeled pool, the model was trained for 50 epochs, and 20% of the training data was added in each active learning cycle.

The object detection model used in this benchmark study is YOLOv3 (You Only Look Once) [4], using the implementation provided in the Ultralytics GitHub repository. The code was slightly modified to introduce the active learning loop.

Four filtering methods provided by Lightly were used:
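The pool-based loop described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: `active_learning_loop`, `random_select`, and `placeholder_train_and_evaluate` are hypothetical names, and the training step is a stand-in for training YOLOv3 for 50 epochs and measuring test mAP.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple


def active_learning_loop(
    n_train: int,
    select: Callable[[Set[int], List[int], int], Sequence[int]],
    train_and_evaluate: Callable[[Set[int]], float],
    initial_fraction: float = 0.05,   # 5% initial labeled pool (from the text)
    step_fraction: float = 0.20,      # 20% added per cycle (from the text)
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Return (fraction of training data labeled, score) for every cycle."""
    rng = random.Random(seed)
    all_ids = list(range(n_train))
    labeled = set(rng.sample(all_ids, max(1, round(initial_fraction * n_train))))
    step = max(1, round(step_fraction * n_train))
    history = []
    while True:
        # The model is retrained from scratch on the current labeled pool.
        score = train_and_evaluate(labeled)
        history.append((len(labeled) / n_train, score))
        unlabeled = [i for i in all_ids if i not in labeled]
        if not unlabeled:
            break
        budget = min(step, len(unlabeled))
        new_ids = [i for i in select(labeled, unlabeled, budget) if i not in labeled]
        if not new_ids:
            break  # the selector returned nothing new, so stop
        labeled.update(new_ids[:budget])
    return history


_rng = random.Random(1)


def random_select(labeled: Set[int], unlabeled: List[int], budget: int) -> List[int]:
    """Random sub-sampling baseline: ignores the model entirely."""
    return _rng.sample(unlabeled, budget)


def placeholder_train_and_evaluate(labeled: Set[int]) -> float:
    """Stand-in only: a real run would train the detector and return test mAP."""
    return 0.0


history = active_learning_loop(2899, random_select, placeholder_train_and_evaluate)
for fraction, _ in history:
    print(f"labeled pool: {fraction:.0%} of the training data")
```

With 2899 training images, this schedule visits roughly 5%, 25%, 45%, 65%, 85%, and 100% of the training data. Swapping `random_select` for a model-aware selector is the only change needed to compare filtering methods.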

“RSS”: Random sub-sampling, used as a baseline.
“WTL_unc”: Lightly’s uncertainty-based sub-sampling. It selects difficult images that the model is highly uncertain about. The uncertainty is assessed using the model’s predictions.
“WTL_CS”: This Lightly method uses image representations to select images that are both diverse and difficult, by combining uncertainty-based sub-sampling with diversity selection. The image representations are obtained with state-of-the-art self-supervised learning methods using the pip package Boris-ml. The advantage of self-supervised learning is that it requires no annotations to generate image representations.
“WTL_pt”: Relies on pre-trained models to obtain image representations. The filtering removes the most similar images, where similarity is given by the L2 distance between image representations.
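To make the idea of removing the most similar images by L2 distance concrete, here is a minimal sketch. It is an assumption about how such a filter could work, not Lightly’s implementation: `remove_most_similar` is a hypothetical helper, and the 2-D vectors stand in for image representations produced by a pre-trained model.

```python
import math
from typing import List, Sequence


def remove_most_similar(embeddings: Sequence[Sequence[float]], n_keep: int) -> List[int]:
    """Greedily drop one image of the closest pair until n_keep images remain.

    Returns the indices of the kept images. The brute-force search is only
    suitable for small examples.
    """
    keep = list(range(len(embeddings)))
    while len(keep) > max(n_keep, 1):
        best_dist = math.inf
        drop_pos = -1
        for a in range(len(keep)):
            for b in range(a + 1, len(keep)):
                # L2 (Euclidean) distance between two image representations.
                d = math.dist(embeddings[keep[a]], embeddings[keep[b]])
                if d < best_dist:
                    best_dist, drop_pos = d, b
        # Drop the later image of the most similar pair.
        keep.pop(drop_pos)
    return keep


# Toy representations: two near-duplicate pairs and one distinct image.
toy_embeddings = [
    [0.0, 0.0], [0.01, 0.0],   # near-duplicates
    [1.0, 0.0], [1.0, 0.02],   # near-duplicates
    [5.0, 5.0],                # distinct
]
kept = remove_most_similar(toy_embeddings, n_keep=3)
print(kept)  # [0, 2, 4]
```

One image from each near-duplicate pair is removed, and the distinct image survives. No labels and no task-specific model are involved.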

Both “WTL_unc” and “WTL_CS” use active learning, since they rely on the deep learning model to decide which data points to filter. In contrast, “WTL_pt” requires neither labels nor the trained detection model to filter the dataset. For curious readers, this article presents a comprehensive overview of different sampling strategies used in active learning.
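As a rough illustration of uncertainty-based selection, the sketch below ranks images by a least-confidence score computed from detection confidences. The scoring rule, the function names, and the file names are all assumptions made for illustration. The source does not specify how Lightly computes uncertainty.

```python
from typing import Dict, List


def image_uncertainty(confidences: List[float]) -> float:
    """Least-confidence score: 1 minus the weakest detection confidence."""
    if not confidences:
        # Assumption: an image with no detections counts as maximally uncertain.
        return 1.0
    return 1.0 - min(confidences)


def select_most_uncertain(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the `budget` image names with the highest uncertainty."""
    ranked = sorted(
        predictions,
        key=lambda name: image_uncertainty(predictions[name]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical per-image detection confidences from the current model.
predictions = {
    "frame_001.jpg": [0.98, 0.95],
    "frame_002.jpg": [0.55, 0.91],
    "frame_003.jpg": [0.70],
    "frame_004.jpg": [],
}
chosen = select_most_uncertain(predictions, budget=2)
print(chosen)  # ['frame_004.jpg', 'frame_002.jpg']
```

A purely uncertainty-driven selector like this one can pick many near-identical hard frames. This is one motivation for combining it with diversity selection, as “WTL_CS” does.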

Results

The results of the experiments are presented below.

We can see that the mAP score is low at small fractions of the training dataset. The mAP score saturates at only 25% of the training data, where it reaches a value of 0.8. Beyond the saturation point, it increases very slowly until it reaches its highest value of 0.84. Saturation at such a low fraction of the training dataset indicates that the dataset contains many redundancies.

Moreover, for small fractions, i.e., 5%, the “WTL_CS” filtering method is significantly better than the random baseline. For high fractions, i.e., 85%, “WTL_pt” matches the performance achieved with the full training dataset. The “WTL_unc” method is on par with or worse than random sub-sampling (“RSS”).

Given that saturation is reached at a small fraction of the training dataset, we performed a “zoom-in” experiment in which we evaluated the model’s performance using fractions of the training dataset between 5% and 25%. In this experiment, we dropped “WTL_unc” due to its poor performance.

In the results above, the subsets sampled with the “WTL_CS” and “WTL_pt” methods consistently outperform random sub-sampling. In addition, using only 20% of the training dataset, the “WTL_CS” sampling method achieves a mAP score of 0.80. In other words, we reach 90% of the highest mAP score using only 20% of the training dataset.

Why do “WTL_CS” and “WTL_pt” perform better than random sub-sampling “RSS”?

To answer this question, we made a simple comparison between the images selected with “RSS” and those selected with “WTL_CS” and “WTL_pt”. For this purpose, we computed the fraction of images from camera 1 in the selected samples for different fractions of the training dataset and for the different filtering methods. This comparison was done in both the normal and the zoom-in experiments. Note that in the training dataset, the original fraction of images from camera 1 is around 20%.

We can observe that “WTL_CS” and “WTL_pt” selected more samples from camera 1 and therefore re-balanced the sub-sampled training dataset. This explains the performance gain of these methods over random sub-sampling: since both select non-redundant data, they choose more images from camera 1, and the sub-sampled dataset is therefore more diverse.
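The camera 1 fraction used in this comparison is a simple ratio. A minimal sketch, assuming each image id can be mapped to the camera that recorded it (`camera_fraction`, `camera_of`, and the image ids are hypothetical names):

```python
from typing import Iterable, Mapping


def camera_fraction(
    selected_ids: Iterable[str],
    camera_of: Mapping[str, int],
    camera: int = 1,
) -> float:
    """Fraction of the selected images that were recorded by `camera`."""
    selected = list(selected_ids)
    if not selected:
        raise ValueError("no images selected")
    hits = sum(1 for image_id in selected if camera_of[image_id] == camera)
    return hits / len(selected)


# Toy pool mirroring the 20% / 80% camera split of the training dataset.
camera_of = {f"img_{i}": (1 if i < 2 else 2) for i in range(10)}

full_pool = camera_fraction(camera_of.keys(), camera_of)
subset = camera_fraction(["img_0", "img_1", "img_5", "img_7"], camera_of)
print(full_pool, subset)  # 0.2 0.5
```

A subset whose camera 1 fraction is well above the pool’s 20% has been re-balanced toward the under-represented camera.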
