Customer Success Stories

How Lightly Helped AI Retailer Systems Reach 90% of Their Top mAP While Slashing Labeling Costs

Lightly helped AIRS eliminate redundancy in video frame datasets, enabling them to reach 90% of their top mAP score using just 20% of the training data.


Industry
Retail
Location
Bern, Switzerland
Employees
>100

Products
LightlyOne
Results
80%
Reduction in Annotation Costs
Use Case
Data curation for object detection in retail

About

Many deep learning applications rely on complex architectures fueled by large datasets. With growing storage capacities and easier data collection processes [1], building large datasets requires little effort. However, a new challenge then surfaces: data redundancy. Many of these redundancies are introduced systematically by the data collection process itself, for instance as consecutive frames extracted from a video or near-identical images collected from the web. This case study presents the results of a benchmark showing the benefits of filtering redundant data with Lightly. The data was collected by AI Retailer Systems (AIRS), an innovative start-up developing a checkout-free solution for retailers. The task is object detection: an intelligent vision system recognizes products on a shelf or in a customer’s hand.


Problem

Redundancies can take multiple forms, the simplest being exact image duplicates. Another form is near-duplicates, i.e. images shifted by a few pixels in some direction or images with slight lighting changes. Redundancies have also been observed in well-known academic datasets such as CIFAR-10, CIFAR-100, and ImageNet. They not only bias the reported model performance, be it accuracy or mean average precision (mAP), but also drive up annotation costs.

Figure: Short video sample extracted from an AIRS video

The dataset provided by AIRS consists of images extracted from short videos capturing a customer grabbing different products. Two cameras recorded the shelf, each from a different angle, and 12 different kinds of products, i.e. 12 classes, were present.

The dataset was manually annotated using the open-source annotation tool Vatic. Its annotation rate, i.e. how many frames were labeled per unit of time, was 2.3 ± 0.8 frames per minute. Given that each image contains 51 objects on average, this corresponds to roughly 0.51 seconds per bounding box.
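As a quick sanity check, the per-box time follows directly from the annotation rate and the average object count quoted above (a minimal back-of-the-envelope calculation):

```python
# Back-of-the-envelope check of the annotation speed quoted above.
frames_per_minute = 2.3                      # Vatic annotation rate
boxes_per_frame = 51                         # average number of objects per image

seconds_per_frame = 60 / frames_per_minute   # ~26.1 s to annotate one frame
seconds_per_box = seconds_per_frame / boxes_per_frame
print(f"{seconds_per_box:.2f} s per bounding box")  # ~0.51 s
```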

Figure: Sample image from Camera 1 with annotations (note: the box color does not represent the article class)

The annotated dataset contains 7909 images. The training set has 2899 images, 80% of which come from Camera 2 and 20% from Camera 1. The test set has 5010 images, all of them from Camera 1.

Figure: Visualization of the train-test setting for the AIRS dataset

This specific design of the train and test datasets follows two goals: first, build an imbalanced training set in which a high fraction of images comes from one camera; second, make the object detection task harder for the model, since the test set consists entirely of Camera 1 images while training is dominated by Camera 2. With this train-test setting, we can measure the fraction of Camera 1 images in the filtered data and observe whether the different filtering methods introduce any re-balancing. The following section presents the methods used in this case study.

Testimonials

"I was truly amazed once we received the results of Lightly. We knew we had a lot of similar images due to our video feed but the results showed us how we can work more efficiently by selecting the right data"

Alejandro Garcia

CEO

Scalable and Efficient Data Curation using Lightly

Active Learning and Sampling Methods

To probe the effects of filtering the dataset, we borrowed ideas from the field of active learning.

Figure: Active learning loop used in this case study

Active learning aims to find a subset of the training data that achieves the highest possible performance. In this study, we used a pool-based active learning loop that works as follows: a small fraction of the training dataset, called the labeled pool, is the starting point. The model is trained on this labeled pool. Then, new data points to be labeled are selected using the model together with a filtering method. The newly selected samples are added to the labeled pool, and the model is retrained from scratch on the updated pool. After each cycle, the model’s performance on the test dataset is reported for each filtering method. In our case, 5% of the training data was used as the initial labeled pool, the model was trained for 50 epochs, and 20% of the training data was added in each active learning cycle.
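A minimal sketch of this loop is shown below. The callables `train_fn`, `score_fn`, and `eval_fn` are hypothetical stand-ins for the 50-epoch YOLOv3 training run, the filtering method, and the test-set mAP computation; the 5% / 20% fractions match the setup described above.

```python
import random

def active_learning_loop(all_indices, train_fn, score_fn, eval_fn,
                         initial_frac=0.05, step_frac=0.20, num_cycles=4, seed=0):
    """Pool-based active learning loop as described above.

    train_fn(labeled_indices) -> model            # e.g. YOLOv3 trained for 50 epochs
    score_fn(model, unlabeled_indices) -> scores  # higher = more worth labeling
    eval_fn(model) -> float                       # mAP on the Camera 1 test set
    """
    rng = random.Random(seed)
    pool = set(rng.sample(all_indices, int(initial_frac * len(all_indices))))
    map_history = []
    for _ in range(num_cycles):
        model = train_fn(sorted(pool))                  # retrain from scratch
        map_history.append(eval_fn(model))              # report test mAP
        unlabeled = [i for i in all_indices if i not in pool]
        scores = score_fn(model, unlabeled)             # filtering method decides
        n_new = int(step_frac * len(all_indices))
        ranked = sorted(zip(scores, unlabeled), reverse=True)
        pool.update(idx for _, idx in ranked[:n_new])   # grow the labeled pool
    return map_history

# Toy usage: random scores reproduce the "RSS" baseline selection behaviour.
history = active_learning_loop(
    all_indices=list(range(2899)),                      # size of the AIRS training set
    train_fn=lambda labeled: {"n_labeled": len(labeled)},
    score_fn=lambda model, unlabeled: [random.random() for _ in unlabeled],
    eval_fn=lambda model: 0.0,                          # placeholder metric
)
```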

The object detection model used in this benchmark study is YOLOv3 (You Only Look Once) [4], using the implementation provided by the Ultralytics GitHub repository. The code was slightly modified to introduce the active learning loop.
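One lightweight way to feed the labeled pool into that training code, assuming the common Darknet-style data configuration (a `.data` file whose `train=` entry points to a text file listing image paths), is to regenerate that list each cycle. The file names below are illustrative and not taken from the case study:

```python
from pathlib import Path

def write_labeled_pool(image_paths, pool_indices, txt_path="data/airs_train.txt"):
    """Write the current labeled pool as a newline-separated list of image paths.

    Assumes the training config points at this file (e.g. `train=data/airs_train.txt`);
    the path and naming convention are illustrative.
    """
    selected = [str(image_paths[i]) for i in sorted(pool_indices)]
    Path(txt_path).parent.mkdir(parents=True, exist_ok=True)
    Path(txt_path).write_text("\n".join(selected) + "\n")
```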

As for the filtering methods, four different strategies provided by Lightly were used:

  • “RSS”: Random sub-sampling, used as a baseline.
  • “WTL_unc”: Lightly’s uncertainty-based sub-sampling. It selects difficult images that the model is highly uncertain about; the uncertainty is estimated from the model’s predictions.
  • “WTL_CS”: This Lightly method uses image representations to select images that are both diverse and difficult, combining uncertainty-based sub-sampling with diversity selection. The image representations are obtained with state-of-the-art self-supervised learning methods via the pip package Boris-ml. The advantage of self-supervised learning is that it requires no annotations to generate image representations.
  • “WTL_pt”: Relies on pre-trained models to obtain image representations. Filtering is performed by removing the most similar images, where similarity is measured by the L2 distance between image representations.

Both Lightly methods “WTL_unc” and “WTL_CS” use active learning, since they rely on the deep learning model to decide which data points to filter. In contrast, the “WTL_pt” method requires neither labels nor a deep learning model to filter the dataset. For curious readers, this article presents a comprehensive overview of different sampling strategies used in active learning.
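As an illustration of the idea behind “WTL_pt”, the sketch below embeds images with an ImageNet-pretrained backbone and greedily drops images whose L2 distance to an already-kept image falls below a threshold. The backbone, threshold, and function names are illustrative stand-ins, not Lightly’s actual implementation:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

def embed_images(image_batch: torch.Tensor) -> torch.Tensor:
    """Embed a batch of already preprocessed images with a pretrained backbone."""
    backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()            # keep the 512-d features
    backbone.eval()
    with torch.no_grad():
        return backbone(image_batch)             # shape: (N, 512)

def drop_near_duplicates(embeddings: torch.Tensor, threshold: float = 0.5) -> list[int]:
    """Keep an image only if it is at least `threshold` (L2) away from all kept ones."""
    kept: list[int] = []
    for i in range(embeddings.size(0)):
        if all(torch.dist(embeddings[i], embeddings[j]).item() >= threshold for j in kept):
            kept.append(i)
    return kept
```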

Results

The results of the experiments are presented below.

Figure: Averaged mAP score for different fractions of the training dataset, using 4 seeds

The mAP score is low for small fractions of the training dataset. It saturates at a value of 0.8 when using only 25% of the training data; above the saturation point, it increases very slowly until it reaches its highest value of 0.84. Saturation at such a low fraction of the training data indicates that the dataset contains many redundancies.

Moreover, for small fractions, e.g. 5%, the “WTL_CS” filtering method is significantly better than the random baseline. For high fractions, e.g. 85%, “WTL_pt” reaches the same performance as training on the full dataset. The “WTL_unc” method is on par with or worse than the random sub-sampling baseline “RSS”.

Given that saturation is reached within a small fraction of the training dataset, a “zoom-in” experiment was performed, evaluating the model’s performance for fractions of the training dataset between 5% and 25%. In this experiment, we dropped the “WTL_unc” method due to its poor performance.

Figure: Results of the zoom-in experiment (5% to 25% of the training data)

The results above show that the subsets sampled with the “WTL_CS” and “WTL_pt” methods consistently outperform random sub-sampling. In addition, using only 20% of the training dataset, the “WTL_CS” sampling method reaches a mAP score of 0.80, i.e. over 90% of the highest mAP score.


Why do “WTL_CS” and “WTL_pt” perform better than random sub-sampling “RSS”?

To answer this question, a simple comparison was made between the images selected with the “RSS” method and those selected with “WTL_CS” and “WTL_pt”. For this purpose, we computed the fraction of Camera 1 images in the selected samples for different fractions of the training dataset and for the different filtering methods. This comparison was done for both the normal and the zoom-in experiments. Note that in the training dataset, the original fraction of Camera 1 images is around 20%.
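Computing this re-balancing statistic is straightforward once the camera of origin can be read from each selected image. A small sketch, assuming the source camera is encoded in the file path (this naming convention is illustrative, not taken from the AIRS dataset):

```python
def camera1_fraction(selected_paths):
    """Fraction of Camera 1 images in a selected subset.

    Assumes the source camera is encoded in the file path, e.g. ".../cam1/...";
    this convention is illustrative.
    """
    cam1 = sum(1 for p in selected_paths if "cam1" in str(p))
    return cam1 / len(selected_paths)

# On the full training set this should come out at roughly 0.20.
```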

Figure: Fraction of Camera 1 images in the sampled subset as a function of the fraction of the training dataset
Figure: Zoom-in experiment: fraction of Camera 1 images in the sampled subset as a function of the fraction of the training dataset

The “WTL_CS” and “WTL_pt” sampling methods selected more samples from Camera 1 and therefore re-balanced the sub-sampled training dataset. This explains the performance gain over random sub-sampling: because both methods select non-redundant data, they pick more images from Camera 1, and the sub-sampled dataset is more diverse.

