Scientific Dataset Studies

Learn more about how the Lightly subsampling method compares against random subsampling on well-known academic datasets.

Note: We only run the training data through our data selection solution; the test set stays the same. We do not recommend doing this in practice, since the train and test sets should have a similar distribution to properly evaluate an ML model. Additionally, these datasets went through a manual cleaning procedure to balance them. On customer data, we see a much stronger impact: typically, training on the 50% of the data selected by Lightly reaches the same test accuracy as training on the full dataset.
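As a rough illustration of the random baseline in this comparison, the sketch below draws a fixed fraction of the training set at random while leaving the test set untouched. The function name `random_subsample`, the file names, and the seed are assumptions made for illustration; this is not Lightly's selection method or API.

```python
import random


def random_subsample(train_items, fraction, seed=0):
    """Return a random subset of the training items (the baseline selection)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = max(1, round(len(train_items) * fraction))
    return rng.sample(train_items, k)  # sampling without replacement


# Hypothetical file names standing in for a real dataset.
train_files = [f"train_{i:04d}.png" for i in range(1000)]
test_files = [f"test_{i:04d}.png" for i in range(200)]

subset = random_subsample(train_files, fraction=0.5)

# Only the training set shrinks; the test set is evaluated unchanged.
print(len(subset), len(test_files))
```

A model trained on `subset` and one trained on a same-sized subset chosen by a selection method can then be scored on the identical `test_files`, which is the kind of comparison described in the note.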


KITTI is a well-known object detection dataset for autonomous driving.


CamVid, released in 2007, is one of the first image segmentation datasets for autonomous driving. It contains a little over 700 images.


CIFAR-10 is a well-known image classification dataset consisting of 10 classes.


Cityscapes was released in 2016 and is commonly used for benchmarking segmentation models in autonomous driving. It consists of 5,000 images.
The complete dataset study is available for download.