Data Selection

Submit your Dataset

You can filter your dataset with our web app, our command-line interface, or the Docker container. The command-line interface is especially handy if you already train your deep learning models on a cloud server.

Select Parameters

We allow you to optimize your dataset for various tasks. The command-line interface, as well as the web application, allows for coarse optimization for classification, object detection, segmentation, and GANs. More fine-grained control can be achieved with the docker container.

Behind the Scenes

After you submit your dataset with your preferred parameters, our data filtering software analyzes it. Lightly automatically removes corrupt files and rebalances the dataset on a feature level. Depending on your filter preference, near-duplicates are removed or a new dataset is created from the most important samples. We share more details about how exactly we filter datasets in this blog post.
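To make the idea of near-duplicate removal concrete, here is a minimal sketch in plain Python. It is not Lightly's actual filtering algorithm, which this page does not describe. It assumes each sample is already represented by an embedding vector, and the function names and the 0.95 similarity threshold are hypothetical choices for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(embeddings, threshold=0.95):
    """Return the indices of the samples to keep.

    A sample is dropped when its cosine similarity to any already-kept
    sample is at or above `threshold` (a hypothetical default).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector is almost identical to the first.
embeddings = [
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
]
kept_indices = remove_near_duplicates(embeddings)
print(kept_indices)  # [0, 2]
```

The pairwise comparison is quadratic in the number of samples, so a sketch like this only suits small datasets; large datasets would need an approximate nearest-neighbor index.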

After Filtering

You will be able to either download a list of final filenames or a clean dataset. Additionally, we provide you with a report showing you more details about how Lightly processed your data.
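If you download only the list of final filenames, you can build the clean dataset locally. The sketch below assumes the list is a plain text file with one relative filename per line; that format, the function name `copy_selected_files`, and the folder names are assumptions for illustration, not a documented Lightly format.

```python
import shutil
import tempfile
from pathlib import Path


def copy_selected_files(filename_list, input_dir, output_dir):
    """Copy every file named in `filename_list` from input_dir to output_dir.

    Assumes one relative filename per line. Entries that do not exist in
    input_dir are skipped. Returns the names that were copied.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    copied = []
    for line in Path(filename_list).read_text().splitlines():
        name = line.strip()
        if not name:
            continue
        source = input_dir / name
        if not source.is_file():
            continue
        target = output_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        copied.append(name)
    return copied


# Demo on a throwaway directory with three fake images, two of them selected.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cat").mkdir()
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        (root / "cat" / name).write_bytes(b"fake image data")
    (root / "filenames.txt").write_text("a.jpg\nc.jpg\n")
    copied = copy_selected_files(
        root / "filenames.txt", root / "cat", root / "cat_curated"
    )
    remaining = sorted(p.name for p in (root / "cat_curated").iterdir())

print(copied, remaining)
```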

Read our whitepaper to learn more about Lightly

Easy data preparation based on your needs

We have the right solution for every amount of data. Use our on-premise or web app solution together with our pip package to analyze and filter your first dataset within minutes.

You can try out our limited free version with no payment required!

Lightly also offers an easy-to-use interface. The following lines show how the package can be used to train a model with self-supervision and create embeddings with only three lines of code:

from lightly import train_embedding_model, embed_images

# first the model is trained for 10 epochs
checkpoint = train_embedding_model(input_dir=
           trainer={'max_epochs': 10})

# embedding 'cats' using the trained model
embeddings, labels, filenames = embed_images(input_dir='./my

# Inspecting the shape of the embeddings

The Lightly framework provides a command-line interface (CLI) to train self-supervised models and create embeddings without having to write a single line of code:

# upload only the dataset
lightly-upload input_dir=cat token=your_token

# the dataset can be uploaded together with the embedding
lightly-upload input_dir=cat embedding=your_embedding.csv \
                 token=your_token dataset_id=your_dataset_id

# download the dataset
# download a list of files
lightly-download tag_name=my_tag_name \
                 dataset_id=your_dataset_id token=your_token

# copy files in a tag to a new folder
lightly-download tag_name=my_tag_name \
                 dataset_id=your_dataset_id token=your_token \
                 input_dir=cat output_dir=cat_curate

The following example uses the Docker container to analyze and filter the well-known ImageNet dataset. The sample report can be replicated with this command:

docker run --gpus all --rm -it \
                  -v /datasets/imagenet/train/:/home/input_dir:ro \
                  -v /datasets/docker_imagenet_500k:/home/output_dir \
                  --ipc="host" \
                  lightly/sampling:latest \
                  token=MYAWESOMETOKEN \
                  lightly.collate.input_size=64 \
                  lightly.loader.batch_size=256 \
                  lightly.loader.num_workers=8 \
                  lightly.trainer.max_epochs=0 \
                  stopping_condition.n_samples=500000 \
                  remove_exact_duplicates=True

Our Technology

Lightly makes use of representation learning through self-supervised methods to understand raw data. It can therefore be used before any data annotation step. The learned representations can be used to analyze and visualize your datasets as well as for selecting a core set of samples that can then be used for further steps in the data preparation pipeline. The same algorithms power our active-learning library to help you with iterative active-learning loops.
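One common way to select a core set from learned representations is greedy farthest-point (k-center) selection. The sketch below illustrates that general technique under stated assumptions; it is not presented as the method Lightly uses. The function names and the toy embeddings are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_core_set(embeddings, n_samples, start_index=0):
    """Greedy farthest-point selection.

    Repeatedly picks the sample that is farthest from everything selected
    so far, so the chosen samples spread out over the embedding space.
    Returns the selected indices in the order they were picked.
    """
    if not embeddings or n_samples <= 0:
        return []
    n_samples = min(n_samples, len(embeddings))
    selected = [start_index]
    chosen = {start_index}
    # Distance from every sample to its nearest selected sample.
    min_dist = [euclidean(e, embeddings[start_index]) for e in embeddings]
    while len(selected) < n_samples:
        candidates = (i for i in range(len(embeddings)) if i not in chosen)
        next_index = max(candidates, key=lambda i: min_dist[i])
        selected.append(next_index)
        chosen.add(next_index)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], euclidean(e, embeddings[next_index]))
    return selected


# Three loose groups of points: two near the origin, two near x=5, one at y=4.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [0.0, 4.0]]
core = select_core_set(embeddings, n_samples=3)
print(core)  # [0, 3, 4]
```

With three samples requested, the selection takes one point from each group and skips the two points that sit right next to an already selected one.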

Our Platform

Preparing and organizing data for machine learning has never been easier. With our platform, anyone can become a data preparation engineer. Visual feedback helps you understand which samples are in your dataset and which have been removed. Keep track of different dataset versions using tags. Collaborate with your team and share the final datasets with the ML engineers who train and evaluate your models.


Our data preparation platform gives you unique insights into your raw data. Our analytics give you statistics on your dataset and provide graphs such as histograms.

A gif showing how to explore and sort samples in your dataset using Lightly's UI
A gif of embedding selection in action

You can also explore your raw data with our automatically generated dimensionality-reduction projections, such as UMAP, PCA, and t-SNE, which let you visually evaluate and select your data.
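To show what such a projection does, the following sketch computes a two-dimensional PCA projection in plain Python, using power iteration with deflation to find the first two principal components. It illustrates PCA in general and makes no claim about how the platform computes its projections. The helper names and the toy data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]


def top_eigenvector(matrix, iterations=500):
    """Power iteration: dominant eigenvector of a symmetric PSD matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d  # fixed start vector keeps the sketch deterministic
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(data):
    """Project rows of `data` onto their first two principal components."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    pc1 = top_eigenvector(cov)
    eigenvalue1 = dot(pc1, mat_vec(cov, pc1))
    # Deflation: remove the first component so power iteration finds the second.
    deflated = [[cov[i][j] - eigenvalue1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    return [[dot(r, pc1), dot(r, pc2)] for r in centered]


# Toy 3-D data that varies mostly along the x=y diagonal, with small z noise.
data = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.1],
    [2.0, 2.0, -0.1],
    [3.0, 3.0, 0.2],
    [4.0, 4.0, -0.2],
]
projected = pca_2d(data)
variances = [
    sum(p[k] ** 2 for p in projected) / (len(projected) - 1) for k in range(2)
]
print(projected)
print(variances)
```

Almost all of the variance ends up in the first projected coordinate, which is why a 2-D scatter plot of such a projection can still reveal the main structure of higher-dimensional embeddings.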

With every filtering run, you also receive an automatically generated analytics report that gives you an overview of the most important statistics and figures. It also includes sample images of kept and removed samples for visual quality inspection. You can download a sample report (PDF) below.

Download an example analytics report for MS COCO
Download an example analytics report for ImageNet

Data Management & Collaboration

Our unique Smart Data Pool product comes with an in-house collaboration feature. It allows all of your teams to contribute to the company data warehouse. It follows a simple three-step process (see illustration below).

  1. Teams send data to the filter solution
  2. Filtered data is added to the raw data company pool
  3. Our Lightly data governance system checks whether the samples are already in the database

After the process, the new samples are added to the "warm"/"on-cloud" storage. The complete raw data can still be kept in "cold" storage if desired.
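The check for samples that are already in the database can be pictured as a lookup by content hash. The sketch below is a hypothetical in-memory stand-in, not the Lightly data governance system: the `DataPool` class and the file names are invented for illustration, and a SHA-256 hash only catches byte-identical copies, not near-duplicates.

```python
import hashlib


class DataPool:
    """Hypothetical in-memory stand-in for a shared company data pool."""

    def __init__(self):
        # Maps a content hash to the filename of the first copy seen.
        self._samples = {}

    def add(self, filename, content):
        """Add a sample; return True if it was new, False if already present."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._samples:
            return False
        self._samples[digest] = filename
        return True

    def __len__(self):
        return len(self._samples)


pool = DataPool()
results = [
    pool.add("team_a/cat_001.jpg", b"image-bytes-1"),
    pool.add("team_b/cat_copy.jpg", b"image-bytes-1"),  # same bytes, other team
    pool.add("team_b/dog_001.jpg", b"image-bytes-2"),
]
print(results, len(pool))  # [True, False, True] 2
```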

A graphic showing how Lightly's Smart Data Pool makes collaboration easy
Improve your data
Today is the day to get the most out of your data. Share our mission with the world — unleash your data's true potential.
Contact us