Data Curation for Machine Learning

Improve your machine learning models by using the right data

Explore and curate unlabelled data

Use our self-supervised learning PIP package to power Lightly. Improve data quality and achieve better model accuracy. High-quality data is more important than big data.

Find data redundancy and bias

Find and remove redundancy and bias introduced by the data collection process to reduce overfitting and improve ML model generalization.
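Lightly's actual deduplication method is not shown here, but as a rough illustration of the idea, the sketch below greedily drops near-duplicate samples whose embedding cosine similarity to an already kept sample exceeds a threshold (the function name and the 0.99 threshold are illustrative assumptions, not part of the Lightly API):

```python
import numpy as np

def remove_near_duplicates(embeddings: np.ndarray, threshold: float = 0.99):
    """Greedily keep only samples whose cosine similarity to every
    previously kept sample stays below `threshold`."""
    # normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

# two near-identical vectors and one distinct vector
emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(remove_near_duplicates(emb))  # keeps indices [0, 2]
```

The greedy pass keeps the first occurrence of each cluster of near-duplicates, which is the behavior you typically want when consecutive video frames dominate a dataset.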

10x more efficient

Save money on your data-related costs by removing redundancies

Increased accuracy

Reduce overfitting and improve generalization by diversifying your dataset

Manage everything in one place

Understand your data within minutes after collection and before any data labeling.
We use self-supervised learning combined with active learning to accelerate your data preparation pipeline.

Data Selection

Most companies only use between 0.1% and 10% of their data for machine learning. Use our state-of-the-art methods to select the most relevant samples. Let Lightly handle the selection of the data for you while you focus on the training process.
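As an illustrative (not Lightly-specific) example of diversity-driven selection, the sketch below implements greedy farthest-point sampling over embeddings: it repeatedly picks the sample farthest from everything selected so far, so the chosen subset covers the dataset as evenly as possible. All names here are assumptions for the sketch:

```python
import numpy as np

def farthest_point_sampling(embeddings: np.ndarray, n_samples: int):
    """Greedy k-center selection: repeatedly add the point farthest
    from the already selected set, maximizing subset diversity."""
    selected = [0]  # start from an arbitrary point
    # distance of every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < n_samples:
        idx = int(np.argmax(dists))  # farthest remaining point
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))   # stand-in for real image embeddings
subset = farthest_point_sampling(emb, 100)
print(len(subset))  # 100
```

Because each step only updates a running minimum-distance array, the whole selection costs O(n · k) distance computations, which stays practical even for large raw datasets.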

Smart Data Pool

Keep track of the data your team is working on. Our algorithms help you add only relevant data to the existing pool. We only store non-sensitive meta-information on our servers, so you don't have to worry about transfer costs or privacy issues.

Data Analytics

Use our deep data analytics framework to analyze your raw datasets. Get insights about the distribution, diversity, and other key metrics. Find dataset bias before training and evaluating your model.
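As a sketch of the kind of statistics such an analytics step can report (these are generic embedding metrics, not Lightly's actual analytics), one can estimate dataset diversity and per-dimension spread directly from the embeddings; the function name and sampling scheme below are assumptions:

```python
import numpy as np

def embedding_stats(embeddings: np.ndarray, n_pairs: int = 10000, seed: int = 0):
    """Estimate simple dataset-level metrics from image embeddings:
    mean pairwise distance (diversity) and per-dimension std (spread)."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    # sample random pairs instead of computing all n*(n-1)/2 distances
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    pair_dist = np.linalg.norm(embeddings[i] - embeddings[j], axis=1)
    return {
        "mean_pairwise_distance": float(pair_dist.mean()),
        "per_dim_std": embeddings.std(axis=0),
    }

emb = np.random.default_rng(1).normal(size=(500, 16))
stats = embedding_stats(emb)
print(stats["mean_pairwise_distance"] > 0)  # True
```

A dimension with near-zero standard deviation, or a mean pairwise distance that shrinks as you add data, are both quick signals that a collection process is producing redundant samples.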

Use Cases


Autonomous Vehicles

Make your vehicle autonomous on the street, at sea, or in the air.


Shipping, Logistics, Airline, Defense & Military


Visual Inspection

Detect defects in infrastructure, manufactured products, or find infected plants.


Railways & Roads, Infrastructure, Manufacturing, Agriculture, Surveillance & Security


Medical Imaging

Find abnormalities in medical images such as X-rays, MRIs, microscope & medical scans.


Health/Life Science, Biotechnology, and Digital Diagnostics/Pathology


Space Data

Improve space products and achieve better results.


Satellite Imaging, Visual Inspection for Space Components, Autonomous Systems


Easy data preparation based on your needs

We have the right solution for every amount of data. Use our on-premise or web app solution together with our PIP package to analyze and filter your first dataset within minutes.

You can try out our limited free version with no payment required!

Lightly also offers an easy-to-use interface. The following lines show how the package can be used to train a model with self-supervision and create embeddings with only three lines of code:

from lightly import train_embedding_model, embed_images

# first, the model is trained for 10 epochs
checkpoint = train_embedding_model(
    input_dir='./my/cute/cats/dataset/',
    trainer={'max_epochs': 10},
)

# embed 'cats' using the trained model
embeddings, labels, filenames = embed_images(
    input_dir='./my/cute/cats/dataset/',
    checkpoint=checkpoint,
)

# inspect the shape of the embeddings
print(embeddings.shape)

The Lightly framework provides a command-line interface (CLI) to train self-supervised models and create embeddings without having to write a single line of code.

# upload only the dataset
lightly-upload input_dir=cat token=your_token

# the dataset can be uploaded together with the embedding
lightly-upload input_dir=cat embedding=your_embedding.csv \
               token=your_token dataset_id=your_dataset_id

# download the list of files in a tag
lightly-download tag_name=my_tag_name \
                 dataset_id=your_dataset_id token=your_token

# copy files in a tag to a new folder
lightly-download tag_name=my_tag_name \
                 dataset_id=your_dataset_id token=your_token \
                 input_dir=cat output_dir=cat_curate

Example of using the Docker container to analyze and filter the famous ImageNet dataset. The sample report can be reproduced with the following command:

docker run --gpus all --rm -it \
                  -v /datasets/imagenet/train/:/home/input_dir:ro \
                  -v /datasets/docker_imagenet_500k:/home/output_dir \
                  --ipc="host" \
                  lightly/sampling:latest \
                  token=MYAWESOMETOKEN \
                  lightly.collate.input_size=64 \
                  lightly.loader.batch_size=256 \
                  lightly.loader.num_workers=8 \
                  lightly.trainer.max_epochs=0 \
                  stopping_condition.n_samples=500000 \
                  remove_exact_duplicates=True

Our Interfaces

Web App
  • < 1'000 samples
  • Drag & drop (no coding required)
  • 2048-bit SSL encryption
  • Visual Analytics
Python PIP Package (CLI)
  • < 25'000 samples
  • Train custom embedding models using self-supervised learning
  • Option to only upload non-sensitive metadata
On-Premise (Docker)
  • Already used by Fortune 500 companies to process > 1'000'000 samples
  • Neither your raw data nor metadata leave your server
  • Analytics reports

Customer Case Studies

AI Retailer Systems

Learn how AI Retailer Systems was able to reduce the data required to train an object detection model by 85% with almost no loss in accuracy thanks to Lightly.

"I was truly amazed once we received the results of Lightly. We knew we had a lot of similar images due to our video feed but the results showed us how we can work more efficiently by selecting the right data"

Alejandro Garcia, CEO 


"After training a model on the filtered data suggested by Lightly, I saw a dramatic increase in performance on our key metrics. Part of this is certainly because this was the first time we trained a model on any data that we've collected, but I'm fairly certain that performance would not have been as good if we had chosen what data to label at random."

Angelo Stekardis, Computer Vision Lead


"Lightly helped us understand more about our own data gathering process. Through their service, we were able to see, that a lot of data being collected was not meaningful enough for training an accurate model. This led us to change the way we gathered data and allowed us to ultimately create a much more information dense and higher quality dataset overall. Needless to say, the performance of our final model was greatly improved."

Nasib Adriano Naimi, Autonomy and Robotics Engineer

Our Blog

Sustainable AI and the New Data Pipeline

Deep learning's requirement of Big and Smart Data is currently met by labor-intensive processes of data labeling and cleaning. Self-supervised learning challenges this paradigm by enabling a more sustainable data pipeline.

The Advantage of Self-Supervised Learning

A few personal thoughts on why self-supervised learning will have a strong impact on AI, from recent NLP to computer vision papers.

Embedded COVID mask detection on an Arm Cortex-M7 processor using PyTorch

How we built a visual COVID-19 mask quality inspection prototype running on-device on an OpenMV-H7 board and the challenges on the way.


Improve your data
Today is the day to get the most out of your data. Share our mission with the world — unleash your data's true potential.
Contact us