You can use our web app, our command-line interface, or the Docker container to filter your dataset. The command-line interface comes in handy when you are already using a cloud server to train your deep learning models.
We allow you to optimize your dataset for various tasks. Both the command-line interface and the web application allow for coarse optimization for classification, object detection, segmentation, and GANs. More fine-grained control can be achieved with the Docker container.
After you submit your dataset with your preferred parameters, our data filtering software analyzes it. Lightly automatically removes corrupt files and rebalances the dataset on a feature level. Depending on your filter preference, either near-duplicates are removed or a new dataset is created from the most important samples. We will share more details about how exactly we filter the datasets in this blog post.
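To make the idea of near-duplicate removal on a feature level more concrete, here is a minimal sketch in plain Python. It is not Lightly's actual algorithm: it assumes each sample has already been mapped to an embedding vector, and the function names, the cosine-similarity criterion, and the threshold of 0.98 are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(embeddings, threshold=0.98):
    """Return indices of samples to keep (hypothetical helper).

    A sample is dropped when its similarity to any already kept
    sample reaches the threshold. The threshold is an assumption
    and would need tuning for a real dataset.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector is almost identical to the first.
embeddings = [
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
    [0.7, 0.7],
]
kept = remove_near_duplicates(embeddings)
print(kept)
```

This greedy pass compares every sample with every kept sample, so it scales quadratically in the worst case; a production system would more likely rely on an approximate nearest-neighbor index.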
You will be able to either download a list of final filenames or a clean dataset. Additionally, we provide you with a report showing you more details about how Lightly processed your data.
Lightly makes use of representation learning through self-supervised methods to understand raw data. It can therefore be used before any data annotation step. The learned representations can be used to analyze and visualize your datasets as well as for selecting a core set of samples that can then be used for further steps in the data preparation pipeline. The same algorithms power our active-learning library to help you with iterative active-learning loops.
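One common way to select a core set from learned representations is greedy k-center (farthest-point) selection. The sketch below illustrates that general technique in plain Python; it is an assumption about how such a selection could work, not a description of Lightly's implementation, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_core_set(embeddings, n_samples, start_index=0):
    """Greedy k-center selection (hypothetical helper).

    Repeatedly picks the sample that is farthest from all samples
    selected so far, so the selection spreads over the feature space.
    """
    if not embeddings or n_samples <= 0:
        return []
    n_samples = min(n_samples, len(embeddings))
    selected = [start_index]
    # Distance from every sample to its nearest selected sample.
    min_dist = [euclidean(e, embeddings[start_index]) for e in embeddings]
    while len(selected) < n_samples:
        next_index = max(range(len(embeddings)), key=lambda i: min_dist[i])
        if min_dist[next_index] == 0.0:
            break  # every remaining sample duplicates a selected one
        selected.append(next_index)
        for i, e in enumerate(embeddings):
            d = euclidean(e, embeddings[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy embeddings forming three loose groups.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.1], [0.0, 4.0]]
core = select_core_set(embeddings, 3)
print(core)
```

Because the selection only needs embeddings, it can run before any labels exist, which matches the point above that the representations are usable ahead of annotation.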
Preparing and organizing data for machine learning has never been easier. With our platform, anyone can become a data preparation engineer. Visual feedback helps you understand which samples are in your dataset and which have been removed. Keep track of different dataset versions using tags. Collaborate with your team and share the final datasets with the ML engineers who train and evaluate your models.
Our data preparation platform gives you unique insights into your raw data. Our analytics report the statistics of your dataset and provide graphs such as histograms.
You can also explore your raw data using our automatically generated dimensionality reduction projections, such as UMAP, PCA, and t-SNE, which let you visually evaluate and select your data.
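Of the projections mentioned, PCA is the simplest to illustrate. The following sketch projects embeddings onto their first two principal components using power iteration in plain Python. It is a teaching example under stated assumptions (hypothetical function names, a fixed random seed, a fixed iteration count), not the platform's implementation; in practice a numerical library would be the usual choice.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(data):
    """Subtract the per-dimension mean from every row."""
    n, dim = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(dim)]
    return [[row[k] - means[k] for k in range(dim)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, dim = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def top_eigenvector(matrix, iterations=200, seed=0):
    """Dominant eigenvector via power iteration (seeded random start)."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(data):
    """Project rows onto the first two principal components."""
    centered = center(data)
    cov = covariance(centered)
    first = top_eigenvector(cov)
    eigenvalue = dot(first, [dot(row, first) for row in cov])
    # Deflation: remove the first component so the next one dominates.
    dim = len(cov)
    deflated = [
        [cov[i][j] - eigenvalue * first[i] * first[j] for j in range(dim)]
        for i in range(dim)
    ]
    second = top_eigenvector(deflated)
    return [[dot(row, first), dot(row, second)] for row in centered]


# Toy 3D embeddings whose variance lies mostly along the first axis.
data = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
projection = pca_2d(data)
print(projection)
```

The sign of each component depends on the random starting vector, so only the relative layout of the projected points is meaningful.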
With every filtering run you will also receive an automatically generated analytics report that gives you an overview of the most important statistics and figures. It also shows sample images that were kept and removed, for visual quality inspection. You can download a sample report (PDF) below.
Our unique Smart Data Pool product comes with an in-house collaboration feature. It allows all of your teams to contribute to the company data warehouse. It follows a simple three-step process (see illustration below).
After the process, the new samples are added to the "warm"/"on-cloud" storage. The complete raw data can still be kept in "cold" storage if desired.