The Best Data Curation Tools for Computer Vision in 2022

Integrating a quality data curation tool into your ML pipeline will have a direct impact on the quality and performance of your model. With so many solutions on the market, it can be difficult to get a clear understanding of which to choose. In this article, we describe the top data curation tools of 2022.

What is data curation and why is it important?

Data curation is a relatively new focus in the machine learning pipeline. Put broadly, it is the management of data throughout its lifecycle as it is used, evaluated, and reused. In practice, however, it involves using relevant tooling and filtering techniques to identify what data works and what data doesn’t. Additionally, data curation software typically allows teams to identify edge cases and measure key metrics from their dataset. 

Before ML teams started to pay attention to data curation, raw data was simply labeled and fed into a model. But with inevitable redundancy, bias, and insufficient data, this approach has proven incomplete. Since the performance of a machine learning model can be traced directly back to the data used to train it, data quality should be the top priority in any ML operation.

This is where data curation comes in.

Figure 1: The machine learning pipeline, highlighting the data curation step as the most important regarding overall model performance. Illustration by author.

2022 will not only be the year of data curation; it is bound to be a particularly innovative year for the machine learning community at large. With a significant shift in the model-development paradigm from model-centric to data-centric AI in 2021, the priorities in most ML projects have changed. Instead of asking how to change their models to improve performance, engineers are increasingly asking how to systematically change their data.

With all eyes on the data, data curation has become more crucial than ever. This means that the data curation tool you use has a direct impact on the quality and performance of your model. That’s why it’s important to understand all the data curation tool options available on the market today, to ensure you make the right choice for your particular use case.  

The best tools for data curation empower ML teams to:

  • Select the right data: You should be able to choose the most informative samples within your dataset for model training, query data based on metadata (e.g. location), and rebalance and manipulate distributions to curate an ideal training dataset.
  • Integrate easily into your data pipeline: A quality data curation tool should be able to fit into your existing ML pipeline and storage.
  • Perform data-based model debugging: Find false positives and negatives, label mistakes, and other model errors in your data.
  • Visualize data: Swiftly identify bias, outliers, and edge cases in your dataset's distributions.
  • Manage datasets: Collaboration features are important to ensure your team can seamlessly work together.
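As a minimal sketch of the first capability, not tied to any specific tool, here is how querying by metadata and rebalancing a class distribution might look, assuming a hypothetical metadata table with pandas:

```python
import pandas as pd

# Hypothetical metadata table: one row per image sample.
metadata = pd.DataFrame({
    "image_id": [f"img_{i}" for i in range(8)],
    "location": ["city", "city", "highway", "highway",
                 "city", "highway", "city", "highway"],
    "label":    ["car", "car", "car", "truck",
                 "truck", "truck", "car", "car"],
})

# 1) Query by metadata: keep only highway scenes.
highway = metadata[metadata["location"] == "highway"]

# 2) Rebalance the class distribution: sample an equal number of
#    images per label (the size of the rarest class) to curate a
#    balanced training subset.
per_class = highway.groupby("label").size().min()
balanced = highway.groupby("label").sample(n=per_class, random_state=0)
```

In a real pipeline the metadata would come from your storage or labeling platform rather than a hand-written DataFrame, but the query-then-rebalance pattern stays the same.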

Keep reading to learn about the most commonly-used data curation tools on the market today and see if one could be a good fit for your computer vision project.

Data Curation Tools
  1. Scale Nucleus

Figure 2: Screen capture from Scale Nucleus

Scale Nucleus, launched in 2020 by Scale, allows teams to collaborate on the same platform and send data to be labeled, either by Scale's workforce or within the built-in editor, in a few clicks. Currently, the platform supports image data only. Other valuable features include custom metadata search, intuitive error debugging with error metrics, and integration via its API (though on-premise deployment is not currently supported).


Pros:
  • Directly integrates with the Scale labeling workforce
  • Predictions and labels can be visualized to find model failures in the labeled dataset
  • Users can query for similar images in the dataset


Cons:
  • Limited functionality when working with large unlabeled datasets
  • Users are locked in to the Scale ecosystem
  • Data selection is done manually through the user interface

  2. Aquarium Learning

Figure 3: Screen capture from Aquarium Learning

Aquarium Learning is a machine learning data management platform focused on improving training data. It supports the curation of a variety of data types, including image, 3D, audio, and text data. It is accessible via Aquarium's API or its cloud platform, but not as an on-premise solution. Aquarium places a great deal of importance on features that allow teams to identify mislabeled samples, corrupted data, and edge cases, playing to the solution's strength in maintaining and curating training datasets.


Pros:
  • Supports use of embeddings to find clusters of similar images
  • Integrates with various labeling providers
  • Can be used to create and manage “data issues”


Cons:
  • Does not work with videos
  • Limited functionality when working with large unlabeled datasets
  • Data selection is done manually through the user interface

  3. Labelbox

Figure 4: Screen capture from Labelbox

Labelbox’s solution focuses on the training-data iteration loop of the ML pipeline. The platform is designed to iterate over three main tasks: annotating data, diagnosing model performance, and prioritizing data based on those results. Collaboration features are also placed at the forefront of the platform to streamline teamwork on projects; these are particularly tailored to remote teams.


Pros:
  • Directly integrates with Labelbox’s labeling tool
  • Supports use of embeddings to find clusters of similar images


Cons:
  • Data selection is done manually through the user interface

  4. Lightly

Figure 5: Screen capture from Lightly

Lightly is data curation software for computer vision. Unlike other solutions, it scales to tens of millions of input images. It uses self-supervised learning to find clusters of similar data within a dataset. Lightly’s algorithms can then create a well-balanced, high-quality subset on which the model can be trained and re-trained. Ultimately, this helps reduce overfitting, bias, and the data patterns that lead to model failures.
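The clustering idea can be illustrated with a generic sketch — this is not Lightly's actual algorithm, and the random vectors stand in for embeddings that would normally come from a self-supervised model. The sketch clusters the embedding space with k-means and keeps the sample nearest each cluster center, yielding a diverse subset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for self-supervised embeddings; in practice these would
# be produced by a trained model, one vector per image.
embeddings = rng.normal(size=(1000, 64))

# Cluster the embedding space, then keep the single sample closest
# to each cluster center: near-duplicates collapse into one pick.
k = 50
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
dists = km.transform(embeddings)    # (n_samples, k) distances to centers
subset_idx = dists.argmin(axis=0)   # one representative index per cluster
```

The selected indices then point at the images to keep for labeling and training; everything else in a cluster is treated as redundant.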

Another Lightly feature growing in popularity among ML teams is the on-premise Docker solution. Using Docker, the entire solution runs within your own infrastructure, and your data never leaves your servers. Depending on the use case, this more secure option can be advantageous.


Pros:
  • Data selection can be done through active learning algorithms
  • Works with videos and scales to datasets of several million frames
  • On-prem version available
  • Doesn’t require any labels to work properly


Cons:
  • Focuses on data problems and not model problems
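To make the active-learning idea from the pros list concrete, here is a generic uncertainty-sampling sketch (not any vendor's implementation): given hypothetical model probabilities for a pool of unlabeled images, it selects the samples the model is least confident about for labeling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model outputs: class probabilities for a pool of
# 10 unlabeled images over 3 classes (rows sum to 1).
probs = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=10)

# Uncertainty sampling: score each sample by how unsure the model
# is (1 minus the confidence in its top class), then pick the most
# uncertain samples for labeling under a fixed budget.
uncertainty = 1.0 - probs.max(axis=1)
budget = 3
selected = np.argsort(uncertainty)[-budget:][::-1]  # most uncertain first
```

Labeling these samples first tends to give the model the most new information per annotation dollar, which is the core appeal of active-learning-driven data selection.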

Want to learn more?

Many issues within the ML pipeline can be traced back to the data that is used to train models. Crippling issues like biased or overfitted models and high labeling costs can be avoided using a quality data curation tool to ensure that your model is fed the most important and highest quality data.

At Lightly we know a great model boils down to great data. That’s why we want to expose more computer vision teams to our data-centric solution. If you’d like to learn more about what we are doing at Lightly and what’s next for our solution, don’t hesitate to reach out to us here.
