The How-To Guide to Data Curation in Machine Learning


Data curation ensures datasets are accurate, consistent, and reliable for analysis and machine learning. Beyond cleaning, it adds context, metadata, and governance, creating long-term value and trust in data-driven decisions.

Ideal For: ML Engineers
Reading time: 7 mins
Category: Data


TL;DR

Here is a quick summary of the points covered in this article:
  • What is data curation?

Data curation is the practice of maintaining accurate, consistent, and trustworthy datasets. Curated datasets act as reliable assets that support analysis, research, and decision-making. In machine learning, they provide the foundation for models that are both robust and generalizable.

  • Why is data curation important?

Messy, inconsistent data can derail projects by producing unreliable insights and weak models. Curation mitigates this by preserving data quality and integrity and by aligning datasets with standards. Well-curated data provides a trusted basis for analysis, research, and decision-making, ensuring that both humans and AI systems rely on reliable information.

  • What are the steps in the data curation process?

The workflow begins with collecting raw data, then cleaning it by correcting errors and handling missing values, followed by annotation where necessary. Next, transformation and integration normalize and merge sources, and metadata and documentation add context. Finally, datasets are stored and shared in repositories or data warehouses.

  • Who is responsible for data curation? 

Dedicated data curators and data stewards usually lead the way. Curators enhance dataset quality and usability, while stewards establish governance, compliance, and preservation strategies. In practice, data engineers, data scientists, and analysts often share curation tasks, especially in smaller teams, where technical work and oversight overlap.

  • Is data curation the same as data cleaning?

Data cleaning involves tasks such as removing duplicates, correcting errors, and handling missing values. Data curation extends beyond fixing problems, adding context, metadata, and integration to ensure that datasets remain consistent, reusable, and valuable over time. Cleaning improves quality in the moment, while curation ensures long-term trust and usability.

Introduction

Every modern organization is drowning in data, but only a fraction of it is truly useful. 

In fields like machine learning and scientific research, messy and inconsistent data can degrade model performance and lead to inaccurate predictions or flawed business decisions.

Smarter curation workflows ensure that only the most reliable and relevant data powers your models.

In this article, we will cover:

  1. What is data curation?
  2. Why is data curation important?
  3. The data curation process
  4. LightlyOne data curation tutorial
  5. Common challenges in data curation and how to address them
  6. Real-world applications in AI/ML pipelines

Curating large datasets manually is slow, error-prone, and often results in redundant or biased samples being overlooked. Teams need a way to focus only on the most valuable data without wasting time and resources.

  • LightlyOne streamlines data curation by automatically selecting diverse and representative samples from massive datasets. This reduces redundancy, improves data integrity, and ensures models train on the most valuable data.

In parallel, LightlyTrain enables self-supervised pretraining on unlabeled domain-specific data, producing stronger feature representations. These representations speed up fine-tuning and improve model generalization, especially when labeled data is limited.

What is Data Curation?

Data curation is the process of turning raw information into reliable, usable datasets. It includes data collection, cleaning, integration, and metadata management. 

These measures ensure datasets are organized, consistent, and ready for long-term data preservation, compliant with data security standards, and aligned with data policies.

It’s a continuous task. 

Datasets must be reviewed regularly to ensure accuracy, completeness, and data accessibility. 

Figure 1: Data curation process.

The Purpose of Data Curation

In computer vision, data curation refines raw images into usable datasets by removing duplicates, fixing labels, and balancing classes. This enhances performance in classification, detection, and segmentation tasks where label accuracy is crucial.

For vision–language models, the role of curation extends further. Large image–text datasets often contain noisy, ambiguous, or culturally biased captions that misalign with the visuals. 

Curating these pairs ensures that images are matched with precise, unbiased text, filtering out misleading examples. 

This alignment helps models learn genuine semantic relationships between vision and language, rather than memorizing shortcuts or artifacts in the data.  

Why Data Curation Matters

Data curation falls within the broader discipline of data management. It sits between collection and storage in the management lifecycle, bridging the cleaning, annotation, and conversion of datasets into reusable formats.

Pro tip: Looking for a data annotation tool? Check out 12 Best Data Annotation Tools for Computer Vision (Free & Paid). 

Without curation, management systems can become sources of inconsistent, mislabeled, or fragmented data. This can lead to flawed business decisions, compliance risks, and wasted resources. 

Structured collection, accurate annotation, and careful transformation prevent these issues. They make sure that data entering repositories is not only available but also meaningful. 

This strengthens ongoing management by keeping datasets clear today and easy to maintain in the future.

Data Curation vs. Data Management

Although data curation and data management often work together, their focus and responsibilities differ. 

This table highlights those differences.

Table 1: The differences between Data Curation and Data Management.
Aspect | Data Curation | Data Management
Scope of Work | Refines raw data through cleaning, data integration, and metadata creation to produce curated data assets. | Handles storing data, applying institutional processes, and managing data at scale.
Components | Involves data identification, data collection, data transformation, metadata management, and curation activities such as record-level or file-level curation. | Includes data repositories, cloud data storage, and continuous data enrichment.
Scalability | Limited by tasks like handling missing values, labeling, and ensuring data quality. For example, large image datasets or historical records require automation and active learning to handle millions of samples without losing data quality. | Enabled by infrastructure (cloud, data warehouses) to support petabyte-scale data assets.
Technical Complexity | Deals with data quality assurance issues, bias, missing data, mislabeled samples, and schema mismatches. | Deals with ensuring data integrity, compliance with legal documents and sensitive data standards, and securing pipelines to support effective data management.
Key Stakeholders | Driven by data curators, data stewards, and data analysts who focus on organizing data and delivering valuable assets for reuse. | Led by IT, engineers, governance teams, and data professionals who manage institutional processes, build data warehouses, and safeguard data assets.

Both data curation and data management complement each other to provide high data quality and integrity. When combined, they make data more accessible, reusable, and reliable for long-term value.

The Data Curation Process: Key Steps and Activities

A sound data curation workflow provides the processes and checks needed to make raw information accurate, consistent, and ready for advanced applications. Each stage builds on the previous one, strengthening data quality, integrity, and reusability.

While specific methods may vary between teams, the following steps are common to most data curation workflows:

Data Identification & Collection

Curation begins with data identification. It involves determining what datasets are needed and where to source them. In computer vision, this means selecting images, video frames, or sensor outputs and ensuring the dataset has sufficient variation to minimize bias. 

Data collection focuses on acquiring relevant datasets through APIs, repositories, or capture pipelines. Standardizing formats early, like image resolutions or file types, reduces pipeline problems.

Data Cleaning 

Raw inputs often contain errors, duplicates, or inconsistencies that weaken reliability. For vision tasks, this may include corrupted image files, mislabeled classes, or near-duplicate photos that skew the results. 

Data cleaning resolves these by removing faulty samples and enforcing consistent formats. Automated preprocessing pipelines resize, normalize, and validate inputs so only accurate and consistent records move forward. 

Tools like Labelformat, an open-source library by Lightly, can further streamline this process by converting and validating annotations across formats such as COCO, YOLO, and Pascal VOC, making it easier to keep datasets clean and interoperable.
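To make this concrete, here is a minimal sketch of such a cleaning pass in Python. It assumes a flat folder of JPEG images and uses Pillow plus the third-party imagehash package; the folder path and hash-distance threshold are illustrative and not part of any Lightly API.

from pathlib import Path
from PIL import Image
import imagehash  # third-party: pip install pillow imagehash

def clean_image_folder(folder: str, max_hash_distance: int = 4) -> list[Path]:
    """Keep only readable, non-near-duplicate images from a folder."""
    kept, seen_hashes = [], []
    for path in sorted(Path(folder).glob("*.jpg")):
        # Drop corrupted files: verify() raises on truncated or invalid images.
        try:
            with Image.open(path) as img:
                img.verify()
        except Exception:
            continue
        # Drop near-duplicates: small Hamming distance between perceptual hashes.
        current_hash = imagehash.phash(Image.open(path))
        if any(current_hash - previous <= max_hash_distance for previous in seen_hashes):
            continue
        seen_hashes.append(current_hash)
        kept.append(path)
    return kept

# Usage (illustrative path):
# clean_paths = clean_image_folder("data/raw_images")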

Data Annotation

Annotation provides structure by labeling and tagging raw data. Examples include drawing bounding boxes, marking keypoints, or applying segmentation masks to turn raw pixels into meaningful training data.

The primary challenge is precision and consistency, as annotators may perceive images differently or overlook edge cases. 

The use of automated tools and clear labeling guidelines can minimize errors and preserve quality.
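For readers who have not worked with annotation formats before, the sketch below shows what a single bounding-box label looks like in a COCO-style JSON file, written out with Python. The file name, category, and pixel coordinates are made-up values for illustration.

import json

# A minimal COCO-style object detection annotation (illustrative values).
coco_example = {
    "images": [
        {"id": 1, "file_name": "street_0001.jpg", "width": 1920, "height": 1080}
    ],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [420.0, 310.0, 180.0, 95.0],  # [x, y, width, height] in pixels
            "area": 180.0 * 95.0,
            "iscrowd": 0,
        }
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco_example, f, indent=2)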

Figure 2: Data annotation process.
Pro tip: For teams that need expert support, Lightly AI’s Data Annotation Services provide scalable, high-quality labeling for computer vision and LLM projects - check it out. 

Data Transformation & Integration

Transformation converts and normalizes data into consistent formats, such as scaling pixel values or unifying annotation schemas. This ensures comparability across datasets and prepares them for efficient model training.

Integration merges multiple inputs, like combining two object detection datasets with different labeling styles into a coherent whole. This expands coverage while reducing mismatches and redundancy.
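As a simple illustration of both steps, the sketch below scales pixel values into [0, 1] and maps two hypothetical source datasets with different class names onto one shared label schema; the dataset names and labels are placeholders.

import numpy as np

# Map each source dataset's class names onto one shared schema (placeholder labels).
LABEL_MAP = {
    "dataset_a": {"automobile": "car", "person": "pedestrian"},
    "dataset_b": {"car": "car", "ped": "pedestrian"},
}

def normalize_image(image: np.ndarray) -> np.ndarray:
    """Scale uint8 pixel values into [0, 1] so sources become comparable."""
    return image.astype(np.float32) / 255.0

def unify_label(source: str, raw_label: str) -> str:
    """Translate a source-specific class name into the shared schema."""
    return LABEL_MAP[source][raw_label]

# Usage:
# image = normalize_image(raw_uint8_image)
# label = unify_label("dataset_b", "ped")  # -> "pedestrian"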

Pro tip: Check out our Guide to Data Augmentation.

Metadata Creation & Documentation

Metadata provides essential context such as capture device, resolution, or lighting conditions. Without this, vision datasets can’t be reliably interpreted or reused, as key details about image origin and quality are lost.

Documentation further supports discovery and sharing by making datasets transparent. Standards like JSON or CVAT XML help keep records structured and reusable across projects.
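One lightweight way to capture this context is a JSON sidecar file stored next to each image, as in the sketch below; the field names and values are illustrative rather than a required schema.

import json
from pathlib import Path

def write_metadata_sidecar(image_path: str, capture_device: str,
                           width: int, height: int, lighting: str) -> None:
    """Store basic capture context next to an image as a JSON sidecar file."""
    record = {
        "file_name": Path(image_path).name,
        "capture_device": capture_device,
        "resolution": {"width": width, "height": height},
        "lighting": lighting,
    }
    Path(image_path).with_suffix(".json").write_text(json.dumps(record, indent=2))

# Usage (illustrative values):
# write_metadata_sidecar("frames/frame_0042.jpg", "GoPro HERO11", 1920, 1080, "overcast")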

Data Storage, Publication & Sharing

Once a dataset is curated, it needs to be stored in a way that is both accessible and secure. 

Visual data often comes in large volumes, such as collections of images or video, and this calls for storage systems that are efficient, scalable, and quick to retrieve from.

To be shared effectively, data requires clear access rules, verified integrity, and licensing terms that support collaboration.

Ongoing Maintenance

Curation doesn’t stop once the data is stored. 

In visual tasks, it’s important to keep adding new environments, objects, or conditions over time. For example, autonomous vehicles need updated data for situations like night driving to avoid model drift.

Regular updates, validation, and re-annotation help keep datasets accurate and useful. This ongoing effort ensures models stay reliable and aligned with the real world.

LightlyOne Data Curation Tutorial

The following tutorial walks through how LightlyOne streamlines the data curation process step by step.

Upload Your Dataset

Begin by creating a dataset in LightlyOne. This dataset becomes the central workspace for embeddings, selections, and metadata. Connect your storage (AWS S3, Azure, or GCS) with two data sources:

  • Input Datasource: list/read access only.

  • Lightly Datasource: list/read/write access for thumbnails, embeddings, and outputs.
Figure 3: Dataset overview.

Schedule a Run to Compute Embeddings

Launch your first selection run with the LightlyOne Worker. During the run, the Worker generates embeddings. These are compact numerical representations of your images that capture their visual features and allow for similarity comparisons. You can monitor run stages such as EMBEDDING and SAMPLING.
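The sketch below shows roughly what this looks like with the LightlyOne Python client. The token, dataset name, and sample budget are placeholders, and the datasource configuration step is omitted; treat it as an outline and follow the LightlyOne documentation for the exact setup in your version.

from lightly.api import ApiWorkflowClient

# Placeholders: use your own API token and dataset name.
client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN")
client.create_dataset("street-scenes-curation")

# Datasource setup (S3/Azure/GCS credentials) is assumed to be done already.
# Schedule a Worker run that embeds the images and keeps a diverse subset.
scheduled_run_id = client.schedule_compute_worker_run(
    selection_config={
        "n_samples": 5000,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}},
        ],
    },
)
print(f"Scheduled run: {scheduled_run_id}")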

Figure 4: LightlyOne workflow.

Filter and Curate the Data

With embeddings in place, you can apply Lightly’s selection strategies to remove duplicates and uninformative samples:

  • Diversity strategy: Keeps samples distinct, reducing redundancy.
  • Similarity search: Finds and removes near-duplicate images.

This step creates a leaner dataset, which helps cut labeling costs and improve dataset diversity.

Explore Results Visually

Inspect your curated dataset through the platform’s embedding view. Scatter plots and coverage metrics reveal clusters, outliers, and duplicates. 

Lightly also supports visualizations such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) for deeper inspection.
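If you prefer to inspect embeddings in your own notebook, a quick PCA projection is often enough for a first look. The sketch below assumes you have an embedding matrix saved as a NumPy file (the path is a placeholder) and uses scikit-learn and matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder path: an (n_samples, n_dims) embedding matrix exported from your run.
embeddings = np.load("embeddings.npy")

# Project to 2D to eyeball clusters, outliers, and duplicate-heavy regions.
points = PCA(n_components=2).fit_transform(embeddings)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1], s=4, alpha=0.5)
plt.title("Dataset embeddings (PCA projection)")
plt.show()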

Figure 5: Visualizing the generated embeddings.

Export for Labeling or Training

Once curation is complete, export your selection as filenames (with optional signed read URLs) or download files directly. These curated subsets can then be sent for labeling or used in training pipelines.

Figure 6: Exporting the dataset.

Using LightlyOne, you can convert messy image collections into valuable datasets by filtering duplicates, ensuring diversity, and focusing on informative samples.

Best Practices for Data Quality, Integrity, and Security

Strong data curation relies on clear principles that protect quality, integrity, and security. The curation practices below highlight how curated datasets can stay accurate, traceable, and ready for reuse.

Implement Rigorous Quality Checks

Datasets often contain duplicates or overly similar samples that waste labeling budgets and reduce diversity. To solve this, make sure only distinct samples are selected by setting a minimum distance between them in the embedding space.

LightlyOne helps with this by offering the DIVERSITY strategy, which includes a setting called stopping_condition_minimum_distance that prevents the selection of samples that are too similar.

"strategy": {
"type": "DIVERSITY",
"stopping_condition_minimum_distance": 0.2,
"strength": 0.6
}

Create Feedback Loops with Data Users

Curation needs to adapt as models uncover gaps or uncertain cases. LightlyOne supports this by using active learning. It first picks diverse samples, then adds the uncertain predictions flagged by the model during inference.

Ensure Clear Ownership and Accountability

Without clear ownership, teams can get confused about which data was curated or used for training. LightlyOne lets teams create and register datasets directly through its web interface or Python client.

Each dataset becomes a central place to store metadata, selected samples, and curation runs, which helps keep everything organized and transparent.

Apply the Principle of Least Privilege (Security)

Limiting access to only what is needed is a key part of safe data curation. LightlyOne supports this by integrating with AWS S3, Azure, and GCS using two data sources. One has read-only access for inputs, and the other has write access for curated outputs.

The LightlyOne Worker automatically verifies these permissions to keep things secure while still being easy to use.

Validate Data Integrity and Lineage

LightlyOne allows users to generate embeddings and explore them using dimensionality reduction techniques like PCA. These embeddings make it possible to run similarity searches and visual analyses.

Visualizing clusters, spotting outliers, and identifying duplicates helps users better understand how a dataset is structured and how it has evolved over time.

Common Challenges in Data Curation and How to Address Them

Even with well-defined curation practices, real-world projects face challenges that can slow pipelines and compromise data quality. Addressing these issues is key to maintaining efficient data curation and making datasets usable in data science and AI. The following points highlight a few issues related to computer vision, though many also apply to other AI domains.

  • Missing or Incomplete Data: Vision datasets often lack labels for uncommon cases, such as rare road signs in autonomous driving, which biases model performance. Techniques like semi-supervised labeling, data augmentation, or synthetic image generation can help fill these gaps while limiting additional bias.
  • Noisy and Dirty Data: Large image datasets are often contaminated with mislabeled examples (e.g., cats labeled as dogs), corrupted images, or near-duplicate images in video sequences. Automated cleaning pipelines that combine perceptual hashing and label validation are used to restore trust in training sets.
  • Heterogeneous Data and File Formats: Incorporating aerial drone images, CCTV footage, or smartphone photos can introduce discrepancies in resolution, color, and metadata. Normalizing formats and scales improves integration.
  • Scaling and Performance Issues: As datasets expand, pipelines need to balance volume with speed, making automation essential. Aigen, an agricultural robotics company, used Lightly to prune massive image datasets by 80–90% while preserving edge-case diversity. This cut compute costs and doubled deployment efficiency. 
  • Integrating Data from Disparate Sources: Urban planning involves matching schemas (e.g., GPS metadata, timestamp formats) to combine satellite images and street-level images. Mapping rules are automated to reduce mismatches and enhance the coherence of data sets.
  • Maintaining Data Consistency Over Time: Models trained on clear-weather traffic scenes may degrade when deployed in snowy or nighttime conditions. Continuous drift monitoring flags these shifts, prompting updates with diverse weather or lighting data.
  • Sensitive Data and Privacy Issues: Working with medical histories or legal documents involves a balance between compliance and access. Beyond corporate settings, non-profit data repositories also face privacy and compliance challenges. Encrypting storage and restricting access keep these records secure while preserving their usability for long-term reuse.
  • Institutional and Cultural Barriers: Teams may undervalue dataset curation compared to model development. In computer vision research labs, leadership support and structured annotation guidelines help embed curation as a standard practice, not an afterthought.

Well-curated pipelines protect an organization’s data from bias creeping into models, helping teams build systems that are both fair and reliable.

Real-World Applications in AI/ML Pipelines

Curated datasets anchor every stage of development in AI and ML pipelines. They provide high-quality training inputs, preserve lineage for auditability, and ensure compliance across sensitive domains.

With these foundations, models not only perform better but also remain reliable as data scales and evolves in production.

High-Quality Training Datasets for Better Models

Curated datasets resolve issues such as misspelled product names, missing ratings, or overlapping GPS entries that introduce noise. Standardizing formats across currencies, time zones, and categories keeps inputs consistent. 

Validating distributions ensures data reflect realistic proportions rather than skewed samples. These steps help models focus on reliable signals instead of errors, improving generalization and performance on unseen cases.
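A distribution check can be as simple as counting label frequencies and flagging classes that dominate the dataset, as in the sketch below; the 50 percent threshold is an arbitrary example, not a recommendation.

from collections import Counter

def check_label_balance(labels: list[str], max_share: float = 0.5) -> dict[str, float]:
    """Report each class's share of the dataset and warn when one class dominates."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    for label, share in shares.items():
        if share > max_share:
            print(f"Warning: '{label}' makes up {share:.0%} of the samples")
    return shares

# Usage:
# check_label_balance(["car", "car", "car", "pedestrian"])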

Onboarding Data Curators into ML Teams

Bringing a data curator into an ML team adds expertise in metadata management, data documentation, and dataset validation. This role supports engineers and scientists by making datasets reproducible, transparent, and easy to audit.

Using Curated Datasets for Scalable Reuse

Well-curated datasets can be reused across experiments without requiring the curation process to be repeated. This speeds up model development, lowers costs, and maintains consistent evaluation baselines. 

In practice, the ability to reuse data becomes a strategic advantage when scaling AI projects.

Maintaining Historical Data Records (Data Lineage in ML)

Tracking dataset versions and maintaining data lineage helps teams trace model outputs back to their sources. In fields such as medical histories, census data analysis, and other regulated areas, a strong lineage preserves accountability and prevents updates from erasing important context.

Continuous Data Curation in Production (MLOps)

Curated pipelines need to adapt as data evolves, especially in production environments. Automated monitors flag distribution shifts, schema mismatches, and missing records. This allows for corrections before performance declines. 

These ongoing practices build resilience into MLOps and keep models aligned with data quality and operational standards.

Domain-Specific Data Curation 

Curated data practices bring value across industries, each with its own critical use cases:

  • Healthcare: Data curation standardizes and balances medical records across populations. It helps address gaps and inconsistencies that create bias. With cleaner inputs, diagnostic models predict more fairly and support more reliable patient outcomes.
  • Finance: Fraud prevention relies on accurate anomaly detection among massive transactions. Data curation unifies financial streams by resolving fragmentation and noise. Integrating sources into consistent records makes anomalies clearer and false positives fewer.
  • IoT & Manufacturing: Sensor data often contains noise, misaligned signals, or equipment-specific quirks. Data-level curation applies filters that normalize these inputs. The result is predictive maintenance that stays accurate and helps prevent costly downtime.

Collaboration Between Domain Experts and Curators

Proper curation requires both technical workflows and domain knowledge. Domain experts provide context and understanding of data intricacies, while curators turn that into a structured form. Together, they turn raw data into reliable assets for decision-making.

How Lightly Helps in Data Curation: An Overview

As briefly mentioned at the beginning, one of Lightly's products, LightlyTrain, is built on self-supervised learning (SSL). This approach helps machine learning teams cut dataset size, reduce annotation costs, and improve model generalization.

Here’s a short overview of how LightlyTrain and LightlyOne help ML teams in their data curation process.

Smart Sample Selection

Lightly applies self-supervised models (e.g., SimCLR) to extract embeddings from raw images without labels. These embeddings are used to compute similarity matrices and detect redundancy. 

For dataset curation, LightlyOne uses embedding-based selection strategies to filter and prioritize data automatically.

These strategies include diversity, typicality, and similarity, which identify the most representative samples while reducing redundancy.

Embeddings & Visualization

Images are transformed into high-dimensional feature vectors (32–128D embeddings) that represent their semantic content. LightlyOne supports dimensionality reduction methods such as PCA, t-SNE, and UMAP to project embeddings into 2D/3D for exploration. 

This allows practitioners to visualize clusters, identify anomalies, and remove noise before training.

Efficiency & Scale

LightlyOne 3.0 introduces several performance improvements that make large-scale data curation more practical. It delivers roughly 6x higher throughput on large datasets and uses about 70 percent less memory.

It also adds a typicality-based selection algorithm that prioritizes samples based on how representative they are, rather than relying only on diversity. These updates make it possible to curate millions of samples efficiently in production environments.

Reliable Similarity through Self-Supervised Losses

The lightly.loss module includes functions such as Barlow Twins, DCL, and NT-Xent, which are widely used in representation learning. 

These loss functions are incorporated in LightlyTrain pipelines to produce embeddings that support tasks like deduplication, diversity filtering, and typicality ranking.

This keeps curated datasets compact, representative, and well-aligned with downstream training needs.
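As a small illustration, the sketch below computes the NT-Xent loss from lightly.loss on two batches of embeddings. In a real pipeline the two batches would come from a projection head applied to two augmented views of the same images; here random tensors stand in for them.

import torch
from lightly.loss import NTXentLoss

criterion = NTXentLoss(temperature=0.5)

# Stand-ins for the projection-head outputs of two augmented views of a batch.
view_0 = torch.randn(32, 128, requires_grad=True)
view_1 = torch.randn(32, 128, requires_grad=True)

loss = criterion(view_0, view_1)
loss.backward()  # in SSL pretraining, gradients flow back into the encoder
print(f"NT-Xent loss: {loss.item():.4f}")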

Pro tip: Learn more about PyTorch Loss Functions here. 

Conclusion

Data curation ensures that raw information evolves into structured, high-quality datasets ready for long-term use. Each stage of the curation process protects data integrity, boosts accessibility, and mitigates risks associated with errors or bias.

Well-curated pipelines ensure organizations rely on trusted data for their models and analytics. In AI and ML pipelines, curation supports the development of models that remain accurate, scalable, and dependable.

Get Started with Lightly

Talk to Lightly’s computer vision team about your use case.
Book a Demo
