Optimizing Generative AI: The Role of Data Curation

Exploring the pivotal role of data curation in Generative AI: A deep dive into experiments with diffusion models. Discover the balance between data quality and model efficacy.

In the field of artificial intelligence, the emphasis has often been on amassing more data. But as generative AI models, especially in computer vision, gain prominence, the focus is shifting towards the quality of data over its quantity. The move towards self-supervised learning methods seemed to increase the need for volume. However, our research into data curation for generative AI suggests otherwise.

This article delves into the role of curated data in generative AI and aims to address a central question: How does data curation impact the optimization of generative AI models in computer vision?

Example training images from the Virtual Tryon dataset used in our experiments. The images are center-cropped and resized to 128x128 pixels before training the diffusion models.

Data in AI Training: Evolution and Implications

Deep learning’s advancement is fundamentally tied to the data it consumes. Traditionally, vast data volumes were believed to optimize model performance, prompting a race to acquire and deploy as much data as possible. However, recent shifts towards self-supervised learning, as seen in foundational models such as CLIP (Radford et al., 2021) and LLAMA (Touvron et al., 2023), challenge this belief. These models use vast corpuses, comprising millions of images or billions of tokens, that go beyond human annotation capabilities.

Yet, a growing body of research suggests that the sheer volume isn’t the sole key to success. Papers like LLAMA (Touvron et al., 2023), DINOv2 (Oquab et al., 2023), LLAVA (Liu et al., 2023), and the recent Emu (Dai et al., 2023) and MetaCLIP (Xu et al., 2023) all indicate a consistent pattern:
Models can achieve superior performance when fine-tuned or trained from scratch on smaller but high-quality datasets.

For instance, the PixArt-α (Chen et al., 2023) model highlights that improved captions in image-text pair datasets notably enhance vision-language models.

Figure 2 from the PixArt-α paper shows the reduction in required training data and compute. One of the three major contributions of this paper is the improved quality of the training dataset.

Given this backdrop, our investigation centers on the impact of data curation methods on the training of generative AI models, specifically diffusion models. By scrutinizing data curation’s role in this domain, we aim to provide a more nuanced understanding of optimizing AI training.

We start with experiments on data curation for generative AI models in the computer vision domain. More specifically, we try to answer the question of which data curation methods have the biggest impact on training high-quality diffusion models.

When we set out with the experiments, we had the following hypotheses:

  • Generative models such as GANs and Diffusion models benefit from having diverse training data
  • Outliers harm the training process of generative models as it is inherently difficult to learn concepts from few examples

Significance of Data Curation in AI Training

The efficacy of a generative model is heavily contingent on the data it’s fed. Data curation emerges as an essential process here, primarily for two reasons:

  1. Quality over Quantity: As suggested by papers like LLAMA, DINOv2, and LLAVA, superior model performance can often be achieved with smaller, high-quality datasets rather than with massive, uncurated ones. Data curation ensures that training datasets are free of noise, irrelevant instances, and duplicates, thus maximizing the efficiency of every training iteration.
Table from the DINOv2 paper showing the results of the same model with the same training and evaluation procedure, varying only the training data. The last two rows show a subset of LVD-142M, once sampled randomly (3rd row) and once selected using data curation (last row). The curated data yields better performance of the trained model on the evaluation datasets and is on par for ADE-20k.
  2. Guided Data Distribution: In the absence of data curation, we remain at the mercy of raw datasets, with limited control over data distribution. Data curation allows for a nuanced selection of data points, ensuring the models aren’t skewed due to biases or disproportionate representation. Techniques ranging from simple deduplication to sophisticated data selection algorithms can be employed to fine-tune this distribution, ensuring the trained models behave predictably and effectively.

Experiments: Probing Data Curation’s Impact on Generative AI Models

In this section we describe the experiments as well as the evaluation protocol we used.

Dataset and Preprocessing

We employed the Virtual Tryon Dataset for our experiments, consisting of 11,647 images. To ensure model compatibility and maintain data quality, we subjected the images to:

  • Center cropping
  • Resizing to a resolution of 128x128 pixels.
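The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `center_crop_and_resize` is hypothetical, and the nearest-neighbour resize is a simplification chosen to keep the example dependency-light (a real pipeline would more likely use an image library's interpolating resize).

```python
import numpy as np

def center_crop_and_resize(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Center-crop an H x W (x C) image to a square, then resize to size x size."""
    height, width = image.shape[:2]
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    square = image[top:top + side, left:left + side]
    # Nearest-neighbour resize (assumption): map each output pixel centre
    # to the source pixel that contains it.
    idx = ((np.arange(size) + 0.5) * side / size).astype(int)
    return square[idx][:, idx]

# A dummy 256 x 192 RGB image stands in for a dataset image.
dummy = np.zeros((256, 192, 3), dtype=np.uint8)
print(center_crop_and_resize(dummy).shape)  # (128, 128, 3)
```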

Model Architecture and Training Parameters

Our experiments centered on the Denoising Diffusion Probabilistic Model (DDPM) sourced from this GitHub repository. Key training parameters include:

  • Batch size: 32
  • Training duration: Approximately 12 hours on a single RTX 4090 GPU
  • Iterations: 70,000 steps for the full training set (~11k images) and 20,000 steps for all 1k subset experiments.
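To put the step counts above into perspective, they can be converted into passes over the training data. The short calculation below assumes that one step processes exactly one batch of 32 images; the helper name `epochs_seen` is only illustrative.

```python
def epochs_seen(steps: int, batch_size: int, dataset_size: int) -> float:
    """Number of passes over the dataset, assuming one batch per step."""
    return steps * batch_size / dataset_size

full = epochs_seen(70_000, 32, 11_647)
subset = epochs_seen(20_000, 32, 1_000)
print(f"full dataset: {full:.1f} epochs")   # full dataset: 192.3 epochs
print(f"1k subsets:   {subset:.1f} epochs") # 1k subsets:   640.0 epochs
```

Under this assumption, the 1k-subset models see each training image more often than the full-dataset model, even though they are trained for fewer steps.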

Embeddings and Sampling

We leveraged CLIP ViT-B/32 for subsampling and DINOv2 ViT-L/14 embeddings for our metrics, providing a comprehensive evaluation framework for our generative outputs.

We use two different embedding models for sampling and metrics to make the evaluation metrics more independent from the data subsampling method.

We evaluate the following sampling methods in our experiments:

  • Random: We randomly subsample 1,000 images from the full training set
  • Coreset: We use the Coreset algorithm to find the 1,000 most diverse images based on their CLIP embeddings.
  • Typicality: We use a mix of diversity and cluster density to subsample 1,000 images.

The big difference between Coreset and Typicality is that Coreset also selects the outliers, because they are far away from the cluster centers. Coreset ignores the density of the data distribution. It does not select near-duplicates, as they would be too close to each other.
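The Coreset behaviour described above can be illustrated with a greedy k-center (farthest-point) selection, which is one common formulation of coreset sampling. Whether the experiments used exactly this variant is an assumption, and the random vectors below merely stand in for CLIP embeddings.

```python
import numpy as np

def coreset_select(embeddings: np.ndarray, n_select: int, seed: int = 0) -> list[int]:
    """Greedy k-center: repeatedly pick the sample farthest from all picks so far."""
    rng = np.random.default_rng(seed)
    first = int(rng.integers(len(embeddings)))
    selected = [first]
    # Distance from every sample to its closest already-selected sample.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy data (hypothetical): a dense cluster, one outlier, one exact duplicate.
rng = np.random.default_rng(42)
cluster = rng.normal(scale=0.1, size=(200, 8))  # indices 0..199
outlier = np.full((1, 8), 10.0)                 # index 200
duplicate = cluster[:1].copy()                  # index 201, copy of index 0
embeddings = np.vstack([cluster, outlier, duplicate])

selected = coreset_select(embeddings, n_select=5)
print(200 in selected)  # True: the outlier is picked
```

In this toy setup the outlier is always among the first picks, while a sample and its duplicate are never both selected, because the second copy has zero distance to the selected set.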

Typicality, on the other hand, tries to find samples that lie in dense regions (that is, samples with many similar neighbors) while still keeping a distance between the selected samples. This approach selects neither outliers nor near-duplicates.
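The article does not spell out the exact Typicality algorithm, so the following is only a sketch of the idea under stated assumptions: typicality is taken as the inverse mean distance to the `k` nearest neighbours, and diversity is enforced by a hypothetical `min_dist` threshold between selected samples. All names and parameter values are illustrative.

```python
import numpy as np

def typicality_select(embeddings: np.ndarray, n_select: int,
                      k: int = 5, min_dist: float = 0.5) -> list[int]:
    """Pick dense-region samples first while keeping picks at least min_dist apart."""
    # Pairwise Euclidean distances via the Gram matrix (O(N^2) memory).
    sq_norms = (embeddings ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    dists = np.sqrt(np.clip(sq_dists, 0.0, None))
    # Assumed typicality score: inverse mean distance to the k nearest
    # neighbours (column 0 after sorting is the sample itself).
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    typicality = 1.0 / (knn.mean(axis=1) + 1e-12)

    selected: list[int] = []
    available = np.ones(len(embeddings), dtype=bool)
    for idx in np.argsort(-typicality):
        if len(selected) == n_select:
            break
        if available[idx]:
            selected.append(int(idx))
            # Block everything closer than min_dist, including near-duplicates.
            available &= dists[idx] >= min_dist
    return selected

# Toy data (hypothetical): a 2D cluster plus one far outlier at index 300.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(300, 2)), [[50.0, 50.0]]])
picked = typicality_select(points, n_select=10)
print(300 in picked)  # False: the outlier has the lowest typicality
```

Note that this sketch can return fewer than `n_select` samples if `min_dist` is too large for the data.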

Metrics and Evaluation

Consistency in evaluation is paramount. We adopted the following metrics:

  • FID (Frechet Inception Distance, Heusel et al., 2017): A widely accepted metric for evaluating generative models.
  • Precision & Recall (Kynkäänniemi et al., 2019): To assess model accuracy and its ability to capture data distribution.
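For reference, FID is the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated images: ||μ_a − μ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^½). The sketch below computes it with plain numpy for arbitrary feature arrays; it is not the evaluation code used in the experiments, and the function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (tiny negative or imaginary parts are numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Toy check with random vectors standing in for image embeddings:
# shifting every one of 4 dimensions by 3 changes only the mean term (4 * 3^2).
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 4))
print(round(frechet_distance(x, x + 3.0), 4))  # 36.0
```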

All models were evaluated on 10,000 sampled images. We report the FID values (mean and standard deviation) over two seeds to gauge the reliability of the results.
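The precision and recall metric of Kynkäänniemi et al. (2019) approximates each data manifold with k-nearest-neighbour balls around the embedded samples. A minimal numpy sketch is shown below; the helper names are illustrative, `k=3` is an assumed default, and the pairwise distance matrices make it suitable only for moderately sized sample sets.

```python
import numpy as np

def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between all rows of a and all rows of b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.clip(sq, 0.0, None))

def manifold_coverage(reference: np.ndarray, queries: np.ndarray, k: int = 3) -> float:
    """Fraction of queries inside the k-NN ball of at least one reference point."""
    ref_d = pairwise_dist(reference, reference)
    # Column 0 after sorting is the point itself, so column k is the k-th neighbour.
    radii = np.sort(ref_d, axis=1)[:, k]
    d = pairwise_dist(queries, reference)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real: np.ndarray, generated: np.ndarray, k: int = 3):
    precision = manifold_coverage(real, generated, k)  # generated samples that look real
    recall = manifold_coverage(generated, real, k)     # real samples the model covers
    return precision, recall

# Toy example: a "collapsed" generator that only reproduces one real sample.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
collapsed = real[0] + 1e-3 * rng.normal(size=(500, 4))
precision, recall = precision_recall(real, collapsed)
```

In the toy example the collapsed generator reaches high precision but very low recall, which is the kind of failure FID alone does not separate out.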


In total we trained 8 diffusion models (4 experiments with two seeds each). We first compare the metric results for the different subsampling methods.

Metrics-based Evaluation

We evaluate FID (lower is better), precision (higher is better) and recall (higher is better). The following plots show the mean and standard deviation of the results.

We report FID (mean + std) values for running experiments with two seeds. All models have been trained for 20k iterations with the exact same training parameters. Better models have lower FID, higher Precision and higher Recall.

Looking at the plots, you can see a clear correlation between lower FID and higher precision and recall values. This is expected, as all three metrics try to capture the quality of the generated data distribution. Interestingly, the Coreset method performs the worst, with the highest FID and the lowest precision and recall values. Our assumption that many edge cases disturb the training process seems to hold. We suggest further research in this direction to validate this claim and these preliminary results.

What is surprising is that the Typicality data selection method is able to outperform the random subsampling approach. At first thought, random should perform best, as it matches the training distribution of the full training set. However, these first results indicate that a balance between diversity and typicality can benefit the training process of the models.

We also put these experiments in perspective of training a model on the full dataset consisting of 11,647 training images.

We compare the metrics for models trained on the subsets with those for models trained on the full training set. The full training set consists of 11647 training images and each subset of 1000 training images.

We can conclude the following ranking of the models:

  • 1st place: Full dataset (11647 images)
  • 2nd place: The 1000 images typicality subsampled
  • 3rd place: The 1000 images randomly subsampled
  • 4th place: The 1000 images Coreset subsampled

Human Study Evaluation

To further assess the perceived quality of the different models, we also conducted a user study. The goal is to evaluate which subsampling method creates the best results based on human perception. Furthermore, we want to assess how well FID, precision, and recall metrics reflect the human rating.

Screenshot of the user interface used for the rating mechanism. The user can either use the mouse or keyboard shortcuts to indicate which image is preferred.

We followed recent papers such as the DALL-E 3 research paper (2023) and set up an evaluation pipeline with the following properties:

  • We sampled 9600 images per model (random, coreset, typicality).
  • We presented two random images from two different models to the user and asked them to vote either for one of the images if there was a clear preference, or for “not sure” if there was no clear preference.
  • We evaluate the win-rate of the different models to evaluate human preference.
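The win-rate computation described above can be sketched as follows. The vote format and the treatment of “not sure” votes (excluded from the denominator) are assumptions for illustration, and the votes in the example are made up, not data from the study.

```python
from collections import Counter

def win_rates(votes):
    """votes: iterable of (model_a, model_b, choice), where choice is
    model_a, model_b, or "not sure". Returns {(winner, loser): win rate}."""
    wins = Counter()
    decided = Counter()
    for model_a, model_b, choice in votes:
        if choice == "not sure":
            continue  # assumption: undecided votes do not count either way
        pair = tuple(sorted((model_a, model_b)))
        decided[pair] += 1
        wins[(pair, choice)] += 1
    return {
        (winner, loser): wins[(pair, winner)] / decided[pair]
        for pair in decided
        for winner, loser in (pair, pair[::-1])
    }

# Hypothetical votes for a single pair of models.
example_votes = (
    [("typicality", "random", "typicality")] * 3
    + [("random", "typicality", "random")]
    + [("typicality", "random", "not sure")]
)
rates = win_rates(example_votes)
print(rates[("typicality", "random")])  # 0.75
```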

We used a new web application solely developed for this sort of human evaluation called GenAIRater. We are still accepting additional votes under the following link :)

Results of our human evaluation. We present the win-rate of two methods compared against each other. For example, the subset sampled using the Coreset method wins 60% of the time against randomly subsampled training data. Typicality wins against both Coreset and random.

The results from the user study can be summarized in the following ranking for the different models:

  • 1st place: The 1000 images typicality subsampled
  • 2nd place: The 1000 images Coreset subsampled
  • 3rd place: The 1000 images randomly subsampled

Visual Comparison

When working with generative AI models, we can’t rely purely on metrics. Metrics help us assess model convergence and determine the diversity of the generated samples, but a final visual quality check on randomly sampled images is good practice. A user-based rating, where several participants rate different model outputs, helps to identify which model actually creates the best data.

In the following, we compare a random batch of 25 images sampled from the last checkpoint of each diffusion model training. For the full training dataset (11647 images), we sample the images after 70000 training steps. For the 1k training subsets, we sample after 20000 steps, as we noticed that the FID scores and the other metrics had converged by then.

Examples of generated images from the DDIM model. The images show a random batch of 25 images each sampled from the last checkpoint. The model trained on the full dataset was trained for 70000 steps whereas the model trained on the 1k subset has been trained for 20000 steps.
Examples of generated images from the DDIM model. The images show a random batch of 25 images each sampled from the last checkpoint. The models trained on the 1k subsets have been trained for 20000 steps.

After carefully inspecting the generated images, we can draw the following conclusions:

  • 1st place: Generated images from a model trained on the full dataset (11647 images) have the highest diversity and quality
  • 2nd place: The 1000 subsampled typicality subset has a good balance between keeping diversity and generating real humans
  • 3rd place: If we use 1000 randomly subsampled training images the sample diversity as well as the quality is slightly worse
  • 4th place: The 1000 subsampled Coreset tries to keep a higher diversity but fails to generate real humans


In this post, we looked at various ways to subsample datasets for training generative AI models. Based on preliminary results, there seems to be a similar trend for training diffusion models as with other deep learning methods. The old myth that “the more data you have, the better the model becomes” does not hold true.

Igor Susmelj,
Co-Founder Lightly
