Exploring the pivotal role of data curation in Generative AI: A deep dive into experiments with diffusion models. Discover the balance between data quality and model efficacy.
In the field of artificial intelligence, the emphasis has often been on amassing more data. But as generative AI models, especially in computer vision, gain prominence, the focus is shifting towards the quality of data over its quantity. The move towards self-supervised learning methods seemed to increase the need for volume. However, our research into data curation for generative AI suggests otherwise.
This article delves into the role of curated data in generative AI and aims to address a central question: How does data curation impact the optimization of generative AI models in computer vision?
Deep learning’s advancement is fundamentally tied to the data it consumes. Traditionally, vast data volumes were believed to optimize model performance, prompting a race to acquire and deploy as much data as possible. However, recent shifts towards self-supervised learning, as seen in foundation models such as CLIP (Radford et al., 2021) and LLaMA (Touvron et al., 2023), challenge this belief. These models are trained on vast corpora, comprising millions of images or billions of tokens, that go beyond human annotation capabilities.
Yet, a growing body of research suggests that sheer volume isn’t the sole key to success. Papers like LLaMA (Touvron et al., 2023), DINOv2 (Oquab et al., 2023), LLaVA (Liu et al., 2023), and the recent Emu (Dai et al., 2023) and MetaCLIP (Xu et al., 2023) all point to a consistent pattern:
Models can achieve superior performance when fine-tuned or trained from scratch on smaller but high-quality datasets.
For instance, the PixArt-α (Chen et al., 2023) model highlights that improved captions in image-text pair datasets notably enhance vision-language models.
Given this backdrop, our investigation centers on the impact of data curation methods on the training of generative AI models, specifically diffusion models. By scrutinizing data curation’s role in this domain, we aim to provide a more nuanced understanding of optimizing AI training.
We begin with experiments on data curation for generative AI models in the computer vision domain. More specifically, we try to answer the question of which data curation methods have the biggest impact on training high-quality diffusion models.
When we set out with the experiments we had the following hypothesis:
The efficacy of a generative model is heavily contingent on the data it’s fed. Data curation emerges as an essential process here, primarily for two reasons:
In this section we describe the experiments as well as the evaluation protocol we used.
We employed the Virtual Tryon Dataset for our experiments, consisting of 11,647 images. To ensure model compatibility and maintain data quality, we ran the images through a preprocessing step before training.
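A minimal sketch of such a preprocessing pass is shown below; the target resolution and the exact steps are illustrative assumptions rather than the precise pipeline we used:

```python
import os
from PIL import Image

def preprocess_images(src_dir: str, dst_dir: str, size: int = 128) -> None:
    """Convert images to RGB, resize them to a fixed resolution, and skip unreadable files."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        try:
            img = Image.open(os.path.join(src_dir, name)).convert("RGB")  # drop alpha / grayscale
        except OSError:
            continue                                    # skip corrupted files to keep data quality high
        img = img.resize((size, size), Image.BICUBIC)   # assumed target resolution for the diffusion model
        img.save(os.path.join(dst_dir, os.path.splitext(name)[0] + ".png"))
```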
Our experiments centered on the Denoising Diffusion Probabilistic Model (DDPM) sourced from this GitHub repository; the key training parameters follow a standard DDPM setup, roughly as sketched below.
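For orientation, such a DDPM training setup is typically described by a handful of parameters like the dictionary below; all values flagged as assumed are illustrative rather than the exact hyperparameters of our runs:

```python
# Illustrative DDPM training configuration; values marked "assumed" are not our exact settings.
ddpm_config = {
    "image_size": 128,           # input resolution after preprocessing (assumed)
    "timesteps": 1000,           # length of the diffusion noise schedule (assumed)
    "beta_schedule": "linear",   # noise schedule type (assumed)
    "batch_size": 64,            # (assumed)
    "learning_rate": 2e-4,       # Adam learning rate (assumed)
    "ema_decay": 0.995,          # EMA over model weights, common in DDPM training (assumed)
    "train_steps": 70_000,       # full-dataset runs; the 1k-subset runs were stopped at 20_000 steps
    "eval_samples": 10_000,      # images sampled per model for FID / precision / recall
}
```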
We leveraged CLIP ViT-B/32 for subsampling and DINOv2 ViT-L/14 embeddings for our metrics, providing a comprehensive evaluation framework for our generative outputs.
We use two different embedding models for subsampling and for metrics to keep the evaluation metrics more independent of the data subsampling method.
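A minimal sketch of how both sets of embeddings can be extracted, assuming the OpenAI clip package and the DINOv2 models published on torch.hub (the exact preprocessing in our pipeline may differ):

```python
import torch
import clip                                   # https://github.com/openai/CLIP
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/32 embeddings drive the data subsampling.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# DINOv2 ViT-L/14 embeddings are used for FID, precision, and recall.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()
dino_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),               # 224 is divisible by the ViT-L/14 patch size
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(image_path: str) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (CLIP embedding for subsampling, DINOv2 embedding for metrics) of one image."""
    img = Image.open(image_path).convert("RGB")
    clip_emb = clip_model.encode_image(clip_preprocess(img).unsqueeze(0).to(device))
    dino_emb = dinov2(dino_preprocess(img).unsqueeze(0).to(device))
    return clip_emb.squeeze(0).cpu(), dino_emb.squeeze(0).cpu()
```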
The big difference between Coreset and Typicality is that Coreset also includes outliers, since they lie far away from the cluster centers; it ignores the density of the data distribution. Coreset does not select nearby duplicates, as they would be too close to each other.
Typicality, on the other hand, tries to find samples that lie in dense regions (i.e., that have many similar samples) while still keeping a distance between the selected samples. This approach selects neither outliers nor nearby duplicates.
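To make this concrete, here is a simplified sketch of the two strategies on L2-normalized embeddings: greedy k-center selection for Coreset, and a nearest-neighbor density score with a de-duplication radius for Typicality. It illustrates the ideas rather than the exact implementation used in our experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def coreset_select(emb: np.ndarray, k: int) -> list[int]:
    """Greedy k-center selection: repeatedly pick the point farthest from the current
    selection. Outliers are far from everything, so they tend to be picked early, the
    density of the distribution is ignored, and near-duplicates are naturally avoided."""
    selected = [0]                                    # arbitrary starting point
    dist = np.linalg.norm(emb - emb[0], axis=1)
    while len(selected) < k:
        idx = int(dist.argmax())                      # farthest point from the selected set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[idx], axis=1))
    return selected

def typicality_select(emb: np.ndarray, k: int, n_neighbors: int = 20,
                      min_dist: float = 0.1) -> list[int]:
    """Prefer points in dense regions (small mean distance to their nearest neighbors)
    while skipping candidates that sit too close to an already selected point."""
    d = cdist(emb, emb)                               # O(n^2); fine for ~10k embeddings
    density = -np.sort(d, axis=1)[:, 1:n_neighbors + 1].mean(axis=1)  # higher = denser
    selected: list[int] = []
    for idx in np.argsort(-density):                  # most "typical" candidates first
        if all(d[idx, j] > min_dist for j in selected):
            selected.append(int(idx))
        if len(selected) == k:
            break
    return selected
```

Note how the Coreset loop only ever looks at the maximum distance to the selected set, which is exactly why isolated outliers get picked early, while the Typicality loop ranks candidates by local density before enforcing a minimum spacing.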
Consistency in evaluation is paramount. We adopted the following metrics, computed on the DINOv2 embeddings: FID (lower is better), precision (higher is better), and recall (higher is better).
All models were evaluated on 10,000 sampled images. The FID values (mean ± std) across two seeds are reported to assess model reliability.
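These metrics reduce to fairly compact computations on the DINOv2 embeddings. The sketch below is our own simplified code rather than the exact evaluation scripts: FID is the Fréchet distance between Gaussians fitted to real and generated embeddings, and precision/recall are estimated with k-nearest-neighbor manifolds in the spirit of Kynkäänniemi et al. (2019).

```python
import numpy as np
from scipy import linalg
from scipy.spatial.distance import cdist

def frechet_distance(real: np.ndarray, fake: np.ndarray) -> float:
    """FID-style Fréchet distance between two sets of embeddings (rows = samples)."""
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    cov_r, cov_f = np.cov(real, rowvar=False), np.cov(fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # strip tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

def manifold_coverage(query: np.ndarray, ref: np.ndarray, k: int = 3) -> float:
    """Fraction of query points inside at least one reference k-NN ball
    (precision/recall estimate in the spirit of Kynkäänniemi et al., 2019)."""
    radii = np.sort(cdist(ref, ref), axis=1)[:, k]   # distance to each point's k-th neighbor
    d = cdist(query, ref)                            # O(n^2); fine for 10k samples
    return float((d <= radii[None, :]).any(axis=1).mean())

# precision = manifold_coverage(fake_emb, real_emb)  # generated samples that look realistic
# recall    = manifold_coverage(real_emb, fake_emb)  # real modes covered by the generator
```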
In total, we trained 8 different diffusion models (4 experiments with two seeds each). We first compare the metric results for the different subsampling methods.
We evaluate FID (lower is better), precision (higher is better) and recall (higher is better). The following plots show the mean and standard deviation of the results.
Looking at the plots, you can see a clear correlation between lower FID and higher precision & recall values. This is expected, as all metrics try to capture the quality of the generated data distribution. Interestingly, the Coreset method performs worst, with the highest FID and the lowest precision & recall values. Our assumption that having many edge cases disturbs the training process seems to be valid. We suggest further research in this direction to validate this claim and these preliminary results.
What is surprising is that the Typicality data selection method is able to outperform the random subsampling approach. At first glance, random should perform best, as it matches the exact distribution of the full training set. However, these first results indicate that striking a balance between diversity and typicality can benefit the training process of the models.
We also put these experiments in perspective by training a model on the full dataset of 11,647 training images.
We can conclude the following ranking of the models:
To further assess the perceived quality of the different models, we also conducted a user study. The goal was to evaluate which subsampling method creates the best results based on human perception. Furthermore, we wanted to assess how well the FID, precision, and recall metrics reflect human ratings.
We followed recent papers like the DALL-E 3 research paper (2023) and set up an evaluation pipeline with the following properties:
We used GenAIRater, a new web application developed solely for this sort of human evaluation. We are still accepting additional votes under the following link :)
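Assuming the votes are pairwise preferences (outputs of two models shown side by side, and the rater picks the better one), a simple way to turn them into a ranking is a per-model win rate. The sketch below uses hypothetical model names and illustrates the idea rather than GenAIRater’s actual aggregation:

```python
from collections import defaultdict

def win_rates(votes: list[tuple[str, str, str]]) -> dict[str, float]:
    """Aggregate pairwise preference votes into a per-model win rate.
    Each vote is (model_a, model_b, winner), where winner is model_a or model_b."""
    wins: dict[str, int] = defaultdict(int)
    appearances: dict[str, int] = defaultdict(int)
    for model_a, model_b, winner in votes:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    return {m: wins[m] / appearances[m] for m in appearances}

# Hypothetical example votes comparing the different training setups.
votes = [
    ("typicality_1k", "random_1k", "typicality_1k"),
    ("coreset_1k", "full_dataset", "full_dataset"),
]
print(sorted(win_rates(votes).items(), key=lambda kv: -kv[1]))  # ranking by win rate
```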
The results from the user study can be summarized in the following ranking for the different models:
When working with generative AI models, we can’t rely purely on metrics. Metrics help us assess model convergence and determine the diversity of the generated samples. A final visual quality check on randomly sampled images is good practice. A user-based rating, where several participants rate different model outputs, helps identify which model actually creates the best data.
In the following, we compare a random batch of 25 images sampled from the last checkpoint of the diffusion model training. For the full training dataset (11,647 images), we sample the images after 70,000 training steps. For the 1k training subsets, we sample after 20,000 steps, as we noticed that FID and the other metrics had converged by then.
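For reference, sampling such a 5x5 grid from a trained checkpoint can look like the sketch below; it assumes a denoising_diffusion_pytorch-style API, which may not match the repository we actually used:

```python
import torch
from torchvision.utils import save_image
from denoising_diffusion_pytorch import Unet, GaussianDiffusion  # assumed library choice

model = Unet(dim=64, dim_mults=(1, 2, 4, 8))
diffusion = GaussianDiffusion(model, image_size=128, timesteps=1000)
# diffusion.load_state_dict(torch.load("last_checkpoint.pt"))    # restore trained weights here

with torch.no_grad():
    images = diffusion.sample(batch_size=25)     # (25, 3, 128, 128) tensor scaled to [0, 1]
save_image(images, "samples_grid.png", nrow=5)   # 5x5 grid for visual inspection
```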
After carefully inspecting the various generated images, we can draw the following conclusions:
In this post, we looked at various ways to subsample datasets for training generative AI models. Based on these preliminary results, diffusion models seem to follow a similar trend as other deep learning methods: the old myth that more data automatically yields a better model does not hold true.
Igor Susmelj,
Co-Founder Lightly