Data curation in computer vision lacks standardization, leaving many practitioners unsure how to do it correctly. We summarized some of the most common approaches.
This post has been written with data curation for computer vision in mind. However, several concepts can be applied to other data domains, such as NLP, audio, or tabular data.
What is Data Curation?
Data curation is a broad term widely used in the industry, especially in data-centric AI. It is important to understand what it covers in the machine learning context. In this post, data curation consists of the following components:
- Data cleaning and normalization — Process of removing “broken” samples or trying to correct them
- Data selection — Process of ranking the data based on importance for a particular task
Data Cleaning in Machine Learning
The easiest way to understand data cleaning for structured data is to think about tabular data. Imagine you’re working in a bank on a project where you want to analyze customers' spending based on their origin. Your data is in a CSV file, and you discover that the location information is unreliable. Some entries have typos, such as misspelled cities, and others are missing completely. You can now either clean the data by removing all the “broken” entries or try to correct the missing entries based on other available data. Open-source libraries such as fancyimpute and autoimpute exist for imputing missing values in tabular data.
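The two options described above (drop a broken entry or try to repair it) can be sketched with the Python standard library. This is a minimal illustration, not how fancyimpute or autoimpute work: the list `KNOWN_CITIES`, the example rows, and the similarity cutoff of 0.8 are all assumptions made for the example.

```python
import difflib

# Hypothetical reference list of valid city names (an assumption for this sketch).
KNOWN_CITIES = ["Zurich", "Geneva", "Basel"]

def clean_city(value):
    """Return a corrected city name, or None if the entry cannot be repaired."""
    if value is None or not value.strip():
        return None  # missing entry
    # Fuzzy-match against the reference list to repair small typos.
    matches = difflib.get_close_matches(
        value.strip().title(), KNOWN_CITIES, n=1, cutoff=0.8
    )
    return matches[0] if matches else None

rows = [
    {"customer": "a", "city": "Zurich"},
    {"customer": "b", "city": "Zurihc"},    # typo, repairable
    {"customer": "c", "city": ""},          # missing
    {"customer": "d", "city": "Atlantis"},  # not a known city
]

# Keep repaired rows, drop the ones that cannot be fixed.
cleaned = []
for row in rows:
    city = clean_city(row["city"])
    if city is not None:
        cleaned.append({**row, "city": city})
```

Whether to drop or repair depends on how much data you can afford to lose. An aggressive cutoff can silently map a valid but unlisted city to the wrong name, so it is worth reviewing the corrections.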
When working with unstructured data such as images, you can check for images where the camera failed. A hardware issue could result in a broken video frame, or the recorded data could be of poor quality (not enough light, blurry image).
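A simple way to flag such frames is to compute the mean brightness and the variance of a Laplacian filter response, a common sharpness heuristic. The sketch below works on a grayscale image given as a nested list of pixel values; the thresholds `min_brightness` and `min_sharpness` are hypothetical and would need tuning on your own data.

```python
from statistics import mean, pvariance

def brightness(img):
    """Mean pixel value of a grayscale image (nested list of 0..255 values)."""
    return mean(p for row in img for p in row)

def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian; low values suggest a blurry image."""
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)

def is_broken(img, min_brightness=30, min_sharpness=50):
    """Flag images that are too dark or too blurry (thresholds are assumptions)."""
    return brightness(img) < min_brightness or laplacian_variance(img) < min_sharpness

# Synthetic examples: a sharp checkerboard, a flat gray frame, and a dark frame.
sharp = [[255 * ((x + y) % 2) for x in range(6)] for y in range(6)]
flat = [[128] * 6 for _ in range(6)]
dark = [[5 * ((x + y) % 2) for x in range(6)] for y in range(6)]
```

Flagged images are candidates for review rather than automatic deletion, since some dark or low-contrast scenes are legitimate.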
With unstructured data, you often rely on supervised learning. You should therefore also consider removing “broken” labels from your data, or at least trying to correct them.
Data Selection (related to Active Learning)
As you might have heard before, not all data is equally important for your machine learning model to learn from. You don’t want to spend resources on data you don’t need.
When training machine learning models, you must ensure the data you use for training matches the data you expect once the system runs. This sounds simple, but in practice it is a huge issue. Think about developing a perception system for an autonomous delivery robot. You have a prototype robot to collect data from the city where you have your R&D. Your model will be trained on data from a single town but will eventually need to work across all kinds of cities around the globe. Different cities might have different architecture, environmental conditions, traffic signs, etc.
One could do a gradual rollout, deploy robots city by city and continuously collect more data to improve the perception system. But then, the initial cities contribute significantly more data than the last cities. How do you keep track of which data is new and useful and which is “redundant”?
What is a well-curated dataset?
First of all, there is no general perfect dataset per se. A dataset's value depends on the task you want to solve and other variables such as model architecture, training routine, and available computing power. Nevertheless, a well-curated dataset can help prevent running into any of the problems outlined in this guide.
Your dataset is ideally well-balanced. Your test set represents your model's deployment domain and is independent of the training data, so its metrics tell you how well the model generalizes. Your dataset covers the edge cases you care about, and the labels are correct.
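One concrete check for the independence of the test set is to look for files that appear in both splits. The sketch below compares content hashes; the dictionaries mapping file names to raw bytes are stand-ins for reading real files, and the file names are made up. It only catches exact duplicates; near duplicates (for example, consecutive video frames) would need an embedding-based comparison.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 digest of the raw file content."""
    return hashlib.sha256(data).hexdigest()

def find_leaks(train_files, test_files):
    """Return test file names whose bytes also occur in the training set."""
    train_hashes = {content_hash(b) for b in train_files.values()}
    return sorted(
        name for name, b in test_files.items() if content_hash(b) in train_hashes
    )

# Hypothetical in-memory stand-ins for image files.
train_files = {"a.png": b"\x01\x02\x03", "b.png": b"\x04\x05\x06"}
test_files = {"t1.png": b"\x07\x08\x09", "t2.png": b"\x04\x05\x06"}

leaks = find_leaks(train_files, test_files)
```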
Data Curation Starter Guide
I have created a starter guide to help you identify problems in your machine learning pipeline. Use it as a reference to learn about the most common data curation workflows.
Note that this cheat sheet does not cover all possible issues and should be used as a helper in case you don’t know where to start.
How to work with the data curation starter guide:
- Identify the model problem you're facing on the left side
- Follow the arrows to find potential data problems
- Pick the data curation workflow that solves the problem
Common Model Problems and their Solutions
Overview of the most common data problems and suggestions for resolving them.
1: My model has a high train set accuracy, but a low test set accuracy
There are different reasons for this to happen. First, you should exclude common training procedure mistakes that could result in your model overfitting the training data. Adding more augmentations or regularization methods such as L2 or weight norm could help to reduce the risk of overfitting, as outlined by Andrew Ng in this video.
There could also be other reasons for overfitting.
Once you know that it’s not a model training problem but rather a data problem, you should look into the data you have for the training and the test set. One possible reason is that your training data is not representative enough. It could be that some of the examples appear very infrequently, and the model cannot learn from them. In this case, a potential solution is to collect more data on the rare events your model is struggling with. Approaches like active learning have been proposed to tackle this problem in an automated and scalable way. BAAL, released in 2020, is an open-source library that implements such active learning methods.
2: Model fails on uncommon/rare cases
Typically, your model only performs well for some classes and situations. For example, rare classes are often neglected by common learning procedures. In specific applications, such as medical imaging, a rare class could be more important than the others. In this case, several solutions exist.
First, if your model performs badly for a specific class, check whether this class is underrepresented in your dataset. If it is a minority class, you should try to use weighted loss functions to account for this imbalance. It’s a simple trick that often yields promising results.
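As an illustration of a weighted loss, the sketch below derives inverse-frequency class weights (the same heuristic that scikit-learn calls "balanced") and applies them to a per-sample negative log-likelihood. The class names and counts are invented for the example.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

def weighted_nll(prob_of_true_class, label, weights):
    """Negative log-likelihood of one sample, scaled by its class weight."""
    return -weights[label] * math.log(prob_of_true_class)

# Hypothetical imbalanced dataset: 90 majority samples, 10 minority samples.
labels = ["healthy"] * 90 + ["tumor"] * 10
weights = balanced_class_weights(labels)

# The same prediction confidence costs more when the rare class is missed.
loss_rare = weighted_nll(0.5, "tumor", weights)
loss_common = weighted_nll(0.5, "healthy", weights)
```

Most deep learning frameworks accept such per-class weights directly in their cross-entropy loss, so in practice you rarely need to implement the scaling yourself.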
If the problem persists, you can start thinking about how to solve that problem from a data point of view. There are two options. We can improve the class balancing or try finding more edge cases.
Improve the Class Balancing
You want to change the ratio of the classes in your training datasets to make them more equally proportioned. Several methods exist to handle different scenarios. If you work with lots of data, you could start throwing out samples from the majority class (undersampling) to balance the classes.
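Random undersampling can be sketched in a few lines. The function below shrinks every class to the size of the smallest one; the sample ids and class names are placeholders, and the fixed seed is only there to make the example reproducible.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many items as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced

# Hypothetical dataset: sample ids 0..9, eight majority and two minority samples.
samples = list(range(10))
labels = ["car"] * 8 + ["police_car"] * 2
balanced = undersample(samples, labels)
class_counts = Counter(label for _, label in balanced)
```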
There is another approach if you work with small datasets or can’t afford to remove training data. You could try to collect more data and prioritize classes that have been underrepresented previously. But how can you achieve the latter? You can use model predictions on the unlabeled data to get an idea of the data distribution. Samples predicted as a majority class get a lower priority, and samples predicted as a rare class get a higher priority. The open-source library imbalanced-learn provides tools for working with imbalanced data.
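The prioritization idea can be sketched as follows: count how often each class is predicted on the unlabeled pool and rank samples so that rarely predicted classes come first. The scoring rule (one minus the predicted class frequency) is an assumption chosen for simplicity, and the predictions are made up.

```python
from collections import Counter

def rarity_order(predicted_labels):
    """Return sample indices sorted so that rarely predicted classes come first."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    # Priority is high when the predicted class is rare in the unlabeled pool.
    scores = [1 - counts[label] / total for label in predicted_labels]
    return sorted(range(total), key=lambda i: scores[i], reverse=True)

# Hypothetical model predictions on eight unlabeled images.
predictions = ["car", "car", "car", "police_car", "car", "bicycle", "car", "bicycle"]
order = rarity_order(predictions)
```

Keep in mind that a model that is weak on rare classes will also miss some of them in its predictions, so this ranking is a starting point rather than a guarantee.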
Finding more edge cases using active learning
If you know exactly what you are looking for (e.g., the model is bad at detecting police cars), you can rely on similarity search methods like SEALS. You can take reference images of these rare objects and use their embeddings as a search vector to find similar-looking images or objects across the unlabeled data. Be careful when using similarity search. Suppose you only use images of a specific type of police car or a certain angle. In that case, you will mostly find near duplicates of those images, which add little to your dataset. This can have a similar effect to just augmenting your initial dataset. Instead, you want similar police cars that look slightly different from the police cars you already have!
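The sketch below shows a plain brute-force cosine similarity search with an upper similarity bound to skip near duplicates. It is not an implementation of SEALS, only an illustration of the caveat above. The two-dimensional embeddings and the thresholds `min_sim` and `max_sim` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_but_new(query, candidates, min_sim=0.7, max_sim=0.98):
    """Return candidate ids that resemble the query but are not near duplicates."""
    hits = []
    for cid, emb in candidates.items():
        sim = cosine(query, emb)
        if min_sim <= sim < max_sim:
            hits.append((sim, cid))
    return [cid for _, cid in sorted(hits, reverse=True)]

# Hypothetical embeddings: one reference police car and four unlabeled images.
query = [1.0, 0.0]
candidates = {
    "near_duplicate": [1.0, 0.01],  # almost identical to the reference
    "side_view": [0.9, 0.3],        # same object, different angle
    "other_model": [0.8, 0.6],      # related but visibly different
    "tree": [0.0, 1.0],             # unrelated
}
selected = similar_but_new(query, candidates)
```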
Another more general approach is to use a combination of model predictions and embeddings for active learning. You can add new images to the training set by finding unlabeled data where you have objects that are difficult to classify or semantically very different from the existing training data. At Lightly, we saw great success in this approach as it can be automated and scaled to large datasets.
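One simple way to combine predictions and embeddings is to score each unlabeled sample by a weighted sum of prediction uncertainty (normalized entropy) and novelty (distance to the nearest training embedding). This is a generic heuristic written for illustration, not a description of a specific product or paper; the weight `alpha`, the sample data, and the assumption that embedding distances are on a comparable scale to the entropy term are all choices made for the example.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def novelty(emb, train_embs):
    """Euclidean distance to the closest training embedding."""
    return min(math.dist(emb, t) for t in train_embs)

def select(unlabeled, train_embs, k, alpha=0.5):
    """Pick the k samples with the highest mix of uncertainty and novelty."""
    scored = []
    for sid, (probs, emb) in unlabeled.items():
        uncertainty = entropy(probs) / math.log(len(probs))  # scaled to [0, 1]
        score = alpha * uncertainty + (1 - alpha) * novelty(emb, train_embs)
        scored.append((score, sid))
    scored.sort(reverse=True)
    return [sid for _, sid in scored[:k]]

# Hypothetical data: (class probabilities, embedding) per unlabeled sample.
train_embs = [(0.0, 0.0), (1.0, 0.0)]
unlabeled = {
    "easy_known": ([0.99, 0.01], (0.0, 0.1)),
    "ambiguous": ([0.5, 0.5], (0.9, 0.0)),
    "novel": ([0.9, 0.1], (0.5, 0.9)),
}
picked = select(unlabeled, train_embs, k=2)
```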
3: Model gets worse over time
You might have an increase in failure rates and a gut feeling that the model you deployed a few weeks back is not working as expected. This could be a “data drift” problem. The model needs to be updated with new training data. First, you should analyze the data your model sees in production and compare it to the data you used for training the model. Likely there is a clear difference in the distribution.
As a simple experiment, you could also train a simple classifier to predict whether a given image is part of the existing training data or the new production data. If the two data distributions matched fully, your classifier would not perform better than chance. However, if the classifier works well, the two domains differ. After analyzing the domain gap, you should update the training data. You could, for example, select a subset of the production data using diversity-based sampling followed by a new train/test split. You then add the respective train/test splits to the existing splits you used to train your deployed model.
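The domain classifier experiment can be sketched with a nearest-centroid classifier, used here as a stand-in for whatever simple model you prefer. The features are synthetic two-dimensional vectors drawn from Gaussians; in practice they would be image embeddings. All names and distribution parameters are assumptions for the example.

```python
import random

def centroid(points):
    """Mean vector of a list of feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def domain_gap_accuracy(train_feats, prod_feats, seed=0):
    """Held-out accuracy of a classifier separating training from production data."""
    rng = random.Random(seed)
    train_feats, prod_feats = train_feats[:], prod_feats[:]
    rng.shuffle(train_feats)
    rng.shuffle(prod_feats)
    half_t, half_p = len(train_feats) // 2, len(prod_feats) // 2
    # Fit one centroid per domain on the first half of each set.
    c_train = centroid(train_feats[:half_t])
    c_prod = centroid(prod_feats[:half_p])
    # Evaluate on the second half: domain 0 = training, domain 1 = production.
    held_out = [(f, 0) for f in train_feats[half_t:]]
    held_out += [(f, 1) for f in prod_feats[half_p:]]
    correct = 0
    for feat, domain in held_out:
        pred = 0 if sq_dist(feat, c_train) <= sq_dist(feat, c_prod) else 1
        correct += pred == domain
    return correct / len(held_out)

rng = random.Random(42)

def sample(mean, n):
    """Draw n synthetic 2-D feature vectors around the given mean."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]

train = sample(0.0, 400)
prod_same = sample(0.0, 400)     # production data from the same distribution
prod_shifted = sample(2.0, 400)  # production data after a drift

acc_same = domain_gap_accuracy(train, prod_same)
acc_shifted = domain_gap_accuracy(train, prod_shifted)
```

An accuracy close to 0.5 means the classifier cannot tell the two sets apart, while a value well above 0.5 indicates a domain gap worth investigating.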
The model would now be trained and evaluated using the following:
- train (initial) + train (new production data)
- test (initial), test (new production data)
Note that we recommend evaluating both test sets individually to ensure you can measure the metrics for the different distributions. This will also allow you to spot if the accuracy changed for the test (initial) set.
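Reporting the metric per test set takes only a few lines. In the sketch below, `predict` is a dummy threshold model and the two tiny test sets are invented; the point is only that the results are kept separate instead of being pooled.

```python
def accuracy(predict, dataset):
    """Fraction of (input, label) pairs the model predicts correctly."""
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

def evaluate_per_split(predict, test_sets):
    """Compute accuracy separately for every named test set."""
    return {name: accuracy(predict, data) for name, data in test_sets.items()}

# Dummy model and data, purely for illustration.
def predict(x):
    return x > 0

test_sets = {
    "test (initial)": [(1, True), (-1, False), (2, True), (-3, False)],
    "test (new production data)": [(1, True), (0.5, False), (-1, False), (-2, True)],
}
report = evaluate_per_split(predict, test_sets)
```

Pooling both sets here would give a single accuracy of 0.75 and hide that the model is perfect on the initial distribution but no better than chance on the new one.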
4: My model has high accuracy on the train and test set but then performed poorly after deployment
This is a widespread problem every ML model will face. When you start training and evaluating a model, you assume that the available training and test sets follow a distribution similar to the one your model will face in deployment. If that is not the case, you don’t know how well your model performs. For example, a model to detect traffic signs trained and evaluated solely on data from California might fail miserably when deployed in Europe, where traffic signs look different.
It’s crucial to ensure the model is trained and evaluated on data that matches its deployment environment as closely as possible. One way to prevent this issue is to continuously collect data and think carefully about data collection strategies to reduce any domain gap.
To spot potential generalization issues earlier, you can use a train/test split that divides the data not randomly but by group, for example by city, so that no city appears in both sets.
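A group-based split can be sketched as follows. The sample records, the `city` field, and the choice of held-out city are hypothetical.

```python
def split_by_group(samples, group_of, test_groups):
    """Put every sample whose group is in test_groups into the test set."""
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test

# Hypothetical image records tagged with the city they were recorded in.
samples = [
    {"image": "0001.jpg", "city": "Zurich"},
    {"image": "0002.jpg", "city": "Zurich"},
    {"image": "0003.jpg", "city": "Berlin"},
    {"image": "0004.jpg", "city": "Tokyo"},
    {"image": "0005.jpg", "city": "Berlin"},
]

train_set, test_set = split_by_group(samples, lambda s: s["city"], {"Berlin"})
train_cities = {s["city"] for s in train_set}
test_cities = {s["city"] for s in test_set}
```

Because the test set contains only cities the model has never seen, its accuracy is a closer estimate of how the model behaves when deployed in a new city.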
5: Train accuracy low and test accuracy low
If the train and test accuracy are low, you have a model that didn’t learn much. There can be various reasons why the model is not improving:
- The task is too difficult — based on the available training data and the model's capacity, it’s too difficult to solve this problem.
- The labels are wrong — even if the data itself is very valuable. Without having correct labels, the model can’t learn anything.
Is the task too difficult for my model?
A simple yet effective method in computer vision to determine whether a task might be too difficult to solve is the “could I do it myself” test. Given a few training images with labels and the task, could you correctly classify images from the test set?
If yes, you know that you could solve it and that the data might be good enough. Check if the model has enough training data to pick up the right signals.
Note: This works exceptionally well in computer vision, as humans are very good at pattern matching. If you work with other data types, this simple test might not work.
Dealing with bad labels
In supervised learning, the training data consists of samples (e.g., images) and the corresponding labels. Even if the samples in the dataset are representative and balanced, bad labels can result in a model learning nothing useful. It’s like teaching a kid in school to do math but with formulas and examples that are randomly arranged. Today you learn that 1+1=3 and tomorrow, that 1+1=5. If there is no consistent teaching pattern because the data is wrongly labeled, we have a problem.
If you suspect a label issue, you can do the following. Randomly pick a small set of samples (e.g., 100) from the train set and check them manually for labeling mistakes. Then do the same for the test set.
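The audit can be supported by two small helpers: one that draws a reproducible random sample of indices to review, and one that turns the number of mistakes found into an estimated error rate with a 95% Wilson score interval. The dataset size and the count of seven errors are made-up numbers for the example.

```python
import math
import random

def pick_audit_sample(num_samples, k=100, seed=0):
    """Draw k distinct sample indices to review manually."""
    return sorted(random.Random(seed).sample(range(num_samples), k))

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for the label error rate (z=1.96 for about 95%)."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical audit: 100 of 50,000 training samples reviewed, 7 mistakes found.
indices = pick_audit_sample(50_000, k=100)
low, high = wilson_interval(errors=7, n=100)
```

With only 100 reviewed samples the interval is wide, so the audit tells you whether labels are a serious problem, not the exact error rate.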
Having a few faulty labels is, unfortunately, very common. Even widely used academic datasets such as ImageNet and CIFAR contain label errors: Northcutt et al. (2021) estimate an error rate of roughly 6% for the ImageNet validation set and of a few percent on average across the test sets they studied. If the errors are not systematic and the error rate is not too high, they should not have a significant impact. But if you face systematic errors (e.g., all cats are labeled as dogs), you must correct them.
You might not want to relabel the whole train, validation, and test sets if you're working on large datasets. A more practical approach might be to correct only the validation and test sets, because having good evaluation datasets is crucial. For the training set, you can use methods such as co-teaching to train your model on noisy labels. Nowadays, many models are also pre-trained using self-supervised learning methods that don’t require labels, so the pre-training step cannot pick up systematic label errors in the training data. But you still want to evaluate these models properly and therefore need good validation and test sets.
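The core idea of co-teaching is that two networks are trained side by side, and in each mini-batch every network selects its small-loss samples, which are then used to update the other network. The sketch below shows only this selection and exchange step on lists of per-sample losses; it omits the networks and the training loop, and the fixed `forget_rate` and the loss values are assumptions (the paper varies the fraction of kept samples during training).

```python
def small_loss_indices(losses, forget_rate):
    """Indices of the samples with the smallest loss, dropping a forget_rate fraction."""
    keep = max(1, int(round(len(losses) * (1 - forget_rate))))
    return sorted(range(len(losses)), key=lambda i: losses[i])[:keep]

def co_teaching_exchange(losses_a, losses_b, forget_rate):
    """Return (indices to update network A with, indices to update network B with)."""
    picked_by_a = small_loss_indices(losses_a, forget_rate)
    picked_by_b = small_loss_indices(losses_b, forget_rate)
    # Each network is updated on the samples its peer considers clean.
    return picked_by_b, picked_by_a

# Hypothetical per-sample losses of both networks on one mini-batch of four samples.
losses_a = [0.1, 2.5, 0.3, 0.2]  # network A finds sample 1 suspicious
losses_b = [0.2, 0.4, 3.0, 0.1]  # network B finds sample 2 suspicious
update_a, update_b = co_teaching_exchange(losses_a, losses_b, forget_rate=0.25)
```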
I hope you liked this post. If you have any suggestions on how we can further improve this guide, please don’t hesitate to reach out or leave a comment!
Atighehchian et al. (2020), “Bayesian active learning for production, a systematic study and a reusable library”
Coleman et al. (2020), “Similarity Search for Efficient Active Learning and Search of Rare Concepts”
Haussmann et al. (2020), “Scalable Active Learning for Object Detection”
Northcutt et al. (2021), “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks”
Han et al. (2018), “Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels”