Data Selection for Computer Vision in 5 Steps

In its many forms, data acts as the learning material for machine learning models. Indeed, the performance of machine learning models depends on the quality of the data they are trained on. Therefore, selecting the best training dataset is as important as developing the model itself.

This blog post suggests five chronological steps to select data for computer vision tasks: (1) understanding collected data, (2) defining requirements for the training dataset, (3) sampling the best subset with diversity-based sampling and self-supervised learning, (4) improving the model iteratively with uncertainty-based sampling and active learning, and finally, (5) automating the data pipeline by continuously adding the right samples to the data repository.

1. Understanding collected data

The first step towards a high-quality training set is understanding the collected data available for training computer vision models. Although data attempts to represent the real world, the data collection process can introduce biases, as data might be collected at certain times of the day, by a single person, or with a specific purpose in mind. It is therefore important to uncover such biases and gain a better understanding of the large repository of images or video frames before selecting relevant samples for training. Below are a few examples of questions engineers could ask about the collected data:

  • What does the dataset’s distribution look like?
  • Does the dataset represent the real world?
  • Is the data redundant?
  • Is the dataset biased?
  • Which data is there too little/too much of?
  • Are all edge cases/corner cases sufficiently represented?

Such questions refer to common data challenges in the computer vision field including the long-tail distribution problem (where particular categories, classes, or situations are less represented with data) and the existence of false positives and false negatives (where model predictions or labels falsely relate to or fail to relate a sample to a certain object, class, or situation).
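
As a rough illustration of how the long-tail question can be checked, the sketch below counts how often each category occurs and flags those below a chosen share of the dataset. The scene tags, the function name, and the threshold are hypothetical and only serve the example.

```python
from collections import Counter

def find_underrepresented(labels, min_share=0.05):
    """Return each category whose share of the dataset is below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        category: count / total
        for category, count in counts.items()
        if count / total < min_share
    }

# Hypothetical scene tags for ten collected frames.
scene_tags = ["highway"] * 6 + ["city"] * 3 + ["snow"] * 1
rare = find_underrepresented(scene_tags, min_share=0.2)
print(rare)  # {'snow': 0.1}
```

In practice such tags are often unavailable for raw data, which is why the embedding-based techniques described next are useful.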

For computer vision tasks, data types include images or video footage. Such data is considered unstructured, as it has no predetermined structure. Indeed, anything can appear anywhere on an image or video frame. This is why such data is challenging to analyze, and why a large number of samples can give the impression of being a black box.

One solution to tackling raw and mysterious data is unsupervised learning. This method does not require any labels or further information about the images other than their mapping in a multi-dimensional vector space. Lightly’s tool uses such techniques: it identifies clusters of similar data by calculating the distance between their vectors. It also allows for a simpler visualization and understanding of the dataset by reducing the dimensionality to 2D scatter plots (several methods can be used, such as UMAP, t-SNE, or PCA). This is illustrated in the image below with the Comma10k dataset.


2D representation of the Comma10k dataset on the Lightly platform
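
To make the idea of comparing vectors concrete, here is a minimal sketch that flags near-duplicate samples by their Euclidean distance in the embedding space. It is not Lightly's implementation: the 2D embeddings, the function name, and the threshold are assumptions chosen for illustration, and real embeddings would come from a trained model and have many more dimensions.

```python
import math
from itertools import combinations

def near_duplicate_pairs(embeddings, threshold):
    """Return index pairs whose Euclidean distance is below threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        if math.dist(a, b) < threshold:
            pairs.append((i, j))
    return pairs

# Hypothetical 2D embeddings of five images.
embeddings = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.02), (3.0, 0.0)]
pairs = near_duplicate_pairs(embeddings, threshold=0.1)
print(pairs)  # [(0, 1), (2, 3)]
```

Comparing all pairs is quadratic in the number of samples, so a large repository would need an approximate nearest-neighbor index instead.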


In order to tackle the long-tail problem, Lightly’s similarity search function allows the manual identification of similar under-represented edge or corner cases. Below, similar images where windshield wipers block some of the image’s content have been spotted. 


Edge-case mining with Lightly’s similarity search function on the Comma10k dataset
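
A similarity search of this kind can be sketched as a nearest-neighbor lookup: given the embedding of one edge case, return the samples whose embeddings point in the most similar direction. The example below uses cosine similarity on hypothetical 2D embeddings and assumes no embedding is the zero vector. It illustrates the general technique, not the function Lightly provides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_index, embeddings, k=2):
    """Return the indices of the k samples most similar to the query sample."""
    query = embeddings[query_index]
    scored = [
        (cosine_similarity(query, emb), i)
        for i, emb in enumerate(embeddings)
        if i != query_index
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical embeddings; index 0 stands for a frame partly blocked by wipers.
embeddings = [
    (1.0, 0.0),
    (0.9, 0.1),
    (0.0, 1.0),
    (0.8, 0.3),
    (-1.0, 0.0),
]
print(most_similar(0, embeddings, k=2))  # [1, 3]
```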


2. Defining requirements for the training dataset

Before sampling a subset of the collected data for training, the requirements for that subset should be set. One can define what the ideal training dataset looks like in terms of metadata distribution and size, for instance. Many other metrics can be relevant when defining the ideal training dataset, such as data diversity or model certainty. These will be discussed in later sections.

The dataset’s metadata distribution refers to parameters which describe the content of the image (such as luminance and color channels) or the conditions in which the image was captured (time, temperature, and location). The primary task is to define what the dataset should ideally represent according to these and in which proportions. Data querying based on metadata information is one sampling technique that can be employed for data selection. Other methods include diversity-based and uncertainty-based sampling, two approaches which will be discussed in the following steps. Below are some questions which could be asked in order to uncover the subset’s ideal distribution of metadata:

  • What does the real world look like?
  • Should the dataset exactly represent the real world?
  • Which samples with certain metadata values are over-/under-represented? 
  • Which proportions of samples should represent different metadata values?

Setting target ratios for the metadata values is an approach to define an ideal subset. Below, histograms of certain parameters can be observed. For instance, one could aim for the different values of luminance to be more evenly represented in the dataset in terms of the number of samples. These histograms enable the querying of a subset based on metadata ratios. The dataset can also be adjusted according to these parameters by altering their minimum and maximum values.

Histograms of metadata on the Lightly platform
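
The idea of target ratios can be sketched in a few lines: group the samples into luminance bins, then draw from each bin according to its target share. The bin cut-offs, the ratios, and the function names below are hypothetical. The sketch assumes luminance values normalized to the range 0 to 1.

```python
import random

def bin_luminance(value):
    """Assign a luminance value in [0, 1] to a coarse bin (hypothetical cut-offs)."""
    if value < 0.33:
        return "dark"
    if value < 0.66:
        return "medium"
    return "bright"

def sample_by_target_ratio(luminances, target_ratios, subset_size, seed=0):
    """Pick sample indices so that each bin approximates its target ratio."""
    rng = random.Random(seed)
    by_bin = {}
    for index, value in enumerate(luminances):
        by_bin.setdefault(bin_luminance(value), []).append(index)
    selected = []
    for bin_name, ratio in target_ratios.items():
        quota = round(ratio * subset_size)
        candidates = by_bin.get(bin_name, [])
        # A bin with too few samples contributes everything it has.
        selected.extend(rng.sample(candidates, min(quota, len(candidates))))
    return sorted(selected)

# Hypothetical luminance values: dark images are rare in the collected data.
luminances = [0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
targets = {"dark": 0.5, "medium": 0.25, "bright": 0.25}
subset = sample_by_target_ratio(luminances, targets, subset_size=4)
print(subset)
```

Here the two dark images are both kept, while only one medium and one bright image are drawn, so the subset over-samples the rare dark condition relative to the collected data.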


Quality over quantity also applies to datasets for model training. Indeed, a smaller representative dataset is more valuable than a large redundant one. The desired size of the selected dataset can depend on various parameters. Here are a few questions worth asking:

  • Which resources do I have to label the data?
  • How complex is the model’s task?
  • How long does it take to process the data?

As the subset is typically labeled before being sent to model training, the cost and time that the labeling process incurs play an essential role. The costs can be significant, hence the importance of reducing the amount of data and selecting the right data to be labeled. Moreover, the complexity of the task being taught to the computer can influence the required dataset size, as a complex task entails numerous types of scenarios which must be sufficiently represented within the dataset. For example, a task in an uncontrolled (e.g. outdoors) environment, compared with the same task in a controlled (e.g. indoors) environment, produces a variety of additional situations, most notably influenced by weather or lighting conditions. A further parameter can be processing time. As datasets grow in size, the time needed for model training increases as well. This, in turn, delays production. Thus, while selecting a large dataset suggests that more cases and scenarios are covered, it also lengthens processing time, which can be more harmful and costly.

In sum, defining the requirements for a subset depends on a variety of parameters that are very specific to the task, industry, company size, models used, and resources. It is thus an individual yet important factor to consider for data selection.


3. Sampling the best subset: diversity-based sampling with self-supervised learning

Once one has gained insight into the similarity and metadata distribution of the raw collected data and defined the requirements for a training dataset, selecting the relevant samples becomes the next challenge. Listed below are a few simple data selection methods that are commonly employed:

  • Selection of frames based on metadata
  • Random sampling techniques 
  • Manual sampling

However, these selection methods are far from ideal. First, selection based on metadata such as time, where frames are sampled at a constant time interval, is a method that runs the risk of omitting interesting samples. Second, although random sampling seems like a reliable method, as statistically the distribution of the larger dataset should be reflected, it assumes that the collected data perfectly represents the real world, which is seldom the case. Finally, manual sampling methods cannot escape human error. Indeed, only machines are capable of making an accurate holistic selection.
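
The two automated baselines are easy to sketch, which also makes their limitation visible. In the hypothetical example below, a short event spans three frames of a 100-frame video. Sampling every tenth frame misses it entirely, and a random draw of the same size includes it only by chance.

```python
import random

frames = list(range(100))     # hypothetical frame indices from one video
event_frames = {41, 42, 43}   # hypothetical short event, e.g. a pedestrian crossing

# Metadata-based selection: one frame at a constant interval.
every_tenth = frames[::10]

# Random selection of the same size, seeded for reproducibility.
rng = random.Random(0)
random_subset = sorted(rng.sample(frames, 10))

print(every_tenth)                      # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
print(event_frames & set(every_tenth))  # set()
print(random_subset)
```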

Diversity-based sampling, a sampling technique based on the distance between the samples in an embedding space, is a more comprehensive and efficient approach to data selection. Its objective is to choose a diverse dataset which covers all possible scenarios, to ensure that the model is trained even on edge or corner cases. Self-supervised learning can offer a solution for this data selection task. As described previously, it has the powerful ability to identify similar images. Thereafter, samples from the different clusters can be chosen using a diversifying algorithm, thereby ensuring the representation of different types of images or video frames. Lightly uses such techniques. Engineers can choose the number of samples to be selected, and the algorithm will identify the most diverse samples according to the distances between the data’s vectors. To illustrate this, in the image below, the 750 most diverse images from the Comma10k dataset have been selected by Lightly’s algorithm, appearing as the green points.


Data selection of the 750 most representative images from the Comma10k dataset on the Lightly platform
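
One common way to select diverse samples from embeddings is greedy farthest-point sampling, sketched below. This is an assumption made for illustration and is not necessarily the algorithm Lightly uses. At each step it picks the sample that lies farthest from everything selected so far, so dense clusters contribute few samples and isolated points are picked early.

```python
import math

def farthest_point_sampling(embeddings, num_samples, start_index=0):
    """Greedily pick samples that maximize the distance to those already picked."""
    selected = [start_index]
    # Distance from every sample to its closest selected sample.
    min_dist = [math.dist(e, embeddings[start_index]) for e in embeddings]
    while len(selected) < min(num_samples, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in selected]
        next_index = max(remaining, key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[next_index]))
    return selected

# Hypothetical 2D embeddings.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),  # dense cluster
    (5.0, 5.0), (5.1, 5.0),              # second cluster
    (10.0, 0.0),                         # isolated edge case
]
print(farthest_point_sampling(embeddings, num_samples=3))  # [0, 5, 3]
```

With three samples requested, the result contains one sample from each cluster plus the isolated point, rather than three near-identical images from the dense cluster.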


As mentioned in the previous section, sampling can also be done based on metadata. In the first image below, a subset of bright samples was created with Lightly’s metadata tool, which allows the minimum and maximum luminance values to be altered. Here, higher values were aimed for, so brighter images were selected. (Note how a few samples from the left cluster were still included in this selection. This is an interesting finding that could raise questions about their metadata values, labels, or ambiguous content.) In the second image below, a diversity-based sampling algorithm, such as the one previously described, was applied to these brighter samples. This resulted in a diverse subset of bright images.


Step 1: subset of bright samples from the Comma10k dataset selected using Lightly’s metadata tool

Step 2: data selection of the most representative bright images from the Comma10k dataset on the Lightly platform


Selecting a diverse training subset removes redundant samples and over-represented data, which can cause overfitting, and can thereby improve model accuracy. The plot below shows the positive effects of Lightly’s diversity-based sampling algorithms on model accuracy for the KITTI public dataset.


Lightly’s sampling strategy performance in comparison to others with the KITTI dataset


4. Improving the model iteratively: uncertainty-based sampling with active learning

It is usually observed that with some samples the model shows good results while with others it struggles. The following list describes some of the reasons for poor-performing data:

  • Flawed requirements for the training subset
  • Unforeseen biases in the collected data
  • Edge cases/Corner cases insufficiently represented in the initial data repository

The model performs poorly on data when it makes false predictions about it. For instance, false positives and false negatives refer, respectively, to situations in which the model falsely assumes that the image includes an object and situations in which the model assumes that the image does not include that object when in reality it does.
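
For a single object class, the two error types can be counted by comparing predictions with ground-truth labels, as in this small sketch. The data and names are hypothetical.

```python
def count_errors(ground_truth, predictions):
    """Return (false positives, false negatives) for binary labels."""
    false_positives = sum(
        1 for truth, pred in zip(ground_truth, predictions) if pred and not truth
    )
    false_negatives = sum(
        1 for truth, pred in zip(ground_truth, predictions) if truth and not pred
    )
    return false_positives, false_negatives

# Hypothetical labels: does the image contain a pedestrian?
contains_pedestrian = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(count_errors(contains_pedestrian, predicted))  # (1, 1)
```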

Identifying false negatives and false positives is a task which can also be undertaken using the embedding visualization tool. In the first image below, the samples have been color-coded according to the luminance metadata values. While most darker images appear on the right (with a lower luminance value), a few lighter samples still appear in this cluster (suggesting false positives). This can also be done with model predictions or labels by observing the positioning of certain labeled samples in comparison to others. In the second image below, the samples of a labeled class from another dataset appear as green dots. Although most green points are grouped together, some appear further away (suggesting false positives), and while most green clusters are filled with samples of the same class, one cluster contains a gray gap (suggesting false negatives). One can then investigate why samples with such metadata values appear in these places and question the attributed class labels or predictions.


False positive and false negative spotting with metadata distribution


False positive and false negative spotting with class labels
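
The visual check described above can be approximated programmatically: a sample whose label disagrees with the majority label of its nearest neighbors in the embedding space is a candidate for review. The sketch below is one simple heuristic under that assumption, with hypothetical embeddings and labels. It flags candidates only and does not decide whether a label is actually wrong.

```python
import math
from collections import Counter

def suspicious_labels(embeddings, labels, k=3):
    """Return indices whose label differs from the majority of their k nearest neighbors."""
    flagged = []
    for i, emb in enumerate(embeddings):
        neighbors = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(emb, embeddings[j]),
        )[:k]
        majority, _ = Counter(labels[j] for j in neighbors).most_common(1)[0]
        if labels[i] != majority:
            flagged.append(i)
    return flagged

# Hypothetical 2D embeddings forming two clusters of four samples each.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
labels = ["car", "car", "car", "truck", "truck", "truck", "truck", "car"]
print(suspicious_labels(embeddings, labels))  # [3, 7]
```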


In reference to the metaphor in the introduction, the model should be taught with challenging learning material, i.e. the data it struggles with, from which it can improve. The concept of uncertainty-based sampling performs exactly this: algorithms ensure that the data where the model is uncertain is added to the training set. This creates a type of feedback loop, called active learning, where the output of the model, i.e. its performance, gives recommendations regarding its inputs, i.e. its training set. 
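
A minimal sketch of uncertainty-based sampling for a classifier follows. It scores each sample by the entropy of its predicted class probabilities and returns the highest-scoring ones. Entropy is only one of several common uncertainty measures (least confidence and margin are others), and the predictions here are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy of a probability distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def most_uncertain(predictions, num_samples):
    """Return the indices of the samples with the highest prediction entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:num_samples]

# Hypothetical softmax outputs for four unlabeled images and three classes.
predictions = [
    (0.98, 0.01, 0.01),  # confident
    (0.40, 0.35, 0.25),  # uncertain
    (0.70, 0.20, 0.10),
    (0.34, 0.33, 0.33),  # almost uniform, most uncertain
]
print(most_uncertain(predictions, num_samples=2))  # [3, 1]
```

In an active learning loop, the selected samples would be labeled, added to the training set, and the model retrained before the next round of scoring.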

Active learning is a feature that Lightly has incorporated into its core data curation product. The collected active-learning scores can also be projected onto the scatter plot to identify the uncertain samples. Below, the average precision of the active learning algorithm is compared to that of a random selection algorithm for the KITTI public dataset.


Lightly’s active learning algorithm in comparison to a random selection algorithm for the KITTI dataset


In combination, metadata-based, diversity-based, and uncertainty-based sampling allow for a well-balanced selection of data. Together, they enable uncertain data to be included within the training set while ensuring that similar samples are left out and metadata targets are met. As a consequence, model performance can improve considerably. You can read more about how Lightly implements this approach in another blog post about active learning. In contrast to manual similarity searches, also referred to as edge case mining, this process can be entirely automated, thereby making it scalable for larger datasets. This is covered in the next section.
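
One simple way to combine the three criteria is sketched below: filter by a metadata range, rank the remaining samples by uncertainty, and skip any sample that lies too close to one already selected. This is a hypothetical illustration of the idea, not the method Lightly implements. The field names, thresholds, and data are assumptions.

```python
import math

def select_samples(samples, luminance_range, min_distance, num_samples):
    """Select uncertain, mutually distant samples within a luminance range."""
    low, high = luminance_range
    # 1. Metadata: keep only samples inside the luminance range.
    candidates = [s for s in samples if low <= s["luminance"] <= high]
    # 2. Uncertainty: consider the most uncertain samples first.
    candidates.sort(key=lambda s: s["uncertainty"], reverse=True)
    selected = []
    for sample in candidates:
        if len(selected) >= num_samples:
            break
        # 3. Diversity: skip samples too close to one already selected.
        if all(
            math.dist(sample["embedding"], s["embedding"]) >= min_distance
            for s in selected
        ):
            selected.append(sample)
    return [s["name"] for s in selected]

# Hypothetical samples with precomputed embeddings, metadata, and scores.
samples = [
    {"name": "a", "embedding": (0.0, 0.0), "luminance": 0.8, "uncertainty": 0.9},
    {"name": "b", "embedding": (0.1, 0.0), "luminance": 0.7, "uncertainty": 0.8},
    {"name": "c", "embedding": (4.0, 4.0), "luminance": 0.9, "uncertainty": 0.5},
    {"name": "d", "embedding": (8.0, 0.0), "luminance": 0.2, "uncertainty": 0.95},
    {"name": "e", "embedding": (2.0, 6.0), "luminance": 0.6, "uncertainty": 0.3},
]
chosen = select_samples(
    samples, luminance_range=(0.5, 1.0), min_distance=1.0, num_samples=3
)
print(chosen)  # ['a', 'c', 'e']
```

Sample "d" is the most uncertain but falls outside the luminance range, and "b" is dropped because it is nearly identical to "a".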


5. Automating the data pipeline by continuously adding the right samples to the data repository

As computer vision tasks are performed over extended time periods, it is essential to maintain high model performance by updating the training dataset accordingly. Again, in reference to the metaphor, if models are taught with outdated learning material, their performance in the world today is inferior. This could happen in the case of computer vision models for the following reasons: 


  • The model is trained on outdated products, behaviors, environments, laws, etc. which no longer occur
  • New products, behaviors, environments, laws, etc. exist which provoke new situations that the model does not recognize


Thus, collecting more “modern” data and intelligently updating the training set by adding the right samples becomes necessary. As new data collection processes are carried out, the need arises for an automated data pipeline, as pictured below.  

Lightly’s automated data pipeline


Lightly enables data to be automatically added to the data repository and selected based on metadata, diversity, and uncertainty. With more data being collected for computer vision tasks, the feedback loop created with the active learning feature is essential. This way, data selection becomes a continuous yet seamless background task. Moreover, data preparation tools, such as labeling or model training platforms, can be connected through an API, allowing data to flow freely between the steps. You can read more about this in next week’s article. Automation is crucial for scaling machine learning pipelines. Indeed, manual workflows such as edge case mining work well for smaller datasets, but as more data is collected, the need for an automated workflow grows.
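
The core of such a continuous update can be sketched as a novelty filter: from each newly collected batch, keep only the samples that differ sufficiently from what the training set already contains. The function below is a hypothetical, simplified illustration that uses embedding distance alone. A full pipeline would also take metadata targets and model uncertainty into account.

```python
import math

def select_novel(training_embeddings, new_embeddings, min_distance):
    """Return indices of new samples farther than min_distance from all kept samples."""
    kept = list(training_embeddings)
    novel = []
    for index, emb in enumerate(new_embeddings):
        if all(math.dist(emb, other) > min_distance for other in kept):
            novel.append(index)
            kept.append(emb)  # later samples are also compared to this one
    return novel

# Hypothetical embeddings of the current training set and a newly collected batch.
training_set = [(0.0, 0.0), (1.0, 1.0)]
new_batch = [(0.05, 0.0), (3.0, 3.0), (3.02, 3.0), (1.0, 5.0)]
print(select_novel(training_set, new_batch, min_distance=0.5))  # [1, 3]
```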



Five important steps to data selection for computer vision tasks were described in this blog post. In sum, the steps highlight that an understanding of the collected data as well as the determination of requirements for the training dataset should precede the intelligent data selection process of the best subset. Data selection techniques include metadata-based sampling, diversity-based sampling, and uncertainty-based sampling. Finally, the importance of an automated data pipeline was presented.


If you want to learn more about Lightly’s tool, feel free to contact the team directly at sales@lightly.ai or head over to the website and reach out through the contact form.



Author: Sarah Meibom


Thank you to Matthias Heller and the Lightly team for reading drafts of this blog post.
