What is redundant data and why should you avoid it? This article tackles these questions in the context of computer vision, using concrete examples. It shows that data redundancy can hurt model accuracy and waste resources.
Data redundancy often refers to the same piece of data being stored in two different places within a database. This can be accidental or deliberate, and its effects vary accordingly (Techopedia, 2020).
However, this article focuses on a slightly different understanding of data redundancy. We refer to redundant samples as near-duplicates within the same dataset: data so similar to other samples that it adds little value to the dataset. Further, data redundancy is assessed in the context of the other samples in the dataset (The 10% you don’t need). Whether a sample is “redundant” therefore depends largely on the diversity of the dataset's content and on the task the data is used for.
One form of redundant data is semantic redundancy, which was described in a previous blog post. In this case, when the samples are represented in a vector space, the distances that separate them are minimal. In other words, the samples are visually very similar, as illustrated below.
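The idea of minimal distances in a vector space can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for image embeddings, the helper names `cosine_distance` and `find_near_duplicates` are hypothetical, and the threshold of 0.05 is arbitrary and would need tuning for a real dataset and embedding model.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.05):
    """Return index pairs whose cosine distance is below the threshold."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_distance(embeddings[i], embeddings[j]) < threshold:
            pairs.append((i, j))
    return pairs


# Toy embeddings (hypothetical values, not produced by a real model).
embeddings = [
    [1.0, 0.0, 0.0],    # image 0
    [0.99, 0.05, 0.0],  # image 1: points in almost the same direction as image 0
    [0.0, 1.0, 0.0],    # image 2: clearly different
]

near_duplicates = find_near_duplicates(embeddings)
print(near_duplicates)  # [(0, 1)]
```

Comparing every pair scales quadratically with the number of samples, so this brute-force version is only practical for small datasets.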
Other examples of data redundancy include scene similarity between samples, similar weather conditions, and repeated depictions of the same object. The two images below show both scene similarity, as the windscreen wipers appear in both images, and weather similarity, as both images were taken on rainy days.
Data redundancy is a challenge that engineers often face when working with video data for machine learning tasks. Videos are broken down into individual frames, which can then be treated as images. However, unlike individual pictures, which are taken deliberately at specific moments, video cameras record continuously and capture many moments with little variance. To illustrate this, think of a car recording video footage of streets to train an autonomous vehicle. When the car stops behind another vehicle at a red light, or when it is driving alone on a straight highway, the captured frames are nearly identical.
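A minimal sketch of how such nearly identical frames could be skipped is shown below. It assumes each frame is a flat list of grayscale pixel intensities and keeps a frame only when it differs enough from the last frame that was kept. The function names, the tiny four-pixel frames, and the `min_diff` value of 10 are all hypothetical choices for illustration, not a method taken from this article.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_distinct_frames(frames, min_diff=10.0):
    """Return indices of frames that differ enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for idx in range(1, len(frames)):
        if mean_abs_diff(frames[idx], frames[kept[-1]]) >= min_diff:
            kept.append(idx)
    return kept


# Toy frames with four pixels each (made-up values).
frames = [
    [10, 10, 10, 10],    # car waiting at a red light
    [10, 11, 10, 10],    # next frame, almost unchanged
    [11, 10, 10, 9],     # still almost unchanged
    [200, 180, 90, 40],  # the scene changes once the car moves on
    [201, 180, 91, 40],  # nearly identical to the previous frame
]

kept_indices = select_distinct_frames(frames)
print(kept_indices)  # [0, 3]
```

Raw pixel differences are sensitive to lighting changes and camera shake, so comparing learned embeddings instead of pixels may be more robust in practice.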
The paper The 10% you don’t need shows that several public datasets, such as CIFAR-10 and ImageNet, contain at least 10% redundant samples. If the distribution of classes within a dataset is skewed, or if an excessive proportion of the dataset is redundant, imbalances can result, which in turn can cause bias. The next section of this blog post touches upon this by presenting reasons to avoid data redundancy.
Redundant images or video frames can have the following effects:
Model performance is highly influenced by the data the model is trained on. If you feed a model redundant data, it will perform well in those specific situations but will lack experience with others. This is because some types of data are over-represented while others are under-represented, so training a model on redundant samples harms its generalization and accuracy. Optimized data selection is therefore crucial to reduce redundancy as much as possible. In the figure below, the model’s average precision obtained with standard sampling strategies (labeled “other”) is compared with the average precision obtained with Lightly’s diversity-based sampling algorithm, which actively excludes redundant samples. The graph shows that removing similar data can improve average precision, and thus model accuracy.
In another experiment conducted by Lightly, the best test accuracy was obtained by training the model on 90% of the data selected by Lightly, compared with training on 100% of the dataset or on 90% of the data selected by random sampling. Read more about this in a previous blog post.
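To give a feel for what diversity-based selection means, here is a minimal sketch of greedy farthest-point sampling, a common generic technique for picking a diverse subset. It is not Lightly's algorithm, which this article does not describe. The function names and the two-dimensional toy embeddings are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_sampling(embeddings, k, start=0):
    """Greedily pick k indices so each new pick is far from all earlier picks."""
    if k <= 0 or not embeddings:
        return []
    k = min(k, len(embeddings))
    selected = [start]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = [euclidean(e, embeddings[start]) for e in embeddings]
    while len(selected) < k:
        next_idx = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(next_idx)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], euclidean(e, embeddings[next_idx]))
    return selected


# Toy 2-D embeddings (made-up values) forming three loose groups.
embeddings = [
    [0.0, 0.0],  # 0
    [0.1, 0.0],  # 1: near-duplicate of 0
    [0.0, 0.1],  # 2: near-duplicate of 0
    [5.0, 5.0],  # 3
    [5.1, 5.0],  # 4: near-duplicate of 3
    [0.0, 5.0],  # 5
]

selected = farthest_point_sampling(embeddings, k=3)
print(selected)  # [0, 4, 5]
```

With a budget of three samples, the sketch picks one sample from each group and skips the near-duplicates, whereas random sampling could easily spend the whole budget on a single group.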
Precious resources such as time and money are wasted on redundant data. First, a lot of time and money is spent on labeling datasets, so labeling unnecessary samples is inefficient and costly. Second, compute and processing resources can be wasted on redundant data. More generally, as raw data moves through the typical machine learning pipeline (pictured below), data-related tasks become more time-consuming and expensive when datasets are inflated by redundant samples.
This blog post has outlined the challenge of data redundancy in the computer vision field by (1) defining the concept and (2) arguing why it should be avoided. You can read more about how Lightly helped AI Retailer System remove redundant data of customers in stores here.
Author: Sarah Meibom
Thank you to the Lightly team for reading drafts of this blog post.