This blog was published on https://towardsdatascience.com/the-data-you-don-t-need-removing-redundant-samples-6bfd07c1516c
In ML, there is a saying: garbage in, garbage out. But what does it really mean to have good or bad data? In this post, we will explore data redundancies in the training set of fashion-MNIST and how they affect test set accuracy.
We leave the more detailed explanation for a future post, but let’s start with an example of redundant data. Imagine you’re building a classifier to distinguish between images of cats and dogs. You already have a dataset of 100 cat pictures and are looking for a dataset of dog pictures. Two of your friends, Robert and Tom, each offer their dataset.
Which dataset of dogs would you pick?
Of course, it depends on the exact goal of the classifier and the environment it will run in in production. But I hope that in the majority of cases you would agree that the dataset from Tom, with 100 different dogs, makes more sense.
Let’s assume every image has a certain amount of information it can contribute to our dataset. Images with the same content (same dog) add less additional information than images with new content (different dogs). One could say that similar images have semantic redundancy.
Papers such as The 10% You Don’t Need explore this in more detail. They follow a two-step procedure to find and remove less informative samples. First, they train an embedding. Then they remove nearest neighbors using Agglomerative Clustering with the common cosine distance as the metric. You could also normalize the features to unit norm and use the L2 distance instead.
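As a minimal sketch of this second step (using SciPy’s hierarchical clustering, which implements the same agglomerative procedure; the embeddings here are random stand-ins for learned features, not the paper’s pipeline):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy embeddings standing in for learned features: 6 samples, 4 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))
embeddings[5] = embeddings[0]  # plant an exact duplicate

# Agglomerative (hierarchical) clustering with cosine distance.
condensed = pdist(embeddings, metric="cosine")
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=4, criterion="maxclust")

# Keep one representative per cluster; everything else is redundant.
keep = sorted(int(np.flatnonzero(labels == c)[0]) for c in np.unique(labels))
print(keep)  # the planted duplicate (index 5) is dropped
```

The duplicate sample has cosine distance zero to its twin, so it is merged first and removed when we keep only one representative per cluster.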
There are two problems we need to solve: first, we need an embedding that captures semantic similarity; second, Agglomerative Clustering becomes very slow for large datasets.
The authors solve the first problem by training the embedding using the provided labels. Think of training a classifier and then removing the last layer to get good features.
The second problem needs more creativity. Let’s assume we have very good embeddings separating the individual classes. We can then process each class independently. For a dataset with 50k samples and 10 classes, we would run the clustering 10 times on 5k samples each. Since the time and space complexities of the clustering are O(n³) and O(n²), this is a significant speedup.
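Quick arithmetic shows the size of the win: splitting the O(n³) clustering across k equally sized classes reduces both the work and the peak memory by a factor of k².

```python
# Clustering all 50k samples at once vs. 10 classes of 5k samples each.
n, k = 50_000, 10

time_full = n ** 3              # O(n^3) time for one big run
time_split = k * (n // k) ** 3  # k smaller runs of n/k samples each

memory_full = n ** 2            # O(n^2) pairwise-distance matrix
memory_split = (n // k) ** 2    # largest matrix needed at any one time

print(time_full // time_split)      # k^2 = 100x less work
print(memory_full // memory_split)  # k^2 = 100x less peak memory
```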
At WhatToLabel, we want to make machine learning more efficient by focusing on the most important data. We help ML engineers filter and analyze their training data.
We use the same two-step approach.
First, we want to get a good embedding, and a lot of our effort focuses on this part. We have pre-trained models that we use as a base and can fine-tune on a specific dataset using self-supervision. This allows us to work with unlabeled data. We will explain the self-supervision part in another blog post. The pre-trained model has a ResNet50-like architecture, but its output embedding has only 64 dimensions. High-dimensional embeddings introduce various issues, such as higher computation and storage costs and less meaningful distances due to the curse of dimensionality.
Despite the low output dimension, the way we train our embedding still yields a high accuracy:
Second, we want a fast data-selection algorithm that works on the embedding; Agglomerative Clustering is too slow. Instead, we explore local neighborhoods by building a graph and then iteratively running algorithms on it. We use two types of algorithms: destructive ones, which start with the full dataset and remove samples, and constructive ones, which build a new dataset from scratch by adding relevant samples one by one.
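Our production algorithms are more involved, but a much-simplified destructive pass can be sketched in a few lines (illustrative only; the pairwise-distance loop below is for clarity, not efficiency):

```python
import numpy as np

def destructive_filter(embeddings, keep_fraction=0.9):
    """Toy destructive selection: repeatedly drop one sample from the
    closest remaining pair until keep_fraction of the data is left."""
    # Normalize so L2 distance is monotonic in cosine distance.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = np.linalg.norm(x[:, None] - x[None, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    keep = set(range(len(x)))
    target = int(len(x) * keep_fraction)
    while len(keep) > target:
        # Find the closest remaining pair...
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        # ...and drop one of the two: its information is nearly duplicated.
        dist[j, :] = np.inf
        dist[:, j] = np.inf
        keep.discard(int(j))
    return sorted(keep)

rng = np.random.default_rng(42)
emb = rng.normal(size=(20, 8))
emb[1] = emb[0]  # plant an exact duplicate
selected = destructive_filter(emb, keep_fraction=0.9)
print(len(selected))  # 18 samples kept; the duplicate is removed first
```

A constructive variant would invert the loop: start from an empty set and greedily add the sample farthest from everything selected so far.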
When we combine the two steps we get a fast filtering solution that even works without labels.
We have developed a product that does exactly that, which you can check out at whattolabel.com, or you can build your own pipeline following this two-step procedure for filtering.
For the following experiments, we use the WhatToLabel data filtering solution.
The full code to reproduce the experiments is provided in this GitHub repository: https://github.com/WhatToLabel/examples
Now we know what redundancies are and how they can negatively influence a training set. Let’s have a look at a simple dataset such as fashion-MNIST.
Fashion-MNIST contains 60k training and 10k test images. For the following experiment, we only filter the training data. Our goal is to remove 10% of the samples: once with a random selection and once with a more sophisticated data filtering method. We then evaluate the two subsampling methods against the full dataset. For the evaluation, we simply train a classifier.
We use PyTorch as our framework. For the evaluation, we train a ResNet-34 with SGD using the following parameters:
To ensure reproducibility, we set the seeds for the random number generators:
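For example (the seed value itself is arbitrary):

```python
import random

import numpy as np
import torch

seed = 42
random.seed(seed)        # Python's built-in RNG
np.random.seed(seed)     # NumPy's legacy global RNG
torch.manual_seed(seed)  # PyTorch (CPU and all CUDA devices)

# cuDNN can still pick nondeterministic kernels; restrict it as well.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```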
We normalize the input data using the ImageNet statistics, and for the training data, we use a random horizontal flip as data augmentation. Additionally, we convert the B/W images to RGB.
In order to conduct multiple experiments with different fractions of the training data, we use a simple trick. Within the data loader, we use a random subset sampler that only samples from a list of indices. For the WhatToLabel filtered dataset, we provide a list of indices within the repository. For random subsampling, we create our own list using NumPy.
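A sketch of this trick on a dummy dataset (1,000 samples here for illustration; in the real experiment the second list of indices would be loaded from the file in the repository instead of drawn randomly):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Dummy stand-in for the training set.
n_train = 1_000
dataset = TensorDataset(torch.zeros(n_train, 1), torch.zeros(n_train))

# Random subsampling: draw 90% of the indices without replacement.
rng = np.random.default_rng(0)
indices = rng.choice(n_train, size=int(0.9 * n_train), replace=False)

# The sampler restricts the loader to exactly these indices.
loader = DataLoader(
    dataset,
    batch_size=128,
    sampler=SubsetRandomSampler(indices.tolist()),
)
print(len(loader.sampler))  # 900 indices the loader will sample from
```

Swapping in a different index list is all it takes to rerun the same training code on a different subset.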
Before training our classifier we want to rebalance the weight for the cross-entropy loss based on the number of samples for each class.
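A minimal sketch of such rebalancing with inverse-frequency weights (the class counts below are made up; three classes instead of ten for brevity):

```python
import torch

# Hypothetical class counts after filtering.
labels = torch.tensor([0] * 50 + [1] * 30 + [2] * 20)

# Inverse-frequency weights, scaled so the average weight per sample is 1.
counts = torch.bincount(labels, minlength=3).float()
weights = counts.sum() / (len(counts) * counts)

criterion = torch.nn.CrossEntropyLoss(weight=weights)
print(weights)  # rarer classes get larger weights
```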
Now everything is in place, and we can start training the model. On an Nvidia V100 GPU, it takes around 15 seconds per epoch. The whole notebook requires around 75 minutes to complete.
After the training and evaluation process completes, we should see three plots. In the top row on the left, we have the training loss for the two subsampling experiments. On the right, we have the test accuracy for all three experiments (the two subsampling ones and the full dataset). On the bottom, we take a closer look at the accuracy for training epochs 50–100.
There are two jumps in accuracy and loss at epochs 60 and 80, caused by the updates of the learning rate. If you take a closer look at the accuracy, you will notice that the experiment using the WhatToLabel subsampling (red) performs very similarly to the one using the full dataset (blue), while the experiment using random subsampling (green) has a lower accuracy. The top accuracies for the three experiments are as follows:
At first, this might make no sense. But let’s go back to the beginning of the post, where we talked about redundant data. Let’s assume such redundancies exist within the Fashion-MNIST dataset. If we remove them, the dataset gets smaller, but the information a model can obtain by training on it won’t decrease by the same amount. The simplest example of such a redundancy is a near-duplicate image. But we’re looking for more than just similar images; we’re looking for similar feature activations, i.e., semantic redundancies.
It’s important to conduct multiple experiments and report mean and standard deviation. Maybe our result from before was just an outlier? We at WhatToLabel have an internal benchmarking suite that evaluates our filtering software on various datasets, with multiple training set sizes, and with multiple seeds.
I hope you liked this post. In our next posts, we will go into more detail about data redundancies.