Data You Don’t Need: Removing Redundancy

This blog was published on https://towardsdatascience.com/the-data-you-don-t-need-removing-redundant-samples-6bfd07c1516c

In ML there is a saying: garbage in, garbage out. But what does it really mean to have good or bad data? In this post, we will explore data redundancies in the training set of Fashion-MNIST and how they affect test set accuracy.

What is Data Redundancy?

We leave the more detailed explanation for a later post, but let’s look at an example of redundant data. Imagine you’re building a classifier that distinguishes between images of cats and dogs. You already have a dataset of 100 cats and are looking for a dataset of dog pictures. Two friends of yours, Robert and Tom, offer their datasets.

  • Robert offers you 100 pictures of his dog Bella that he took last week.
  • Tom offers you pictures of 100 different dogs he collected over the last year.

Which dataset of dogs would you pick?

Of course, it depends on the exact goal of the classifier and the environment it will run in once it is in production. But I hope you would agree that, in the majority of cases, the dataset from Tom with 100 different dogs makes more sense.

Why?

Let’s assume every image has a certain amount of information it can contribute to our dataset. Images with the same content (same dog) add less additional information than images with new content (different dogs). One could say that similar images have semantic redundancy.

There are papers such as The 10% You Don’t Need exploring this in more detail.

Remove Redundant Data

Papers such as The 10% You Don’t Need follow a two-step procedure to find and remove less informative samples. First, they train an embedding. Then they use Agglomerative Clustering to find and remove samples that are nearest neighbors in the embedding space. For the clustering, they use the common cosine distance as a metric. You could also normalize the features to unit norm and use L2 distance instead.
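The relationship between the two metrics can be checked numerically. Below is a minimal sketch in plain Python (the helper names `normalize`, `cosine_distance`, and `squared_l2` are made up for this illustration). For unit-norm vectors, the squared L2 distance equals twice the cosine distance, so both metrics rank neighbors in the same order.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary toy feature vectors (not real embeddings).
a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]
a_unit, b_unit = normalize(a), normalize(b)

# On unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance(a, b)
lhs = squared_l2(a_unit, b_unit)
rhs = 2.0 * cosine_distance(a, b)
```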

There are two problems we need to solve:

  • How do we get a good embedding?
  • Agglomerative Clustering is slow: it has O(n³) time and O(n²) space complexity.

The authors solve the first problem by training the embedding using the provided labels. Think of training a classifier and then removing the last layer to get good features.

The second problem needs more creativity. Let’s assume we have very good embeddings that separate the individual classes. We can then process each class independently. For a dataset with 50k samples and 10 balanced classes, we would run the clustering 10 times on 5k samples each. Because the time complexity is cubic and the space complexity is quadratic in the number of samples, this is a significant speedup.
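To make the speedup concrete, here is a small back-of-the-envelope calculation. It treats n³ and n² as rough proxies for time and memory, ignores constant factors, and assumes the classes are balanced and clustered one after another.

```python
def clustering_cost(n):
    """Return (time, space) proxies for naive agglomerative clustering."""
    return n ** 3, n ** 2

n_total, n_classes = 50_000, 10
n_per_class = n_total // n_classes  # 5,000 samples per class

time_full, space_full = clustering_cost(n_total)
time_one, space_one = clustering_cost(n_per_class)

# Ten clustering runs, one per class, executed sequentially.
time_speedup = time_full / (n_classes * time_one)
# Only one class has to be held in memory at a time.
space_saving = space_full / space_one

print(time_speedup, space_saving)  # 100.0 100.0
```

Under these assumptions, clustering per class needs about 100 times fewer operations and about 100 times less memory per run.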

The Startup Approach of Lightly (formerly WhatToLabel)

At Lightly we want to make the use of machine learning more efficient by focusing on the most important data. We help ML engineers filter and analyze their training data.

We use the same two-step approach.


A high-level overview of our data selection algorithms

First, we want to get a good embedding. A lot of our effort goes into this part. We use pre-trained models as a base that we can fine-tune on a specific dataset using self-supervision. This allows us to work with unlabeled data. We will explain the self-supervision part in another blog post. The pre-trained model has a ResNet50-like architecture. However, the embedding it outputs has only 64 dimensions. High-dimensional embeddings introduce various issues, such as higher computation time and storage cost, and less meaningful distances due to the curse of dimensionality.

With the way we train our embedding, we still get high accuracy:


Comparison of different embeddings on the ImageNet 2012 val set

Second, we want to use a fast algorithm for data selection based on the embedding. Agglomerative Clustering is too slow. Instead, we explore the local neighborhoods by building a graph and then iteratively run algorithms on it. We use two types of algorithms. Destructive algorithms start with the full dataset and then remove samples. Constructive algorithms, on the other hand, build a new dataset from scratch by adding only relevant samples, one by one.
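To illustrate the constructive idea, here is a toy sketch in plain Python. It is not Lightly's actual algorithm: it uses a brute-force distance check instead of a neighborhood graph, and the function name `select_constructive`, the `min_distance` threshold, and the 2D embeddings are all made up for this example.

```python
import math

def select_constructive(embeddings, min_distance):
    """Greedily build a subset: keep a sample only if it is at least
    `min_distance` away from every sample kept so far."""
    selected = []
    for index, candidate in enumerate(embeddings):
        far_from_all = all(
            math.dist(candidate, embeddings[kept]) >= min_distance
            for kept in selected
        )
        if far_from_all:
            selected.append(index)
    return selected

# Hypothetical 2D embeddings; real embeddings would have e.g. 64 dimensions.
embeddings = [
    (0.0, 0.0),
    (0.05, 0.0),  # near-duplicate of sample 0
    (1.0, 1.0),
    (1.0, 1.02),  # near-duplicate of sample 2
    (3.0, 0.0),
]

kept = select_constructive(embeddings, min_distance=0.1)
print(kept)  # [0, 2, 4]
```

A destructive variant would start from all five indices and drop one sample of every pair that is closer than the threshold.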

When we combine the two steps we get a fast filtering solution that even works without labels.

We have developed a product that does exactly this, which you can check out at Lightly.ai, or you can build your own pipeline following this two-step procedure for filtering.

For the following experiments, we use the Lightly data filtering solution.

Filtering Fashion-MNIST to remove redundancy

The full code to reproduce the experiments is provided in this GitHub repository: https://github.com/lightly/examples

Now we know what redundancies are and that they can negatively influence training. Let’s have a look at a simple dataset such as Fashion-MNIST.

Fashion-MNIST contains 60k/10k images in the train/test sets. For the following experiment, we only filter the training data. Our goal is to remove 10% of it, once with a random selection of the samples and once with a more sophisticated data filtering method. We then evaluate the two subsampling methods against the full dataset. For the evaluation, we simply train a classifier.

Experiment Setup

We use PyTorch as a framework. For the evaluation, we train a resnet34 with SGD using the following parameters:

  • Batch Size: 128
  • Epochs: 100

To ensure reproducibility we set the seed for random number generators:


Set random seed for reproducibility
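A minimal sketch of what such a seeding helper could look like (not necessarily the code used in the experiments). The function name `set_seed` is an assumption, and NumPy and PyTorch are seeded only if they are installed.

```python
import random

def set_seed(seed):
    """Seed Python's RNG and, when available, NumPy's and PyTorch's."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Trade some speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

set_seed(42)
first_draw = random.random()
set_seed(42)
second_draw = random.random()  # identical to first_draw
```

Even with fixed seeds, some GPU operations can remain nondeterministic, so small run-to-run differences are still possible.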

We normalize the input data using the ImageNet statistics, and for the training data, we use a random horizontal flip as data augmentation. Additionally, we convert the grayscale images to RGB.


Data transformation for train and test set

To conduct multiple experiments with different percentages of the training data, we use a simple trick. Within the data loader, we use a random subset sampler which only samples from a list of indices. For the Lightly filtered dataset, we provide a list of indices within the repository. For random subsampling, we create our own list using NumPy.


Code for getting the indices for the experiments and creating the data loaders
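The random index list could be built along the following lines. This is a sketch under assumptions: it uses Python's standard library instead of NumPy to stay dependency-free, and the function name `random_subset_indices` and the seed value are made up. The resulting list could then be passed to a sampler such as `torch.utils.data.SubsetRandomSampler`.

```python
import random

def random_subset_indices(num_samples, keep_fraction, seed):
    """Draw a reproducible random subset of dataset indices without replacement."""
    rng = random.Random(seed)  # local RNG, leaves the global state untouched
    num_keep = int(round(num_samples * keep_fraction))
    return sorted(rng.sample(range(num_samples), num_keep))

# Keep 90% of the 60,000 Fashion-MNIST training images.
indices = random_subset_indices(60_000, keep_fraction=0.9, seed=1)
print(len(indices))  # 54000
```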

Before training our classifier, we rebalance the class weights of the cross-entropy loss based on the number of samples for each class.


Python code for rebalancing the loss based on the number of samples per class
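One common way to compute such weights is inverse class frequency. The sketch below assumes this scheme, which is not necessarily the exact formula used in the experiments, and the function name `class_weights` and the toy labels are made up. In PyTorch, the resulting list could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
from collections import Counter

def class_weights(labels, num_classes):
    """Inverse-frequency weights: total / (num_classes * count_c).

    A perfectly balanced dataset gives every class a weight of 1.0;
    classes with no samples get a weight of 0.0.
    """
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]

# Toy labels: class 0 is twice as frequent as classes 1 and 2.
labels = [0, 0, 0, 0, 1, 1, 2, 2]
weights = class_weights(labels, num_classes=3)
# weights[0] = 8 / (3 * 4) = 2/3, weights[1] = weights[2] = 8 / (3 * 2) = 4/3
```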

Now everything is in place and we can start training the model. On an Nvidia V100 GPU, training takes around 15 seconds per epoch. The whole notebook requires around 75 minutes to complete.

Results

After the training and evaluation process completes, we should see three different plots. In the top row on the left, we have the training loss for our two subsampling experiments. On the right, we have the test accuracy for all three experiments (the two subsampling ones and one for the full dataset). On the bottom, we have a closer look at the accuracy results for training epochs 50–100.


Plots showing training loss and test accuracy for the three experiments

There are two jumps in accuracy and loss, at epochs 60 and 80. These jumps come from the learning rate updates. If you take a closer look at the accuracy, you will notice that the accuracy of the experiment using Lightly subsampling (red) is very similar to the one using the full dataset (blue). The experiment using random subsampling (green) has lower accuracy. The results from the training process support those findings. The top accuracies for the three experiments are as follows:

  • Best test accuracy using Lightly subsampling (90%): 92.93%
  • Best test accuracy using the full training dataset (100%): 92.79%
  • Best test accuracy using random subsampling (90%): 92.43%

Higher accuracy with less training data!?

At first, this might make no sense. But let’s go back to the beginning of the post, where we talked about redundant data. Let’s assume that we have such redundancies within the Fashion-MNIST dataset. If we remove them, the dataset gets smaller, but the information a model can obtain by being trained on it won’t decrease by the same amount. The simplest example of such redundancy is a pair of similar-looking images. But we’re looking for more than just similar images: we look for similar feature activations, or semantic redundancies.

Repeating the experiment with different seeds?

It’s important to conduct multiple experiments and report mean and standard deviation. Maybe our result from before was just an outlier? We at Lightly have an internal benchmarking suite that evaluates our filtering software on various datasets, with multiple training set sizes, and with multiple seeds.


Fashion-MNIST test accuracy of Lightly (WhatToLabel) vs random subsampling

I hope you liked this post. In our next posts, we will go into more detail about data redundancies.

Igor, co-founder
Lightly.ai
