Label-efficient Content Moderation with Self-supervised Learning


The greatest social media platforms provide a place for people to share, discuss, and admire content created by users. Yet it is exactly this user-created content that poses one of the biggest challenges to today’s social media platforms: content moderation. While the bulk of users adhere to the terms of use, there will always be exceptions: users who share explicit or offensive content. For example, women on dating platforms are often targeted with abusive messages or unsolicited photos. For social media platforms it is of utmost importance to detect and remove such content immediately; otherwise they risk damaging their reputation and losing valuable ad revenue. Detecting and classifying offensive content is a difficult problem that requires human annotations, which is why companies such as Facebook use an error-prone combination of AI and outsourced labor.

Fortunately, the rise of self-supervised learning has provided us with a well-suited tool to tackle this problem: deep neural networks can learn representations of uploaded content without any human having to inspect it beforehand. Afterwards, a simple linear classifier is trained on top of a small, representative subset of these representations.

In this blog post we will show how to do label-efficient content moderation for images using self-supervised learning, with the goal of detecting images that are considered not safe for work (NSFW).

Self-supervised Learning

When it comes to choosing a method for self-supervised learning, one can quickly get lost in the many options available. Dedicated frameworks such as the open-source library lightly implement many different methods, each with its own benefits. Luckily, for this blog post, there’s one quality we’re looking for: the pre-trained machine learning model needs to be label-efficient. This means that it requires only a few annotated images to generalize well on the final content moderation task. Masked Siamese Networks (MSN) excel at this, as can be seen in the table from the paper:

Table 1 from Masked Siamese Networks for Label-Efficient Learning: MSN excels at extreme low-shot learning.

Using the examples from lightly we can set up self-supervised training using Masked Siamese Networks in just a few lines of code. Let’s start by importing the required modules and setting up a logger:

Then, we add a helper class to implement Masked Siamese Networks and initialize the vision transformer.

Next, we set up the dataset, augmentations, optimizer, and learning rate scheduler.

Finally, we implement the MSN training loop and save a checkpoint every 100 epochs. With our setup (NVIDIA RTX A6000), training for 800 epochs takes approximately six days. We chose 800 epochs so that training finishes in a reasonable time, although training for longer would have been possible.

That’s it! After 800 epochs the loss slowly starts to converge, as can be seen in the plot below. Note that self-supervised learning typically profits from longer training, so training for longer would have been an option.

MSN loss for 800 epochs of training. In general, self-supervised learning profits from longer training.

The pretrained model can be used to generate image embeddings (fingerprints) or for finetuning. In this blog post, we’ll explore how to finetune the vision transformer efficiently by training a linear classifier on top of frozen image embeddings.

Visualization of the embeddings generated by the pretrained MSN transformer using Lightly. Colored samples are part of the category “sexy”. The cluster on the left consists of copyright images (text on dark background).

Label-efficient Content Moderation

Now that we have a pretrained model, we want to annotate some of the images and finetune the transformer on this labeled subset. Since the images can contain offensive material, we want to annotate as few images as possible to reduce the burden on the labelers who have to work through the material. For best performance, the labeled subset needs to be representative of the whole dataset. We use the diversity-based selection algorithms by Lightly to get a representative subset of 10 images to be labeled. Below you can see the config we used to get a diverse subset of images from Lightly. Note that we used the embeddings generated by our pretrained vision transformer as input (that’s why we set “num_ftrs” to 384).
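The post does not describe how Lightly's selection works internally. To illustrate the general idea behind diversity-based selection, here is a minimal sketch of a greedy farthest-point (k-center) heuristic. The function name `farthest_point_selection` and the toy 2-D data are assumptions made for this example; they are not part of Lightly.

```python
import math
import random


def farthest_point_selection(embeddings, n_samples, start_index=0):
    """Greedy k-center heuristic: repeatedly pick the point that is
    farthest away from everything selected so far."""
    if not 0 < n_samples <= len(embeddings):
        raise ValueError("n_samples must be between 1 and len(embeddings)")
    selected = [start_index]
    # Distance from every point to its closest selected point so far.
    min_dist = [math.dist(e, embeddings[start_index]) for e in embeddings]
    while len(selected) < n_samples:
        next_index = max(range(len(embeddings)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[next_index]))
    return selected


# Toy data: three tight 2-D clusters standing in for 384-dimensional embeddings.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
embeddings = [
    (cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
    for cx, cy in centers
    for _ in range(20)
]

selected = farthest_point_selection(embeddings, n_samples=3)
clusters_covered = {index // 20 for index in selected}
print(selected, clusters_covered)
```

With three samples the heuristic picks one point from each cluster, which is the behavior we want from a small labeled subset: it should cover the different regions of the embedding space rather than oversample the densest one.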

Next, we need to label the selected images. For the purposes of this blog post we used a dataset that already comes with labels to save ourselves the work.

The considered categories are:

  • neutral
  • explicit

Then, we use logistic regression from scikit-learn to train a classifier on top of the representations learned by the Masked Siamese Networks.
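As a rough sketch of this step, the snippet below fits scikit-learn's `LogisticRegression` on 10 labeled embeddings and evaluates it on a held-out set. The embeddings here are synthetic Gaussian blobs generated by a hypothetical helper `fake_embeddings`; in the actual setup they would come from the frozen MSN backbone. The hyperparameters are assumptions, not the values used for the results in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_ftrs = 384  # embedding dimension mentioned in the post


def fake_embeddings(n, label):
    # Synthetic stand-in for frozen MSN embeddings: one Gaussian blob per class.
    center = 1.0 if label == 1 else -1.0
    return rng.normal(loc=center, scale=1.0, size=(n, num_ftrs))


# 10 labeled training embeddings (0 = neutral, 1 = explicit).
X_train = np.vstack([fake_embeddings(5, 0), fake_embeddings(5, 1)])
y_train = np.array([0] * 5 + [1] * 5)

# Held-out embeddings with a 40/60 neutral/explicit split.
X_test = np.vstack([fake_embeddings(40, 0), fake_embeddings(60, 1)])
y_test = np.array([0] * 40 + [1] * 60)

# The backbone stays frozen; only this linear classifier is trained.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)
explicit_probability = classifier.predict_proba(X_test)[:, 1]
print(f"accuracy: {accuracy:.2f}")
```

The synthetic classes are far easier to separate than real images, so the accuracy printed here says nothing about the real task. The point is only the workflow: embeddings in, a handful of labels, a linear classifier out.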

The table below shows the accuracy of the trained linear classifier. Since the test dataset is split 40/60 into neutral and explicit images, always predicting the majority class already yields 60% accuracy, while guessing uniformly at random yields 50%. The classifier trained on images selected by Lightly performs much better than the one trained on randomly selected images. The latter shows little gain over assigning the same label to all images.

Accuracy of our linear classifier trained on 10 images. The test dataset is not balanced, so always predicting the majority class already reaches 60% accuracy. When trained on randomly selected images, predictions are almost random.
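To put the reported accuracies into context, the trivial baselines for a 40/60 test split can be worked out with a few lines of arithmetic. The variable names are chosen for this example only.

```python
neutral_fraction = 0.4
explicit_fraction = 0.6

# Always predict the majority class ("explicit").
majority_accuracy = max(neutral_fraction, explicit_fraction)

# Guess each class with probability 0.5, independent of the image.
uniform_random_accuracy = 0.5 * neutral_fraction + 0.5 * explicit_fraction

# Guess each class as often as it occurs in the data.
prior_random_accuracy = neutral_fraction**2 + explicit_fraction**2

print(majority_accuracy, uniform_random_accuracy, round(prior_random_accuracy, 2))
```

A classifier is only useful here if it clearly beats the strongest of these baselines, which is the 60% reached by always predicting "explicit".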

Active Learning

Of course, the prediction model we trained in the last section is not perfect. However, considering it was trained on only 10 labeled images, it’s already performing pretty well. How can we go about improving our model as efficiently as possible? Since we’re using logistic regression, we can make use of the uncertainty of the prediction model to refine the decision boundary using active learning. Fortunately, Lightly supports this out of the box. We can select another 90 images based on uncertainty, label them, and finetune the classifier on the combined images. The configuration we used with Lightly is shown below.
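Independently of how Lightly implements it, the core idea of uncertainty-based selection for a binary classifier can be sketched in a few lines: pick the unlabeled images whose predicted probability is closest to 0.5. The function `select_most_uncertain` and the example probabilities are hypothetical and only serve as an illustration.

```python
def select_most_uncertain(probabilities, n_samples, already_labeled=()):
    """Return indices of the unlabeled samples whose predictions are closest to 0.5."""
    labeled = set(already_labeled)
    candidates = [i for i in range(len(probabilities)) if i not in labeled]
    # For a binary classifier, |p - 0.5| is small when the model is unsure.
    candidates.sort(key=lambda i: abs(probabilities[i] - 0.5))
    return candidates[:n_samples]


# Hypothetical predicted probabilities for the "explicit" class.
probabilities = [0.98, 0.51, 0.03, 0.45, 0.70, 0.50, 0.10]

# Index 5 is assumed to be labeled already, so it is skipped.
chosen = select_most_uncertain(probabilities, n_samples=3, already_labeled=[5])
print(chosen)  # [1, 3, 4]
```

In the setting of this post, `n_samples` would be 90 and `already_labeled` would contain the 10 images from the initial diverse selection.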

As expected, the accuracy improves drastically:

Accuracy of our linear classifier trained on 100 images selected with active learning.

Since recall is an important metric for content moderation, we also show the precision-recall curve below. The plot shows that we can reach 90% recall at 80% precision with only 100 labeled images.

Precision-recall curve of our linear classifier trained on 100 images selected with active learning (orange) and selected uniformly at random (blue).
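Each point on such a curve corresponds to one decision threshold on the classifier's predicted probability. The sketch below computes precision and recall for a few thresholds on made-up labels and scores; the numbers are illustrative and are not the results from this post.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall for the positive class (label 1) at a score threshold."""
    predictions = [score >= threshold for score in scores]
    true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p)
    false_positives = sum(1 for y, p in zip(labels, predictions) if y == 0 and p)
    false_negatives = sum(1 for y, p in zip(labels, predictions) if y == 1 and not p)
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    # Convention: precision is 1.0 if nothing is predicted positive.
    precision = true_positives / predicted_positives if predicted_positives else 1.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Made-up ground truth (1 = explicit) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.40, 0.35, 0.30, 0.20, 0.10]

for threshold in (0.75, 0.5, 0.3):
    precision, recall = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more explicit images (higher recall) at the cost of flagging more neutral ones (lower precision). For content moderation, the threshold is typically chosen to guarantee a target recall.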


We have shown how to use self-supervised learning together with active learning to train a content moderation model with a minimal number of labeled images. For information about the dataset or the specific training setup, please reach out to us for a white paper. If you are interested in trying out Lightly, sign up for free and follow the Getting Started section in our docs.
