Data Labeling: AI’s Human Bottleneck

This post was published on: https://medium.com/whattolabel/data-labeling-ais-human-bottleneck-24bd10136e52

AI applications will form a $100 billion market by 2025

Customers are increasingly demanding smart products such as autonomous cars and home assistants. As a result, the AI market is expected to grow to over $100Bn by 2025 (image below). But what does it take to make a product smart?


AI Market Growth (Tractica, 2019; WhatToLabel Research)

To build those smart products, engineers need to develop machine learning and deep learning algorithms. However, those algorithms are not smart by default. Like humans, they need to learn. In technical terms: they need to be trained on labeled data.

Data labeling limits AI development

Labeled data is data (e.g., images) to which a human worker has added labels for certain objects, telling the machine: “That’s a car” or “that’s a cat.” Through those labels, the machine learns until it can identify the objects itself and react accordingly. Data labeling has grown into a huge industry in just six years. Players such as Appen, Scale.ai, Playment, Understand.ai, and iMerit manage millions of contract workers in low-cost countries to label data. Yet even with outsourcing, keeping a human in the loop is still very costly.


Source: Scale.ai/pricing

Labeling an image for semantic segmentation costs up to $6.40 per image at the Silicon Valley startup Scale.ai (image above). Dataset sizes for deep learning are usually around 20k–50k images, which leaves you with costs of up to $320,000 for a single dataset. In autonomous driving, dataset sizes can even reach millions of images. On top of that, data volumes are expected to grow by around 30% on average over the next seven years, so companies are gathering more and more data and dataset sizes keep increasing. Under these circumstances, relying on humans for data annotation will severely limit AI’s development in the coming years. Today, even companies with large development budgets are asking themselves how to contain their growing data labeling expenditures.
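
To make the cost math concrete, here is a quick back-of-the-envelope estimate using the per-image price from the Scale.ai pricing above and the dataset sizes mentioned in the text:

```python
# Back-of-the-envelope labeling cost: per-image price times dataset size.
# $6.40 per image is the semantic segmentation price from the Scale.ai
# pricing shown above; dataset sizes are the ranges mentioned in the text.
price_per_image = 6.40
for dataset_size in (20_000, 50_000, 1_000_000):
    cost = price_per_image * dataset_size
    print(f"{dataset_size:>9,} images -> ${cost:>12,.0f}")
# 50,000 images already cost about $320,000; millions of images
# (common in autonomous driving) push the bill into the millions of dollars.
```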

Most time is spent on data-related tasks

Source: Cognilytica; Factordaily

Today, up to 80% of the time spent on machine learning goes into data-related tasks, and a quarter of all that time goes to data labeling alone. That is huge, and it shows how important data preparation is for deep learning. Remember: time equals money. So why waste valuable resources on data you don’t need?

Data Filtering: Updating the deep learning pipeline

Source: Timon Ruban, Luminovo (2018)

In the image above from Timon Ruban, data labeling immediately follows the first step, data sourcing (also called data collection). Increasingly, this is no longer a viable approach. Companies are beginning to use data filtering methods to select the most relevant samples before moving on to data labeling. This has two advantages:

  1. They can heavily reduce their data-related costs (e.g., labeling, storage, and transfer) by up to 50%, and in some cases down to 1%
  2. By filtering their raw training data, they reduce overfitting and achieve better results with their deep learning-based products

There are several ways to deploy data filtering. Approaches can be based on statistical methods, active learning, or self-supervised learning (e.g., whattolabel.com). You can build a solution yourself or use one of the software solutions on the market. Depending on what you are looking for and the resources (e.g., time) you have, one option or the other makes more sense.
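
As a rough illustration, and not the method of any particular vendor, the sketch below filters an unlabeled dataset by embedding each sample and greedily selecting the most diverse ones (farthest-point sampling). The embeddings here are random placeholders; in practice they would come from a pretrained or self-supervised model.

```python
import numpy as np

def farthest_point_sampling(embeddings: np.ndarray, n_select: int, seed: int = 0) -> list:
    """Greedy k-center selection: pick samples that are spread out in
    embedding space so the labeling budget covers diverse data instead
    of near-duplicates."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # start from a random sample
    # distance of every point to its nearest already-selected point
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < min(n_select, n):
        idx = int(np.argmax(dists))    # farthest from the current selection
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return selected

# Example: filter 10,000 raw samples down to a 1,000-sample labeling budget.
# Random embeddings stand in for features from a real model.
embeddings = np.random.rand(10_000, 128).astype(np.float32)
keep = farthest_point_sampling(embeddings, n_select=1_000)
print(f"Selected {len(keep)} of {len(embeddings)} samples for labeling")
```

Greedy farthest-point selection is just one heuristic; active learning scores or clustering could be swapped in without changing the overall pipeline.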

In a nutshell, customers demand smart products, so it is crucial to invest in AI development. However, we should not only build smart products but also spend the money smartly. Well-managed organizations should therefore label, store, and transfer only the data that is relevant, saving resources for the most important tasks.

Matthias Heller,
Co-Founder Lightly.ai

Thanks to Mara Kaufmann and Igor Susmelj.