We live in a technologically advanced world. Technology's aim is to make both our lives and work easier by introducing a helping hand for various tasks. Thanks to machine learning, we don't necessarily even have to explicitly program an algorithm to finish a task. Machine learning models are becoming smarter on their own with each passing day.
We use huge amounts of data to train models. Some tasks require datasets classified with certain tags linked to them. Is it even possible to manually tag this huge pile of data? If not, is there an effective way to handle it? What if a model had a say in choosing its own data?
This is where active learning comes into play. In this post, I'll walk through the basic understanding of active learning in machine learning, its components, the different strategies used, and a generic use case. Let's begin with the definition.
What Is Active Learning: A Quick Definition
Active learning is a subfield of machine learning, also known as query learning. By query, we mean asking specific questions with the goal of extracting specific information. The learning focuses on a region of interest (ROI). Say we have a large dataset of unlabeled data, but we're interested in only a few of the data points. The idea behind active learning is that the model can choose the data from which it wants to learn.
For instance, let's say we have a labeled dataset of certain diseases with their symptoms. When a new patient comes in and tells a doctor their symptoms, the disease can be identified. But what if the symptoms are new and there's no relevant information to identify the disease? This is where active learning comes into the picture: it lets the model query for the most informative new cases instead of relying only on the existing dataset.
Now that we've mentioned labeled and unlabeled datasets, we'll go over from where these terms arise and why they're important in a moment. But first, let's quickly look at supervised learning.
Supervised learning is a type of learning that requires labeled training data. Labeled data simply means that the input data has a tag associated with it. It's tedious to slog through each data point and manually label it.
Let's look at another example. Say we're developing a speech recognition model. We need accurately transcribed audio so that each word is recognizable. Correctly labeling each word can take a long time, approximately ten minutes of annotation for each minute of speech. It becomes a bigger problem if the dialect or language is not widely known.
Active learning is the solution to this problem. Let's dig a bit deeper and look at how active learning works.
Components of Active Learning in Machine Learning
There are basically four components of active learning:
- Unlabeled dataset: This is the group of data that is still to be identified or classified based on the application. This data can be selected from various resources.
- Oracle: This is a human annotator. The oracle answers queries posed according to the application, labeling the selected data points from the unlabeled set.
- Labeled dataset: Once the oracle labels the data points selected from the region of interest, they move from the unlabeled set to the labeled set. This set then becomes the training dataset.
- Machine learning algorithm: Now the model is developed based on the type of machine learning algorithm. The algorithm is decided based on the type of dataset and application.
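The four components above can be wired together in a minimal sketch. The feature values, the labeling rule, and the query budget below are all made up for illustration:

```python
def oracle_label(x):
    """Oracle: stands in for the human annotator answering queries."""
    return "Y" if x >= 5 else "N"

# Unlabeled dataset: raw feature values with no tags yet.
unlabeled = [1, 8, 3, 6, 2, 7]

# Labeled dataset: starts empty and is filled by querying the oracle.
labeled = []

# Query loop: move three points from the unlabeled to the labeled set.
for _ in range(3):
    x = unlabeled.pop(0)                  # a real strategy chooses smarter
    labeled.append((x, oracle_label(x)))

# The machine learning algorithm would now train on `labeled`.
```

The strategies discussed next differ mainly in how the `unlabeled.pop(0)` line chooses which point to query.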
There are three main strategies for performing active learning. Let's check them out one by one.
Membership Query Synthesis
As we've discussed, we select the data points from the unlabeled dataset based on some queries. The oracle performs this step. The acquisition of unlabeled data points (which are huge in number) is easy, but extracting relevant data points for labeling is challenging. Now let's see how we can find the labeled data from this strategy.
In this strategy, the learner, meaning the component that builds the machine learning model, generates (synthesizes) instances of its own rather than only picking them from the unlabeled dataset. For each synthesized instance, the learner queries the oracle and asks for its label.
But if we use this technique to classify certain images, for example, the learner may synthesize an image that consists only of noise, which the oracle can't label. Likewise, this scenario isn't well suited to speech recognition, since the learner may generate gibberish audio.
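A minimal sketch of membership query synthesis: the learner creates its own query instances instead of drawing them from a dataset, here by binary-searching for a 1-D decision boundary. The oracle and the true boundary (5.0) are made-up stand-ins:

```python
def oracle(x):
    """Simulated human annotator with a hidden decision rule."""
    return "Y" if x >= 5.0 else "N"

lo, hi = 0.0, 10.0          # a known "N" input and a known "Y" input
for _ in range(20):
    query = (lo + hi) / 2   # synthesized instance, not from any pool
    if oracle(query) == "Y":
        hi = query          # boundary lies at or below the query
    else:
        lo = query          # boundary lies above the query

boundary = (lo + hi) / 2    # estimate after 20 oracle queries
```

Note that every `query` here is a brand-new input the learner invented, which is exactly why, on image or audio data, this strategy can produce unlabelable noise.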
Stream-Based Selective Sampling
This strategy is helpful for acquiring an unlabeled pool of data. As the name suggests, the data points arrive one at a time from an online stream and are stored in some database. Then queries are performed on them in a selective manner. The strategy is quite useful when the input dataset has a non-uniform distribution.
The strategy follows certain steps. The first step is sampling. This is the most important step of the stream-based selective approach. There are three major benefits of sampling:
- Reduces the population size considered under study
- Helps reduce errors or noise in the instances
- Reduces the cost of production, labeling the instances, etc.
The second step is to make a decision on each sampled data point. Samples are drawn one at a time from the underlying probability distribution, and for each one the learner decides whether the selected data point should be labeled or rejected.
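The two steps above, sampling one point at a time and deciding whether to label it, can be sketched as follows. The stream, the toy threshold model, and the 0.1 uncertainty band are all made up for illustration:

```python
import random

random.seed(42)                         # deterministic stream for the demo

def oracle(x):
    """Simulated annotator for the toy task."""
    return 1 if x >= 0.5 else 0

threshold = 0.5                         # current model's decision boundary
queried, rejected = [], []

for _ in range(1000):                   # samples arrive one at a time
    x = random.random()
    if abs(x - threshold) < 0.1:        # uncertain: worth asking the oracle
        queried.append((x, oracle(x)))
    else:                               # confident: reject without labeling
        rejected.append(x)
```

Because only points near the decision boundary get labeled, the labeling cost drops sharply compared to annotating the entire stream.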
Pool-Based Active Learning in Machine Learning
The basic idea of this strategy is to rank the samples in some manner. The pool of unlabeled data is collected all at once, making this an offline approach: unlike stream-based selective sampling, the data is not drawn from an online stream.
Because the pool is static, the learner can score every sample before selecting the ones to send for labeling.
Firstly, we divide the data into two groups. One is the pool data, and the other is the test data. The pool has the data points that have the capability to produce a good amount of information.
Secondly, we divide the pool data into k samples. The k depends upon the learner. These k samples act as training data. The rest of the data becomes the validation dataset.
The third step is to normalize all the groups of data. This step is useful for identifying whether the data is uniform or non-uniform.
Lastly, we train the model using the training data. As mentioned, the k sample is the training data. We give ranks or some kind of weights to these samples.
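The steps above can be sketched end to end: collect the whole pool at once, rank every sample by how informative it looks, and send the top k to the oracle. The pool values, the toy threshold model, and the choice of k are made up for illustration:

```python
def oracle(x):
    """Simulated annotator for the toy task."""
    return "Y" if x >= 5 else "N"

pool = [0.5, 9.2, 4.8, 5.1, 2.0, 7.7, 4.9]   # collected all at once (offline)
threshold = 5.0                               # current model's decision boundary
k = 2                                         # samples per round, chosen by the learner

# Rank the whole pool: distance to the boundary as an (inverse)
# informativeness score -- closer means more uncertain, hence more useful.
ranked = sorted(pool, key=lambda x: abs(x - threshold))
to_label = ranked[:k]                         # the k most uncertain samples
labeled = [(x, oracle(x)) for x in to_label]
```

The key contrast with stream-based sampling is the `sorted(pool, ...)` call: the learner sees and ranks the entire pool before committing any labeling budget.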
Now let's put all these things together and look at active learning through an example.
Let's start with a generic example that will help us understand supervised learning and then active learning implementation.
Suppose there are five data points (P1, P2, P3, P4, and P5) with two features associated with each data point (e.g., A and B). Since this is a general example, A and B are simply references to the features.
As we're dealing with supervised learning, the task is to assign labels to the unlabeled data. Labels are tags given to the data points, and these tags depend on the type of application. For example, if an algorithm identifies whether a data point is a cat or a dog, the label can be either a cat or a dog.
The example considered here consists of unlabeled data points that have no label, and the motive is to predict the labels for each of them. The decision of what kind of labels are chosen depends on the labeled dataset. Let's start by working step by step, and then later, we'll see how labels are assigned.
As we have a dataset comprised of five data points, we need to divide this set into two groups, labeled and unlabeled. The labeled set is a seed. A seed is a small dataset whose data points already have labels assigned to them. Since the seed drives the learning process, it makes up only a small percentage compared to the unlabeled group.
Since our task is to assign labels to each point, let's say the labels are either Y or N. We start by labeling the seed: out of the five data points, we choose two (P1 and P3).
Now we'll train the model. For this purpose, we use the above-declared seed. The labels defined above will be the basis of the training.
It's time to apply the strategies discussed above. From the unlabeled set, we now have to choose appropriate data points using one of the strategies. Let's use a pool-based approach here. At every iteration, we'll choose two data points from the unlabeled lot, and these two data points will move into the labeled dataset.
Now we repeat the training and selection steps until we reach some benchmark.
There is a need to define some benchmarks. For example, we can set a limit on how much of the unlabeled data must be covered as one of the benchmarks.
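The walkthrough above can be sketched end to end. The feature values, the true labels, and the 1-nearest-neighbour model are all invented for illustration; only the structure (a seed of P1 and P3, two points per round, a coverage benchmark) follows the example:

```python
# Five points P1..P5, two features (A, B) each; labels are Y or N.
points = {
    "P1": (1.0, 2.0), "P2": (1.5, 1.8), "P3": (8.0, 9.0),
    "P4": (7.5, 8.5), "P5": (1.2, 2.2),
}
# What the oracle would answer for each point (made up for the demo).
true_labels = {"P1": "N", "P2": "N", "P3": "Y", "P4": "Y", "P5": "N"}

# Step 1: the seed -- a small labeled set (P1 and P3).
labeled = {"P1": "N", "P3": "Y"}
unlabeled = [p for p in points if p not in labeled]

def predict(name):
    """Toy 1-nearest-neighbour model over the current labeled set."""
    ax, ay = points[name]
    nearest = min(
        labeled,
        key=lambda q: (points[q][0] - ax) ** 2 + (points[q][1] - ay) ** 2,
    )
    return labeled[nearest]

# Training and selection, repeated until the benchmark (full coverage):
while unlabeled:
    batch, unlabeled = unlabeled[:2], unlabeled[2:]   # pool-based: 2 per round
    for name in batch:
        labeled[name] = true_labels[name]             # oracle supplies labels
```

Each pass through the loop converts up to two unlabeled points into labeled ones, exactly mirroring the iteration described above.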
The above use case is just an example of implementing active learning. With the use of the different strategies discussed above, we can apply this concept with more efficiency and with a lower response time.
To sum up, active learning is a smarter way to train a model. It helps reduce human effort, time, and cost, and it can improve prediction accuracy for a given labeling budget. With an understanding of its basic components and strategies, it's easy to implement and not hard to learn. This technique is still being actively researched across various other artificial intelligence verticals as well.
Also, check out Lightly, a data curation solution that's useful for helping companies to identify what dataset is best suited for their application for further labeling through active learning.