Active Learning Strategies Compared for YOLOv8 on Lincolnbeet

Learn how different data selection strategies impact model accuracy. We use the lincolnbeet dataset and YOLOv8 model for our experiments.

Agriculture is one of the domains that could benefit a lot from recent breakthroughs in computer vision. Machines that can analyze millions of crops throughout the year to optimise yield and minimise the amount of pesticides required can have a big impact!

We show how ML teams can save up to 77% of labeling costs or improve the model by up to 14.6x per additional labeled batch when using active learning compared to random selection!

We take a closer look at one application of computer vision in agriculture: Using robots equipped with cameras to optimise precision spraying of weeds on large fields of crops. In this example we use the lincolnbeet dataset and set out with the goal of building a reliable computer vision system.

This showcase aims to illustrate how using a smart data selection strategy like active learning yields significant benefits compared to random selection.

For benchmarking different data selection strategies, we will use Lightly. Lightly has built a scalable active learning solution that can be easily plugged into any existing computer vision pipeline. We showcase different built-in strategies to select data for the object detection task of the lincolnbeet dataset evaluated using the YOLOv8 model. You can get started using Lightly for free.


The lincolnbeet dataset consists of 4 402 Full HD images with a total of 39 246 objects in them. The dataset has two classes, sugar beet and weed plants, annotated with bounding boxes. The two classes are almost equally represented, with 16 399 (42%) sugar beets and 22 847 (58%) weed plants. Since we have roughly 9 objects per image on average (39 246 / 4 402 ≈ 8.9) at a rather high image resolution, the cost of annotating this data can be very high.

Example image showing predictions of a YOLOv8 model on lincolnbeet dataset

Our goal is to use an active learning feedback loop in which we iteratively label a bit of data, train a model, and then pick the next batch for labeling based on the model output. We aim to reach a high accuracy with fewer than 400 annotated images.

  • training set size: 3 089 images
  • validation set size: 441 images
  • test set size: 883 images

We analyze in more detail how different selection strategies can impact the selected data.


For our baseline model, we pick 200 images randomly from the training set. It’s also possible to pick the initial 200 images already using Lightly. But since our focus is to show how the various selection strategies can improve an existing dataset, we fix the initial dataset to be a random subset of 200 images.

We train the YOLOv8 model with two different seeds in all further experiments. The plots therefore also show the standard deviation.

The exact code we use to train all of the YOLOv8 models can be found below. We train for 50 epochs with a batch size of 8. We additionally use random vertical flip (flipud) augmentation and increase the input image size to 960 pixels to work better on small objects. We keep these parameters the same and only vary the data (which training set we use) and seed parameter during our experiments.
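As a rough sketch of such a training run, the snippet below collects the settings named above and passes them to the Ultralytics Python API. The `yolov8s.pt` starting weights and the `flipud` probability of 0.5 are assumptions, since the text does not state the model size or the flip probability.

```python
from typing import Any, Dict


def build_train_args(data_yaml: str, seed: int) -> Dict[str, Any]:
    """Training arguments shared by all experiments; only data and seed vary."""
    return {
        "data": data_yaml,  # dataset .yaml that points at the train/val lists
        "epochs": 50,
        "batch": 8,
        "imgsz": 960,       # larger input size to work better on small objects
        "flipud": 0.5,      # ASSUMPTION: the flip probability is not given in the text
        "seed": seed,
    }


def train(data_yaml: str, seed: int, weights: str = "yolov8s.pt"):
    """Train one model. `weights` is an assumed starting checkpoint."""
    # Imported lazily so the helper above works without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**build_train_args(data_yaml, seed))
```

Calling `train("lincolnbeet_random_200.yaml", seed=0)` and again with `seed=1` would then produce the two runs per experiment; the file name and seed values here are placeholders.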

To evaluate the model based on a checkpoint, we can use the following CLI command:
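A minimal sketch of such an evaluation call, assuming the Ultralytics `yolo` command-line tool is installed. The helper only assembles the argument list; the checkpoint and dataset paths are placeholders, and the choice of `split` is an assumption.

```python
import subprocess
from typing import List


def build_val_command(checkpoint: str, data_yaml: str, imgsz: int = 960,
                      split: str = "val") -> List[str]:
    """Assemble a `yolo detect val ...` call as an argument list."""
    return [
        "yolo", "detect", "val",
        f"model={checkpoint}",
        f"data={data_yaml}",
        f"imgsz={imgsz}",   # same input size as during training
        f"split={split}",   # ASSUMPTION: which split to evaluate on
    ]


def evaluate(checkpoint: str, data_yaml: str) -> None:
    """Run the evaluation; needs the `yolo` CLI from the ultralytics package."""
    subprocess.run(build_val_command(checkpoint, data_yaml), check=True)
```

With placeholder file names, this corresponds to running `yolo detect val model=best.pt data=lincolnbeet.yaml imgsz=960 split=val` in a shell.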

Let’s see how the baseline model performs. We show the initial performance of our baseline YOLOv8 model trained on 200 random images.

F1 and PR curves reported by the built-in training pipeline of YOLOv8. As you can see, the performance for the weed class is slightly worse than for sugar beet. We therefore also oversample the weed class in some of our experiments with Lightly.

We also show some example images of what the model sees during training and the validation step. As you can see, the training images are heavily augmented. The default augmentations range from random color changes and resizing to mosaic augmentation and more.

Examples of a random training (left) and validation (right) batch fed to the YOLOv8 model.

For each of the experiments we perform the following steps:

  1. Run Lightly to select a subset of the data based on the selection criteria
  2. Update the YOLOv8 training set to use the newly selected data
  3. Train the YOLOv8 model
  4. Evaluate the YOLOv8 model
  5. Repeat steps 3 and 4 using another seed

For step 2, we need a way to sync the selected data from Lightly with the dataset in the YOLO format. We can do this using the following code:
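One possible way to do this sync is sketched below. It assumes the selected filenames have already been fetched from Lightly as a plain list, that the YOLO training set is a .txt file with one image path per line, and that file names are unique across folders. The function name is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def write_selected_train_list(full_train_list: Path,
                              selected_filenames: Iterable[str],
                              output_list: Path) -> List[str]:
    """Keep only the lines of the full train list whose file name was selected."""
    # Compare by base name so differing folder prefixes do not matter.
    selected = {Path(name).name for name in selected_filenames}
    kept = [
        line.strip()
        for line in full_train_list.read_text().splitlines()
        if line.strip() and Path(line.strip()).name in selected
    ]
    output_list.write_text("\n".join(kept) + ("\n" if kept else ""))
    return kept
```

The original order of the training list is preserved, so two runs with the same selection produce identical files.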

Note that we only update the training set .txt file. YOLOv8 uses a .yaml file that tracks which set is used for training and validation. We can now just copy this yaml file and only change the training set.
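Copying the .yaml file and swapping the training set could look roughly like this. The sketch assumes a flat dataset file with a top-level `train:` key and deliberately avoids a YAML library; the function name is hypothetical.

```python
from pathlib import Path


def copy_yaml_with_new_train(src_yaml: Path, dst_yaml: Path, new_train: str) -> None:
    """Copy a dataset .yaml and replace only its top-level `train:` entry."""
    lines = []
    replaced = False
    for line in src_yaml.read_text().splitlines():
        if line.startswith("train:"):
            line = f"train: {new_train}"
            replaced = True
        lines.append(line)
    if not replaced:
        raise ValueError(f"no top-level 'train:' key found in {src_yaml}")
    dst_yaml.write_text("\n".join(lines) + "\n")
```

The validation entry and class names are copied unchanged, so every experiment is evaluated on the same data.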


In the random experiment, we simply select 400 images at random. You find the Lightly selection config to pick 400 images randomly here:
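A sketch of what such a config can look like, written as a Python dictionary. The `RANDOM` input combined with the `WEIGHTS` strategy follows the Lightly selection config format as far as we recall it; treat the exact keys as an assumption and check them against the Lightly documentation.

```python
from typing import Any, Dict


def random_selection_config(n_samples: int = 400) -> Dict[str, Any]:
    """Selection config that picks `n_samples` images at random."""
    return {
        "n_samples": n_samples,
        "strategies": [
            {
                # ASSUMPTION: key names follow the Lightly selection config format.
                "input": {"type": "RANDOM"},
                "strategy": {"type": "WEIGHTS"},
            }
        ],
    }
```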


ALL refers to the selection strategy used in the Lightly YOLOv7 tutorial. Note that all the other strategies we compare against are ablations of the ALL strategy. This strategy contains the following elements:

  • We use embeddings to find diverse images. The embeddings are computed based on the cropped images using the bounding boxes of our YOLOv8 model.
  • We train our embedding model directly on the cropped images and not on the full frames.
  • We use balancing to get the target ratio of 30% sugar beet and 70% weed. We pick a 30/70 ratio because the initial model seems to struggle more on the weed objects and we want to oversample these cases.
  • We use the predictions to prioritize images with many objects (crowded scenes) using the frequency scorer built into Lightly.
  • We use the prediction probability (objectness least confidence) to get images that are more likely to lie close to the decision boundary.

We present the detailed selection config we used for selecting 400 images. You will often find code using “yolov8-random-200-detection”. This is the task containing the predictions of our YOLOv8 model trained on the initial 200 images.
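The elements of the ALL strategy map onto a selection config roughly as sketched below. The class names `sugar_beet` and `weed` and the exact key names are assumptions based on the Lightly selection config format; only the task name and the 30/70 target come from the text.

```python
from typing import Any, Dict

TASK = "yolov8-random-200-detection"  # prediction task named in the text


def all_selection_config(n_samples: int = 400, task: str = TASK) -> Dict[str, Any]:
    """Sketch of the ALL strategy: diversity + balancing + two score weightings."""
    return {
        "n_samples": n_samples,
        "strategies": [
            # Diversity on embeddings of the object crops.
            {"input": {"type": "EMBEDDINGS", "task": task},
             "strategy": {"type": "DIVERSITY"}},
            # Balance predicted classes towards 30% sugar beet / 70% weed.
            # ASSUMPTION: class names as they appear in the prediction schema.
            {"input": {"type": "PREDICTIONS", "task": task,
                       "name": "CLASS_DISTRIBUTION"},
             "strategy": {"type": "BALANCE",
                          "target": {"sugar_beet": 0.3, "weed": 0.7}}},
            # Frequency scorer: prefer crowded scenes.
            {"input": {"type": "SCORES", "task": task,
                       "score": "object_frequency"},
             "strategy": {"type": "WEIGHTS"}},
            # Uncertainty: prefer low-confidence detections.
            {"input": {"type": "SCORES", "task": task,
                       "score": "objectness_least_confidence"},
             "strategy": {"type": "WEIGHTS"}},
        ],
    }
```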


Using the frequency scorer is a simple way to oversample crowded scenes, but it has a drawback: if we pay per annotation, crowded scenes are more expensive to annotate. We therefore also evaluate the ALL strategy without the frequency scorer.


Using object predictions and their predicted classes, we can estimate the ratio between the classes in the not-yet-annotated dataset. We can use that information directly in Lightly to set target ratios. This can be very helpful if we care more about a rare class than a frequent one. For this dataset, we wanted to oversample weed using a 70/30 weed to sugar beet ratio. But this constraint also impacts the overall selected dataset, so this experiment removes the balancing goal.
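Because these variants are ablations of ALL, each one can be derived by dropping a single entry from the strategy list. The helper below is a hypothetical illustration of that idea; the shortened example config uses the same assumed key names as the Lightly selection config format.

```python
import copy
from typing import Any, Callable, Dict

Strategy = Dict[str, Any]


def drop_strategies(config: Dict[str, Any],
                    should_drop: Callable[[Strategy], bool]) -> Dict[str, Any]:
    """Return a deep copy of `config` without the strategies matching `should_drop`."""
    ablated = copy.deepcopy(config)
    ablated["strategies"] = [s for s in ablated["strategies"] if not should_drop(s)]
    return ablated


def is_frequency_scorer(s: Strategy) -> bool:
    return s["input"].get("score") == "object_frequency"


def is_balancing(s: Strategy) -> bool:
    return s["strategy"]["type"] == "BALANCE"


# Shortened stand-in for the ALL config (key names are assumptions).
example_all = {
    "n_samples": 400,
    "strategies": [
        {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}},
        {"input": {"type": "PREDICTIONS", "name": "CLASS_DISTRIBUTION"},
         "strategy": {"type": "BALANCE"}},
        {"input": {"type": "SCORES", "score": "object_frequency"},
         "strategy": {"type": "WEIGHTS"}},
    ],
}

no_frequency = drop_strategies(example_all, is_frequency_scorer)
no_balancing = drop_strategies(example_all, is_balancing)
```

Working on a deep copy keeps the ALL config untouched, so all ablations start from the same base.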


Our base strategy, ALL, trains the embedding model directly on the object crops. What if we train on the full frames instead? Because we have roughly 10 objects per frame we train for 10 epochs instead of one to make sure the model has seen more or less the same number of samples during training.


As before, we train on images instead of crops, but this time we also remove the frequency scorer.


Remember that we can change the way we train and embed images. We can train on images or crops, and then we can embed images or crops. In total, we have four options. As we don’t expect to gain much from training on objects but embedding frames (lots of context would be missing), we cover the case of training on frames and embedding frames for the diversity criterion.

We can use embeddings for the full frame (left) or for the individual objects (right). Depending on the use case, we might prefer one over the other. For example, to find diverse or similar scenes, we want to use frame-level embeddings. To find diverse or similar objects, we want to work with object embeddings. (Image is a composition of screenshots from the Lightly Platform's embedding plot.)

The following code implements training and embedding on images (and not crops):
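A heavily hedged sketch of how the two setups might differ in configuration. It assumes the Lightly Worker is switched to object crops through an `object_level` entry in the worker config and that the number of training epochs is set via `trainer.max_epochs`; both key names should be verified against the Lightly documentation. The helper itself is hypothetical.

```python
from typing import Any, Dict, Optional


def worker_configs(task: Optional[str], max_epochs: int) -> Dict[str, Dict[str, Any]]:
    """Build worker and training configs for crop-level or frame-level embeddings."""
    worker_config: Dict[str, Any] = {"enable_training": True}
    if task is not None:
        # ASSUMPTION: with `object_level` set, training and embedding use object crops.
        worker_config["object_level"] = {"task_name": task}
    # Without `object_level`, training and embedding run on the full frames.
    lightly_config = {"trainer": {"max_epochs": max_epochs}}
    return {"worker_config": worker_config, "lightly_config": lightly_config}


# Crops: one epoch. Frames: ten epochs, since each frame holds roughly ten objects.
on_crops = worker_configs("yolov8-random-200-detection", max_epochs=1)
on_images = worker_configs(None, max_epochs=10)
```

In the frame-level setup, the diversity strategy would presumably use the embeddings input without a task, while the score-based strategies keep referring to the prediction task.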

Metrics for Evaluation

For all experiments, we compute the mAP50 and mAP50..95 using the built-in metrics from YOLOv8. The metrics are computed similarly to the COCO benchmark.

We use the Lightly API Client to fetch the subset of filenames selected with the corresponding selection strategy. We can do this by comparing the filenames of the selected 400 samples and only keeping them in the YOLOv8 training list.

Furthermore, we also report metrics such as the number of objects annotated in the selected training data. This can be useful for us to compute the cost of the selected data and put it in perspective to the gain in mAP.

We show the results for mAP50.

These plots show the mAP50 scores for the different experiments. We split the results into two plots to make them more readable and show the random baseline in both. As you can see, all of the data selection strategies we discussed outperform the random baseline.

We also show results for mAP50..95:

Number of objects in the newly labeled set

We also analyze how many objects have been labeled based on the selection with the various methods. This can give us insights into the costs that arise with the gain in accuracy.

The following table shows the number of objects of the two classes.

Table showing how many objects are in the datasets that the different selection strategies pick. We count here the number of objects based on the labels of the dataset.

We can also compute the gain in mAP of the different methods compared to our baseline experiment (randomly selecting 200 images). For example, randomly selecting 400 instead of 200 images yields an average gain of 0.25% in mAP50 and 0.85% in mAP50..95.

Comparison of the different methods against the baseline (randomly selecting 200 images). We can see that the “ALL-train-on-images” approach yielded a 14.6x larger gain in mAP50 and a 3.9x larger gain in mAP50..95 than randomly selecting data.

Finally, we can also compute the number of additionally annotated objects. For example, the ALL method resulted in 400 images and 12 059 objects being selected, versus 400 images and 3 472 objects with the random method. The gain in accuracy comes at a high price; let’s quantify it.

In the table below, you find the number of newly annotated objects required for a gain of 1% in mAP50 or mAP50..95. Interestingly, most methods outperform random selection, meaning that we can save $$$ by using any of them. Another interesting insight is that the methods that include the frequency scorer are the most expensive ones: we get a great boost in mAP but also many new objects to label.

In this table, we list the number of additional annotated objects required to get a 1% higher mAP. We use the metrics from the previous tables and experiments for this computation. Note that we compare all methods against the already randomly selected 200 images, which contain 1 981 objects. For example, for random mAP50, we compute (3 472 − 1 981) / 0.25 = 5 964.
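The computation behind each table entry can be written as a small helper. The function name is ours; the numbers in the example are the ones quoted for random selection and mAP50.

```python
def objects_per_map_point(objects_selected: int, objects_baseline: int,
                          map_gain_pct: float) -> float:
    """Additional annotated objects needed per 1% gain in mAP."""
    if map_gain_pct <= 0:
        raise ValueError("mAP gain must be positive to compute a cost per point")
    return (objects_selected - objects_baseline) / map_gain_pct


# Random selection: 3 472 objects in 400 images vs. 1 981 in the initial 200,
# with a gain of 0.25% in mAP50.
random_map50 = objects_per_map_point(3472, 1981, 0.25)
print(random_map50)  # 5964.0
```

A lower value is better: it means fewer new bounding boxes are needed for the same improvement.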

Assuming a cost of $0.05 per bounding box, we end up with different costs to increase our model performance by 1% mAP. For example, the strategies without the frequency scorer are more cost-efficient because fewer objects are annotated; they save up to 77% of the label costs per mAP50 gain and 35% per mAP50..95 gain.

Based on the number of bounding boxes and the cost per bounding box, we can calculate the cost per mAP increase. We assume a price of $0.05 per bounding box for our calculations. We also show the savings in % with a colored background.
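The cost and savings figures follow from two simple formulas, sketched here. The helper names are ours, and the only input from the text is the random mAP50 value of 5 964 objects per 1% gain and the $0.05 price per box.

```python
PRICE_PER_BOX = 0.05  # USD per bounding box, as assumed in the text


def cost_per_map_point(objects_per_point: float,
                       price: float = PRICE_PER_BOX) -> float:
    """Labeling cost in USD for a 1% gain in mAP."""
    return objects_per_point * price


def savings_pct(baseline_cost: float, method_cost: float) -> float:
    """Savings of a method relative to the baseline cost, in percent."""
    return 100.0 * (1.0 - method_cost / baseline_cost)


# Random selection needs 5 964 objects per 1% mAP50.
random_cost = cost_per_map_point(5964)
print(f"{random_cost:.2f}")  # 298.20
```

A method that needs 23% of the baseline cost for the same gain would show up as a saving of 77% in this calculation.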


In this post, we looked at different strategies supported by Lightly to select training data and their impact on model accuracy. Furthermore, we evaluated the methods in terms of accuracy (mAP) and annotation cost efficiency.

Igor Susmelj,
Co-Founder Lightly
