Mean Average Precision (mAP) is a key evaluation metric in object detection and information retrieval. It averages precision-recall performance across classes, offering a single score that balances accuracy and completeness of predictions.
Quick answers to common questions about mean Average Precision (mAP):
Mean Average Precision (mAP) is an evaluation metric that summarizes a model’s ability to make precise and complete predictions. It is widely used in object detection and information retrieval to measure overall performance.
In essence, mAP is the mean of Average Precision (AP) scores over multiple classes or queries, giving a single number that reflects the model’s precision-recall tradeoff across all targets.
To calculate mAP, first compute the Average Precision (AP) for each class (in object detection) or each query (in information retrieval). AP is the area under the precision-recall curve for that class/query. Once you have AP for each, mAP is simply the mean of those AP values.
This involves evaluating model predictions against ground truth, determining True Positives (TP), False Positives (FP), and False Negatives (FN) at various confidence thresholds, plotting precision vs. recall, and finding the average precision for each target.
Average Precision (AP) refers to the precision-recall area for a single class or query. It summarizes how well the model achieves high precision across all levels of recall for that one case. Mean Average Precision (mAP) is the mean of AP scores over all classes or queries. In object detection, you calculate AP for each object class and then average them to get mAP. (Note: Some sources use “AP” and “mAP” interchangeably when a single summary across classes is implied.)
Mean average precision tells you how good your model is at finding all the relevant results (objects or documents) while keeping predictions accurate. It’s a single score that integrates precision and recall for comprehensive evaluation.
We use various metrics to measure the performance of a model in computer vision tasks involving object detection or re-identification (ReID). The most commonly used metric is Mean Average Precision (mAP). It gives a single overall score of a model's prediction accuracy across all object categories.
The mAP is used in many benchmark challenges, such as the PASCAL Visual Object Classes (VOC) challenge, the COCO challenge, and others.
In this guide, we will cover what mAP is, how it is calculated from precision, recall, and IoU, how benchmarks such as PASCAL VOC and COCO use it, and how to improve it.
If you want to build your computer vision model with a high mAP score, then you need quality data for training. At Lightly, we help you optimize both the data and the model.
You can try both for free to see how intelligent curation cuts labeling costs and to build image classifiers, object detectors, and semantic segmentation models with higher mAP.
We first need to know about precision and recall metrics to understand mAP.
These two come from the outcomes of a model's predictions that are usually organized into a confusion matrix.
A confusion matrix is a table that shows how accurately a machine learning model’s predictions match the ground truth.
For a given class, there are four possible outcomes: True Positives (TP), where the model correctly predicts the class; False Positives (FP), where it predicts the class but the ground truth does not contain it; True Negatives (TN), where it correctly predicts the absence of the class; and False Negatives (FN), where it misses an actual instance of the class.
In object detection, things are much more complex than a simple right or wrong. The detection model predicts the object in the image and also tells its position with a bounding box.
To decide whether a prediction is correct, we measure how much the predicted box overlaps with the ground truth box using the Intersection over Union (IoU) metric.
Intersection over Union (also called the Jaccard Index) measures the overlap between the predicted bounding boxes and the ground truth bounding boxes.
It is calculated as a ratio of the overlap area (intersection) to the combined area covered by both boxes (union).
A higher IoU (closer to 1) indicates the predicted bounding box coordinates are more closely matched with the ground truth box coordinates, while an IoU near 0 means minimal overlap.
We calculate the IoU score for each detection and set a threshold to classify it. If a detection's IoU score is at or above the threshold, we classify it as a true positive; if it falls below, we classify it as a false positive.
Using the scores, we categorize the predictions as true positives, false negatives, and false positives.
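To make this concrete, here is a minimal sketch of the IoU calculation and thresholding step, assuming axis-aligned boxes in [x1, y1, x2, y2] format (the box coordinates below are made up for illustration):

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two boxes in [x1, y1, x2, y2] format."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Classify one detection against a ground truth box at an IoU threshold of 0.5
pred, gt = [12, 15, 110, 200], [10, 10, 100, 190]
score = iou(pred, gt)
print("IoU:", round(score, 3))
print("True positive" if score >= 0.5 else "False positive")
```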
Precision measures how accurate a model's positive predictions are. It is the ratio of correct positive predictions (TP) to the total number of positive predictions made (TP + FP). The precision formula is: Precision = TP / (TP + FP).
Recall measures how well the model finds all the actual positives. It is the ratio of correct positive predictions (TP) to the total number of actual positives (TP + FN). The recall formula is: Recall = TP / (TP + FN).
Let’s take an example to illustrate how recall and precision are calculated. Suppose we compare an image's ground truth boxes with the model's predicted boxes, with the IoU threshold set to 0.5.
Recall and precision are then computed for each class using the formulas above, by accumulating the counts of TP, FP, and FN.
Let’s calculate recall and precision for the ‘Person’ category:
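As a minimal sketch, assume the model produced 3 correct ‘Person’ detections, 1 spurious detection, and missed 2 people (these counts are assumed for illustration):

```python
# Assumed counts for the 'Person' class (for illustration)
tp, fp, fn = 3, 1, 2

precision = tp / (tp + fp)   # 3 / 4 = 0.75
recall = tp / (tp + fn)      # 3 / 5 = 0.60
print(f"precision={precision:.2f}, recall={recall:.2f}")
```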
Now, if you’re not satisfied with those results, what can you do to improve the performance? You can often adjust a model's confidence threshold.
But if you set a high confidence threshold, the model will only make predictions it is highly confident in. This usually leads to high precision but lower recall (it might miss some less obvious objects).
If the confidence threshold is set low, the model will make more predictions, even if it's not very sure. It can cause a high recall but lower precision. Because of this tradeoff, we use a precision-recall curve (PR curve) for a more complete view.
A precision recall curve is a graph that plots precision values (y-axis) against recall values (x-axis) at different confidence thresholds.
You rank all your model’s positive predictions by their confidence scores to create the curve, then calculate precision and recall at each point.
A good model maintains high precision even as recall increases, so its PR curve stays near the top-right corner of the graph. A poor model's precision drops quickly as it tries to achieve higher recall.
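Here is a minimal sketch of how a PR curve is built, assuming a set of hypothetical detections for one class that have already been matched to the ground truth:

```python
import numpy as np

# Hypothetical detections for one class: (confidence, is_true_positive).
# In practice, is_true_positive comes from IoU matching against the ground truth.
detections = [(0.95, True), (0.90, True), (0.80, False), (0.75, True),
              (0.60, False), (0.55, True), (0.40, False)]
num_ground_truths = 6  # assumed number of ground truth objects for this class

# Rank predictions by confidence, then accumulate TP and FP counts down the ranking
detections.sort(key=lambda d: d[0], reverse=True)
tp_cum = np.cumsum([d[1] for d in detections])
fp_cum = np.cumsum([not d[1] for d in detections])

precision = tp_cum / (tp_cum + fp_cum)  # precision at each confidence cutoff
recall = tp_cum / num_ground_truths     # recall at each confidence cutoff

for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
```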
The PASCAL Visual Object Classes 2012 (VOC2012) challenge uses the PR curve as a metric alongside average precision. It is a supervised learning challenge with labeled ground-truth images. The dataset includes 20 object classes such as person, bird, cat, dog, bicycle, car, chair, sofa, TV, bottle, etc.
The PR curve provides a clear visual summary, but the curve can be noisy, and its saw-tooth shape makes it difficult to estimate the performance of the model.
Similarly, it can be hard to compare different models when their PR curves cross each other. Therefore, we calculate a single number called the Average Precision (AP).
Average Precision (AP) is the area under the precision-recall curve. It shows the average of all the precision scores across different recall levels. A higher AP means the model gives better performance across all confidence thresholds.
A common technique for calculating AP is 11-point interpolation, used in the PASCAL VOC challenge. It averages the maximum precision value at 11 equally spaced recall points (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0).
For each recall point r, the interpolated precision is the highest precision measured at any recall r̃ ≥ r. Mathematically:

p_interp(r) = max_{r̃ ≥ r} p(r̃)

where p(r̃) is the measured precision at recall r̃. The 11-point AP is then the mean of the interpolated precision values over the 11 recall points:

AP = (1/11) · Σ_{r ∈ {0, 0.1, …, 1.0}} p_interp(r)
Graphically, this interpolation simply smooths out the spikes in the graph, turning the saw-tooth curve into a monotonically decreasing step function.
Let’s calculate the interpolated precision at each recall level using the p_interp formula; although our recall levels start around 0.2 rather than 0, the strategy remains the same. Then take the average of all the interpolated precision points to get AP.
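Here is a minimal sketch of the 11-point interpolated AP calculation, using assumed precision-recall values (consistent with the hypothetical detections sketched earlier):

```python
import numpy as np

def p_interp(recall, precision, r):
    """Interpolated precision: the highest precision observed at any recall >= r."""
    mask = recall >= r
    return precision[mask].max() if mask.any() else 0.0

def ap_11_point(recall, precision):
    """11-point interpolated AP as used by the original PASCAL VOC protocol."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    points = np.linspace(0.0, 1.0, 11)   # 0.0, 0.1, ..., 1.0
    return float(np.mean([p_interp(recall, precision, r) for r in points]))

# Assumed precision-recall values for one class
recall = [0.17, 0.33, 0.33, 0.50, 0.50, 0.67, 0.67]
precision = [1.00, 1.00, 0.67, 0.75, 0.60, 0.67, 0.57]
print(f"11-point AP: {ap_11_point(recall, precision):.3f}")
```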
Modern benchmarks have largely moved away from 11-point interpolation and instead calculate the exact area under the curve by considering all unique recall points (all-point interpolation), which provides a more precise AP value. Either way, this single AP score is a strong way to evaluate a model's performance on a single class.
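As a sketch of the all-point version, which integrates the interpolated curve over every unique recall value (COCO's official evaluation instead samples the interpolated curve at 101 recall points, which gives a very similar result):

```python
import numpy as np

def ap_all_points(recall, precision):
    """Area under the PR curve using every unique recall value (all-point interpolation)."""
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision), [0.0]))
    # Make precision monotonically decreasing from right to left (the interpolation step)
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum precision * recall-step at the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Same assumed precision-recall values as above
recall = [0.17, 0.33, 0.33, 0.50, 0.50, 0.67, 0.67]
precision = [1.00, 1.00, 0.67, 0.75, 0.60, 0.67, 0.57]
print(f"all-point AP: {ap_all_points(recall, precision):.3f}")
```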
In addition to the average precision, it's helpful to compare it with other evaluation metrics to better understand its benefits.
Here is a summary of some of the key metrics above and how they relate to each other.
Once we grasp average precision, the concept of mean average precision (mAP) is quite simple. The mAP is calculated by taking the mean of the AP scores over all classes. If a dataset includes N different classes, you compute the AP for each class and then average them.
For instance, suppose you have a model for three classes (cat, dog, bird) and you calculate an AP score for each class. The mAP is then simply the mean of those three AP values.
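As an illustration with assumed AP values: if AP(cat) = 0.80, AP(dog) = 0.65, and AP(bird) = 0.55, then mAP = (0.80 + 0.65 + 0.55) / 3 ≈ 0.67.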
Similarly, in the PASCAL VOC dataset, AP is computed for each of the 20 categories and then averaged to get the mean average precision.
The mAP is a standard metric for evaluating object detection models. Object detection algorithms like YOLO, SSD, and Faster R-CNN are all evaluated based on their mAP scores.
Here is a summary of the steps to calculate mAP in object detection:
1. Run the model and collect its predicted boxes, class labels, and confidence scores.
2. Match each prediction to the ground truth using IoU at a chosen threshold, labeling it as a true positive or false positive (unmatched ground truth boxes count as false negatives).
3. For each class, rank the predictions by confidence and compute precision and recall at every rank to build the precision-recall curve.
4. Calculate AP for each class as the area under its precision-recall curve.
5. Average the per-class AP values to get mAP.
A higher mAP score indicates that the model is both accurate in its classifications and good at localizing objects.
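In practice, you rarely compute all of this by hand. As a sketch, assuming the torchmetrics library (and its pycocotools dependency) is available, a COCO-style mAP for a single toy image could be computed like this:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# One toy image: a single predicted box with a confidence score, and one ground truth box
preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 100.0, 190.0]]),
    "scores": torch.tensor([0.90]),
    "labels": torch.tensor([0]),
}]
target = [{
    "boxes": torch.tensor([[12.0, 15.0, 110.0, 200.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision()   # COCO-style mAP@[.5:.95] by default
metric.update(preds, target)
result = metric.compute()
print(result["map"], result["map_50"])
```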
Different object detection benchmarks use slightly different ways to calculate mAP by varying the IoU threshold.
The PASCAL VOC challenge calculates mAP using a single IoU threshold of 0.5 (mAP@0.5). Earlier editions used the 11-point interpolated precision method for the AP calculation, while later editions switched to the all-point method.
The MS COCO challenge calculates mAP by averaging AP over 10 different IoU thresholds, from 0.5 to 0.95 in steps of 0.05 (mAP@[.5:.95]). This method rewards models that localize objects more precisely.
Similarly, the Google Open Images dataset, used in the Google Open Images Challenge, also uses mAP@0.5 for its detection track, but over a much larger set of 500 classes.
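The main difference between these benchmark styles is which IoU thresholds the AP is averaged over. Here is a small sketch; `ap_at_iou` is a hypothetical callable that stands in for a full per-class AP computation at one IoU threshold:

```python
import numpy as np

def coco_style_map(ap_at_iou, classes):
    """Average AP over the COCO IoU thresholds 0.50:0.95:0.05 and over all classes."""
    iou_thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95
    aps = [ap_at_iou(c, t) for c in classes for t in iou_thresholds]
    return float(np.mean(aps))

def voc_style_map(ap_at_iou, classes):
    """PASCAL VOC style: a single IoU threshold of 0.5 (mAP@0.5)."""
    return float(np.mean([ap_at_iou(c, 0.5) for c in classes]))

# Toy usage with a made-up AP function, just to show the averaging structure
fake_ap = lambda cls, iou_t: max(0.0, 0.95 - iou_t)
print(voc_style_map(fake_ap, ["cat", "dog", "bird"]))
print(coco_style_map(fake_ap, ["cat", "dog", "bird"]))
```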
So far, we have discussed mAP in terms of object detection, but it is also a key metric in information retrieval, where it assesses the effectiveness of search algorithms and recommendation systems.
An information retrieval task involves a search system, like a search engine, where someone types in a query and it returns a list of documents ranked by relevance.
Similar to object detection, mAP is calculated by averaging the average precision (AP) scores across multiple queries or search terms.
AP in the information retrieval context averages the precision values measured each time a relevant document appears in the ranked results for a single query. A higher mAP means the system is better at returning relevant information at the top of the search results.
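A minimal sketch for the retrieval setting, assuming a binary relevance label for each ranked result and that every relevant document appears somewhere in the list:

```python
def average_precision_ir(ranked_relevance):
    """AP for one query: average of precision@k taken at each relevant result's rank."""
    hits, precisions = 0, []
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevance of the top results returned for two queries (1 = relevant), assumed data
query_1 = [1, 0, 1, 1, 0]   # AP = (1/1 + 2/3 + 3/4) / 3 ≈ 0.81
query_2 = [0, 1, 0, 0, 1]   # AP = (1/2 + 2/5) / 2 = 0.45
map_score = (average_precision_ir(query_1) + average_precision_ir(query_2)) / 2
print(f"mAP over the two queries: {map_score:.3f}")
```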
Mean average precision is a go-to metric due to its numerous strengths for evaluating models in information retrieval and computer vision applications. However, it's also good to know its limitations.
Before diving into the drawbacks, first consider why mAP is so widely used.
Keep the following limitations in mind to get a balanced understanding of mAP's role.
Improving the model's mAP score is usually the primary goal in training. Here are some practical ways to do it:
Improving the model's mAP score starts with better training data, not just in quantity but in relevance. Raw data can be huge and contain many repetitive or easy examples that don't help the model learn and only add extra labeling costs.
We can use the LightlyOne platform to pick the most informative and varied samples that help raise mAP. It cuts out redundancy and focuses on underrepresented examples by using visual embeddings and self-supervised learning.
Using only quality data for training, LightlyOne helps reduce false positives and false negatives, which improves average precision for each class.
Instead of starting from scratch, we can use LightlyTrain to pretrain our own vision models on the unlabeled data selected with LightlyOne. The resulting models perform better than models initialized from pre-trained ImageNet weights or trained from scratch.
LightlyTrain achieves up to 36% higher mean average precision (mAP) in spotting objects compared to traditional training methods.
In the object detection use case, Lightly AI helped AI Retailer Systems reach 90% of their maximum mAP using only 20% of the training data.
Mean average precision (mAP) is the current standard performance metric and brings clarity to evaluating computer vision models. It provides a single score that combines the accuracy of predictions (precision) with the ability to detect all relevant items (recall).
You can better assess your models and make smarter training choices by understanding how mAP works. Although it has its complexities, mAP is a valuable tool for developing more accurate and robust computer vision systems.