Intersection over Union (IoU) is a key metric in object detection and segmentation that measures the overlap between predicted and ground truth boxes. It quantifies localization accuracy, impacts AP/mAP, and is equivalent to the Jaccard Index.
Here are quick answers to some common questions about IoU:
 IoU measures the overlap between a predicted bounding box and the ground truth bounding box of an object. It’s calculated as the area of their intersection divided by the area of their union, yielding a value between 0 and 1 (where 1 means perfect overlap). IoU is a core concept in object detection for evaluating how well the model localizes an object.
 To calculate IoU, find the overlap area of the predicted and ground truth boxes and divide it by the total area covered by both boxes (union). In practice, you determine the coordinates of the intersection rectangle (using the max of the left/top edges and min of the right/bottom edges of the two boxes), compute that intersection area, then sum both boxes’ areas and subtract the intersection to get the union. The IoU formula is: IoU = (Intersection Area) / (Union Area).
IoU is a crucial evaluation metric in computer vision, used to quantify localization accuracy and compare model performance. In object detection tasks, a higher IoU means the predicted box aligns better with the object’s true location, indicating higher localization precision. During evaluation, an IoU threshold (e.g., 0.5) is often applied to decide if a predicted box is a true positive detection. IoU thus directly affects metrics like Average Precision (AP) and mean Average Precision (mAP), which gauge overall model accuracy.
There’s no single “good” IoU for all cases – it depends on the task. A common choice is 0.5 (meaning the predicted box must overlap at least 50% with the ground truth to count as correct). An IoU of 1.0 indicates a perfect overlap, while 0 means no overlap at all. Higher IoU thresholds (like 0.75) demand more precise localization and result in higher precision but lower recall (fewer detections meet the criterion). In practice, 0.5 is used for a balanced evaluation, but some benchmarks report stricter metrics (e.g., AP@0.75). For segmentation tasks, mean IoU scores closer to 1.0 indicate more accurate pixel-wise overlaps between prediction and ground truth.
Yes – IoU is essentially the Jaccard Index (Jaccard similarity coefficient) applied to shapes in an image. Both terms refer to the ratio of overlap area to union area. In image segmentation, the IoU is often called the Jaccard index and is used to evaluate how well the predicted mask matches the ground truth mask. Mean IoU (mIoU) refers to the average IoU across all classes or instances, and it’s a standard metric for segmentation performance.
If you’ve ever worked on computer vision tasks like object detection or image segmentation, you have likely run into Intersection over Union (IoU), also known as the Jaccard Index. It gives a single overall score of a model's localization accuracy.
IoU plays a crucial role in many computer vision applications, including autonomous vehicles, medical imaging, and security systems.
In this guide, we will cover what IoU is and how to calculate it, how IoU thresholds feed into precision, recall, AP, and mAP, how IoU is used in image segmentation, IoU-based loss functions (GIoU, DIoU, and CIoU), and practical tips for improving IoU scores.
Getting high IoU scores depends on the quality of the data you feed into training and on how well the model is tuned. At Lightly, we help you improve both the data and the model.
Whether you're building object detectors or segmentation models, Lightly makes the process faster, smarter, and easier.
Intersection over union is an evaluation metric that quantifies the overlap between two regions. In the case of object detection and segmentation, these regions are the ground truth (the correct, hand-labeled area) and the predicted region from the model.
Simply put, IoU measures how well the prediction and ground truth agree on the area of the object.
We calculate the IoU by putting the overlap area (intersection) in the numerator and the combined area covered by both boxes or masks (union) in the denominator.
Mathematically:
IoU = I / U
Where A and B are the prediction and ground truth bounding boxes or masks, I denotes their intersection area, and U is their union area.
This ratio produces a value between 0 and 1: a value of 0 means the two regions do not overlap at all, and 1 means they match perfectly.
Object detection consists of two sub-tasks: localization (placing a bounding box around the object) and classification (assigning the object a class label).
But IoU focuses purely on localization. It only cares about how well the box was placed, not the object's class.
Let's look at a simple example. Suppose an image has a ground truth box for a cat with an area of 1000 pixels. If our model predicts a box overlapping 800 pixels of the ground truth (intersection), and the total area of the predicted box plus the ground truth (union) is 1200 pixels, then IoU = 800 / 1200 ≈ 0.67.
An IoU value of 0.67 shows a fairly good overlap, indicating that the model’s localization is close to the ground truth.
Ground truth data refers to verified, true data used to teach, validate, and test machine learning algorithms.
In computer vision, the IoU metric depends on the accuracy of this ground truth data (labeled data) when comparing results. If the ground truth annotations are wrong or inconsistent, even a perfect model will receive misleading IoU scores.
That’s why creating quality annotated datasets matters most in any vision project. It starts with curating good raw data that covers enough scenarios, including different object scales, angles, and lighting conditions.
LightlyOne helps you curate better raw data in exactly the quantity you need, saving time, effort, and annotation costs.
We can use the LightlyOne Selection feature to choose data using different strategies, such as DIVERSITY, TYPICALITY, or the BALANCE strategy to focus on specific classes.
Here’s a sample code example of configuring a LightlyOne selection. For more details, see our complete guide for common selection use cases.
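As a rough sketch, a selection run can be scheduled with a configuration like the one below. The structure follows the selection configuration documented for the LightlyOne Worker, but treat the exact keys, the task name, the class targets, and the sample count as placeholders to adapt to your own setup.

```python
from lightly.api import ApiWorkflowClient

# Connect to your LightlyOne dataset (token and dataset ID are placeholders).
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Schedule a selection run that combines a DIVERSITY strategy on embeddings
# with a BALANCE strategy on object detection predictions.
client.schedule_compute_worker_run(
    selection_config={
        "n_samples": 1000,  # how many images to select
        "strategies": [
            {
                "input": {"type": "EMBEDDINGS"},
                "strategy": {"type": "DIVERSITY"},
            },
            {
                "input": {
                    "type": "PREDICTIONS",
                    "task": "object_detection",   # name of your prediction task
                    "name": "CLASS_DISTRIBUTION",
                },
                "strategy": {
                    "type": "BALANCE",
                    "target": {"car": 0.5, "pedestrian": 0.5},  # desired class ratios
                },
            },
        ],
    },
)
```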
You can view and explore the selected dataset interactively on the LightlyOne Platform.
Once you have quality raw data, you can start labeling each object in the image (assigning a bounding box) that you want the model to detect. This process is called annotation or labeling. Labeling is usually done with specialized labeling software.
There are many data annotation tools available, and you can choose the one that best fits your needs.
Software like LabelImg, CVAT, or commercial platforms such as V7 Darwin and Encord help in creating ground truth bounding boxes through manual labeling.
Some of these tools use AI-assisted labeling, but human-verified ground truth is necessary for evaluation.
Ensuring the ground truth is error-free is crucial, since labeling mistakes will show up as apparent model errors in the IoU scores.
Annotating an image involves specifying the object's location using four numbers that represent its bounding box coordinates.
There are two common formats for this: the corner format (x_min, y_min, x_max, y_max), used by datasets like Pascal VOC, and the center format (x_center, y_center, width, height), used by YOLO-style annotations.
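For illustration, here is a small, self-contained sketch (the helper names are ours) for converting between the two formats:

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """Convert corner format to center format (x_center, y_center, width, height)."""
    width = x_max - x_min
    height = y_max - y_min
    return x_min + width / 2, y_min + height / 2, width, height

def center_to_corners(x_center, y_center, width, height):
    """Convert center format back to corner format (x_min, y_min, x_max, y_max)."""
    return (x_center - width / 2, y_center - height / 2,
            x_center + width / 2, y_center + height / 2)

# Example: a 100x50 box whose top-left corner is at (200, 120).
print(corners_to_center(200, 120, 300, 170))  # (250.0, 145.0, 100, 50)
```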
We can also get ground truth from public datasets like COCO (Common Objects in Context), Pascal VOC, or ImageNet, which provide pre-annotated images with bounding boxes.
After annotating, split your data into training, validation, and test sets.
The training set usually makes up 70-80% of the total data, validation 10-15%, and testing 10-15%.
Separation ensures that when we calculate IoU scores on the test set, we assess the model's ability to handle new data, not just recall what it has already seen in training.
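A quick sketch of such a split with scikit-learn, assuming your annotated images are listed by file name (the 80/10/10 ratio below is just one reasonable choice):

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of annotated image file names.
images = [f"img_{i:04d}.jpg" for i in range(1000)]

# First hold out 20% of the data, then split that portion half-and-half
# to get roughly 80% train, 10% validation, and 10% test.
train_set, rest = train_test_split(images, test_size=0.2, random_state=42)
val_set, test_set = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```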
Now that we understand the concepts of IoU, let's dive deeper into its calculation. Assume we have two boxes, A and B, with corners marked by superscripts. Each box is defined by its upper-left (x0, y0) and bottom-right (x1, y1) corners.
First, let's calculate the Intersection area.
The image below shows how we can find the upper-left (x0^I, y0^I) and bottom-right (x1^I, y1^I) corners of the intersection between two overlapping boxes. The superscript I marks coordinates of the intersection rectangle.
So, the coordinates of the intersection rectangle are:
x0^I = max(x0^A, x0^B), y0^I = max(y0^A, y0^B)
x1^I = min(x1^A, x1^B), y1^I = min(y1^A, y1^B)
Then, we can calculate the intersection area as:
I = max(0, x1^I - x0^I) × max(0, y1^I - y0^I)
The max with 0 handles the case where the boxes do not overlap at all.
Let’s now calculate the union area. The union of two bounding boxes equals the sum of their areas minus the intersection area:
U = Area(A) + Area(B) - I
Where:
Area(A) = (x1^A - x0^A) × (y1^A - y0^A) and Area(B) = (x1^B - x0^B) × (y1^B - y0^B).
Finally, we can calculate the IoU as:
IoU = I / U
Here is a simple Python implementation of the steps above. This source code makes the logic clear.
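A minimal sketch of that logic could look like this (the function name compute_iou is ours for illustration):

```python
def compute_iou(box_a, box_b):
    """IoU of two boxes given as (x0, y0, x1, y1) corner coordinates."""
    # Corners of the intersection rectangle.
    x0_i = max(box_a[0], box_b[0])
    y0_i = max(box_a[1], box_b[1])
    x1_i = min(box_a[2], box_b[2])
    y1_i = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes do not overlap.
    intersection = max(0.0, x1_i - x0_i) * max(0.0, y1_i - y0_i)

    # Union = sum of both box areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

print(compute_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```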
For practical applications, you would often use a vectorized Numpy implementation to compute IoUs for thousands of boxes at once, which is much more efficient.
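A sketch of such a vectorized version (pairwise_iou is an illustrative name; boxes are assumed to be valid, i.e., x1 > x0 and y1 > y0):

```python
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """IoU matrix between boxes_a of shape (N, 4) and boxes_b of shape (M, 4),
    both in (x0, y0, x1, y1) format. Returns an (N, M) array."""
    # Broadcast so every box in A is compared against every box in B.
    x0 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y0 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x1 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y1 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])

    intersection = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - intersection

    return intersection / union  # valid boxes guarantee union > 0
```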
Now that we can calculate IoU for any pair of boxes, we can use it to evaluate an object detection model by applying an IoU threshold. This is a key step for benchmarking object detection algorithms.
An IoU threshold is a cutoff value (between 0 and 1) that we choose to decide whether a detection is correct or not.
For a given predicted bounding box, we compare it to the corresponding ground truth bounding box: if their IoU is at or above the threshold, the detection counts as a True Positive (TP); if it falls below, it counts as a False Positive (FP).
Other types of errors also count as False Positives, such as predicting an object where there is none, or detecting the same object as a duplicate. A False Negative (FN) occurs when there is a ground truth object in the image, but the model fails to detect it.
Using these TP, FP, and FN counts, we calculate Precision and Recall:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
If we make our model more sensitive, it might find more objects (higher recall) but also make more mistakes (lower precision). The IoU threshold directly affects this, as a low threshold (such as 0.3) favors recall with more TPs, while a high threshold (0.8) favors precision with fewer TPs.
To avoid this trade-off, we use Average Precision (AP) to get a single number that sums up the model's performance across all confidence thresholds. Mathematically:
AP = ∫₀¹ p(r) dr
Where p(r) is the measured precision at recall r.
AP is the area under the Precision-Recall curve (PR curve). A higher AP means the model maintains high precision even as recall increases.
In most datasets, we have multiple object classes. We calculate the AP for each class individually, and then compute the mean Average Precision (mAP), which is simply the average of the AP scores across all classes.
mAP is the most important evaluation metric for object detection benchmarks.
All new models, like YOLO or others, almost always report their performance as an mAP score on a standard dataset like COCO.
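To make the AP and mAP definitions concrete, here is a minimal sketch that approximates AP as the area under the precision-recall curve. It is not the exact COCO or Pascal VOC protocol (both add interpolation and specific matching rules), and the function names are illustrative:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truths):
    """Approximate AP for one class as the area under the PR curve.

    scores: confidence of each detection.
    is_true_positive: 1 if the detection matched a ground truth box with
                      IoU >= threshold, else 0 (a false positive).
    num_ground_truths: total number of ground truth boxes for this class.
    """
    order = np.argsort(scores)[::-1]                      # sort by confidence
    tp = np.cumsum(np.asarray(is_true_positive)[order])
    fp = np.cumsum(1 - np.asarray(is_true_positive)[order])

    precision = tp / (tp + fp)
    recall = tp / num_ground_truths

    # Step-wise integration of precision over recall.
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * precision))

def mean_average_precision(per_class_aps):
    """mAP is simply the mean of the per-class AP values."""
    return float(np.mean(per_class_aps))
```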
The IoU is also a crucial metric in image segmentation, used to measure the overlap between two shapes.
Image segmentation involves dividing an image into smaller regions where each part has similar features or qualities.
Pro tip: Read about Instance Segmentation and Semantic Segmentation.
The segmentation model's output is a "mask," which is a set of pixels predicted to belong to a certain object. The ground truth is also a mask, and we compare these two masks to evaluate the model's prediction.
The IoU formula remains the same, but we apply it at the pixel level.
This is equivalent to calculating TP / (TP + FP + FN), where TP counts the pixels correctly predicted as the object, FP counts the pixels predicted as the object that are not in the ground truth mask, and FN counts the ground truth pixels the prediction missed.
Similar to object detection, segmentation tasks often involve multiple classes, and we calculate IoU for each class separately. Then we take the mean Intersection over Union (mIoU), which is the average of these individual IoU scores.
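A minimal sketch of pixel-level IoU and mIoU with NumPy (function names are illustrative; real evaluation code usually also skips classes that are absent from both masks):

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two boolean masks of the same shape."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0

def mean_iou(pred_labels, gt_labels, num_classes):
    """mIoU over all classes for a semantic segmentation prediction,
    where pred_labels and gt_labels hold one class index per pixel."""
    ious = [mask_iou(pred_labels == c, gt_labels == c) for c in range(num_classes)]
    return float(np.mean(ious))
```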
Although IoU is a great evaluation metric, it has some drawbacks when used as a loss function to train a neural network.
A loss function's role is to show the model how wrong its prediction is and guide it toward a better one.
Here is why we need to go beyond plain IoU as a loss: when the predicted and ground truth boxes do not overlap at all, IoU is exactly 0 no matter how far apart the boxes are, so the loss gives the model no gradient signal about which direction to move or resize the prediction.
To address these shortcomings of the standard IoU, researchers introduced Generalized Intersection over Union (GIoU). It improves IoU by using the smallest convex object C that encloses both bounding boxes (A and B).
GIoU is calculated as:
GIoU = IoU - Area(C \ (A ∪ B)) / Area(C)
That is, we subtract from the IoU the fraction of the enclosing box C that is covered by neither A nor B.
Then the GIoU loss is calculated as:
L_GIoU = 1 - GIoU
Now, even when IoU is zero (no overlap), GIoU is not zero. It becomes a negative value that gets closer to 0 as the predicted box moves closer to the ground truth. It provides the needed gradient for the model to learn and solves the zero IoU problem.
A sketch of how the corresponding bounding-box loss can be computed is shown below.
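Here is an illustrative Python sketch of the GIoU loss for a single box pair, reusing the compute_iou helper sketched earlier (names and structure are ours, not a reference implementation):

```python
def giou_loss(box_pred, box_gt):
    """GIoU loss (1 - GIoU) for two boxes in (x0, y0, x1, y1) format."""
    iou = compute_iou(box_pred, box_gt)

    # Recompute the union, needed for the enclosing-box penalty term.
    inter_w = max(0.0, min(box_pred[2], box_gt[2]) - max(box_pred[0], box_gt[0]))
    inter_h = max(0.0, min(box_pred[3], box_gt[3]) - max(box_pred[1], box_gt[1]))
    area_pred = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_pred + area_gt - inter_w * inter_h

    # Smallest enclosing box C that covers both boxes.
    c_w = max(box_pred[2], box_gt[2]) - min(box_pred[0], box_gt[0])
    c_h = max(box_pred[3], box_gt[3]) - min(box_pred[1], box_gt[1])
    area_c = c_w * c_h

    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```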
GIoU loss outperforms IoU and MSE loss functions, resulting in considerable performance gains in object detection models like YOLOv3.
After GIoU, other researchers introduced DIoU (Distance-IoU) and CIoU (Complete-IoU) to improve the convergence even faster and more stably.
DIoU enhances bounding box regression by directly minimizing the normalized distance between the centers of the predicted and ground truth boxes, which speeds up convergence regardless of how the two boxes are positioned relative to each other.
The equation for DIoU loss is:
L_DIoU = 1 - IoU + ρ²(b, b^gt) / c²
Where ρ(b, b^gt) is the Euclidean distance between the centers of the predicted box b and the ground truth box b^gt, and c is the diagonal length of the smallest box enclosing both.
CIoU extends DIoU by also considering differences in aspect ratio between the boxes. It helps ensure that the predicted box has a similar shape (width-to-height ratio) as the ground truth box.
The CIoU loss function is:
L_CIoU = 1 - IoU + ρ²(b, b^gt) / c² + αv
where v measures the difference in aspect ratio between the two boxes:
v = (4 / π²) × (arctan(w^gt / h^gt) - arctan(w / h))²
Alpha is a trade-off parameter (a function of IoU) defined as:
α = v / ((1 - IoU) + v)
Both DIoU and CIoU show much faster convergence than IoU and GIoU, reaching lower regression errors.
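For a concrete picture, here is an illustrative sketch of the DIoU and CIoU losses for a single box pair, again reusing compute_iou from earlier. Training code in a framework like PyTorch would operate on tensors and handle degenerate boxes more carefully:

```python
import math

def diou_ciou_loss(box_pred, box_gt, use_ciou=True):
    """DIoU loss, or CIoU loss if use_ciou=True, for (x0, y0, x1, y1) boxes."""
    iou = compute_iou(box_pred, box_gt)

    # Squared distance between the box centers.
    cx_p, cy_p = (box_pred[0] + box_pred[2]) / 2, (box_pred[1] + box_pred[3]) / 2
    cx_g, cy_g = (box_gt[0] + box_gt[2]) / 2, (box_gt[1] + box_gt[3]) / 2
    center_dist_sq = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2

    # Squared diagonal of the smallest enclosing box.
    c_w = max(box_pred[2], box_gt[2]) - min(box_pred[0], box_gt[0])
    c_h = max(box_pred[3], box_gt[3]) - min(box_pred[1], box_gt[1])
    diag_sq = c_w ** 2 + c_h ** 2

    loss = 1.0 - iou + center_dist_sq / diag_sq  # DIoU loss

    if use_ciou:
        # Aspect-ratio consistency term v and its trade-off weight alpha.
        w_p, h_p = box_pred[2] - box_pred[0], box_pred[3] - box_pred[1]
        w_g, h_g = box_gt[2] - box_gt[0], box_gt[3] - box_gt[1]
        v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
        alpha = v / ((1.0 - iou) + v + 1e-7)  # small epsilon avoids division by zero
        loss += alpha * v

    return loss
```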
Here is a comparison summary:
Loss | Key idea | Behavior when boxes do not overlap
IoU | Ratio of intersection to union | No gradient signal
GIoU | Adds a penalty based on the smallest enclosing box C | Provides a gradient
DIoU | Adds a penalty on the normalized distance between box centers | Provides a gradient, faster convergence
CIoU | Adds an aspect-ratio consistency term on top of DIoU | Provides a gradient, fastest and most stable convergence
Here are some tips to improve IoU when building object detectors or segmentation models:
Getting high IoU scores in object detection and segmentation starts with training on data that prepares the model for your specific use case.
Public datasets like COCO and ImageNet are useful, but they often lack the unique features of specific industrial or real-world domains. LightlyTrain addresses this by using self-supervised learning. It lets you use your own unlabeled data (images or videos) to build stronger models.
Instead of starting with generic ImageNet weights, LightlyTrain adapts your model to the specific details of your domain before you begin the fine-tuning with labeled data.
Pretraining on your unlabeled data helps the model learn relevant features, so when you fine-tune on a smaller, curated dataset (like one selected by LightlyOne), it learns faster and more effectively. This results in improved localization accuracy and, as a result, higher IoU scores.
Intersection over Union (IoU) is a key metric that brings clarity to computer vision evaluation, especially when used alongside other metrics like mAP. It gives a single score that tells you how closely the model's predicted box matches the true box.
Knowing about IoU helps you better diagnose localization errors and make smarter training choices, from setting evaluation thresholds to using advanced loss functions. With the right tools and data curation, achieving high IoU scores becomes easier and faster.