YOLO (You Only Look Once) is a real-time object detection model known for its speed and accuracy. Learn how YOLO works, explore the different model versions and tools, and discover real-world use cases from autonomous driving to surveillance.
YOLO (You Only Look Once) is a real-time object detection algorithm that treats detection as a single regression problem. A single neural network predicts multiple bounding boxes and class probabilities for objects in one pass over the image. This one-stage approach makes YOLO extremely fast compared to traditional two-stage detectors.
YOLO divides the input image into a grid and predicts bounding boxes (with coordinates for each box) and confidence scores for objects in those grid cells. If an object's center falls in a grid cell, that cell is responsible for detecting it. The network outputs the box coordinates, objectness score, and class probabilities for each predicted box, then uses Non-Maximum Suppression to filter overlapping detections. Unlike region proposal methods, YOLO processes the entire image in one forward pass – hence "you only look once".
The YOLO family has evolved from YOLOv1 (2016) to YOLOv8 (2023), each improving accuracy and speed. For example, YOLOv2 introduced anchor boxes and batch normalization for better localization. YOLOv3 added a deeper backbone (Darknet-53) and multi-scale predictions (detecting small objects better). YOLOv4 incorporated CSPNet and mosaic data augmentation to further boost performance. Modern versions like YOLOv5 to YOLOv8 focus on lighter models, new neural network layers, and easier training, keeping YOLO state-of-the-art in real-time detection.
YOLO is used in any application requiring fast object detection. Notable examples include autonomous driving (detecting cars and pedestrians in real time), video surveillance (people or package detection on security cameras), robotics (vision for drones and industrial robots), and even medical imaging (e.g., detecting anomalies in scans). Its ability to detect objects in live video at high FPS makes it ideal for embedded vision systems and edge applications.
YOLO is available in open-source implementations. The original C/C++ Darknet framework (by Joseph Redmon) provides pre-trained YOLOv1-v4 models. For easier use, Python-based libraries like Ultralytics YOLOv5/YOLOv8 offer models pretrained on COCO and simple APIs to detect objects in images or video. You can fine-tune YOLO on a custom dataset by annotating images with bounding boxes and training the network (many tutorials and GitHub repos guide this). Because YOLO is open-source, a large community has built tools, extensions, and improvements around it, making it accessible even if you’re not training from scratch.
YOLO (You Only Look Once) is one of the most popular object detection models, known for its speed and accuracy. It processes images in real time, making it useful for applications like autonomous driving, surveillance, and robotics.
Here we will cover:
- What YOLO is and how it works
- One-stage vs. two-stage detectors, and how YOLO compares to other frameworks
- The evolution of YOLO from v1 to v12
- Tools and libraries for training and deploying YOLO
- Real-world use cases
By the end, you'll understand how YOLO works, its strengths and trade-offs, and how to use it for various object detection tasks.
YOLO is a real-time object detection model that can process an entire image in a single pass. Introduced by Joseph Redmon et al. in 2015, YOLO reframed object detection as a single end-to-end regression problem: it directly maps image pixels to bounding box coordinates and class probabilities. This design made YOLO significantly faster than previous approaches.
Previously, popular object detection models like Fast R-CNN used a two-stage approach: they would first generate region proposals and then classify them, which made the pipeline complex and too slow for real-time processing.
YOLO used a single convolutional neural network (CNN) and eliminated the region proposal step. This was revolutionary as it made the process simple and enabled real-time detection with competitive accuracy.
The first version of YOLO had lower localization accuracy compared to two-stage methods, but later versions (YOLOv2, v3, etc.) closed this gap. The ability to process at 30+ FPS with high mean Average Precision (mAP) made YOLO practical for real-time applications like video analysis, drone vision, and mobile object detection.
Object detection models are typically categorized into two groups: two-stage and one-stage detectors. The key difference lies in how they process an image to detect objects.
Two-Stage Detectors: High Accuracy, Slower Speed
Two-stage detectors, like Faster R-CNN, break object detection into two separate steps:
1. A region proposal stage that identifies areas of the image likely to contain objects.
2. A classification and refinement stage that labels each proposed region and adjusts its bounding box.
This method is highly accurate because the deep learning model focuses on likely object regions before classifying them. However, it also adds computation and makes detection slower, typically 5-7 FPS on a high-end GPU.
One-Stage Detectors: Faster, Real-Time Performance
One-stage detectors, like YOLO and the Single Shot Detector (SSD), skip the region proposal step and predict bounding boxes and class labels in a single network pass. This direct approach makes them significantly faster.
YOLO was one of the first one-stage detectors, and later versions achieved high accuracy, outperforming other single-shot models like SSD.
While YOLO dominates in speed, it’s useful to understand how it compares with other detection frameworks.
Faster R-CNN uses a Region Proposal Network (RPN) to generate ~300 object regions before classification. It achieves high accuracy but runs at roughly 5-7 FPS with a ResNet-101 backbone. YOLOv3, in contrast, runs at 20-45 FPS with slightly lower accuracy. While two-stage models historically had better localization for small objects, YOLOv7 has surpassed many two-stage models in accuracy.
The Single Shot MultiBox Detector (SSD) also uses a one-stage approach like YOLO, with multi-scale feature maps and anchor boxes. With YOLOv4, however, YOLO achieved a significant accuracy improvement over SSD.
RetinaNet used focal loss to handle class imbalance during training and achieved accuracy comparable to two-stage detectors. It improved the detection of small objects, but at the cost of speed. Later YOLO versions (v4, v5) outperformed RetinaNet in both speed and accuracy, making YOLO the better choice for real-time tasks.
EfficientDet uses a pretrained EfficientNet backbone followed by a BiFPN feature network. This improved accuracy, but at lower speeds: EfficientDet-D4 matched YOLOv4’s accuracy but ran at ~8-11 FPS, while YOLOv4 achieved 62 FPS. Even EfficientDet-D7X, the most accurate variant, was slower than YOLOv7, which outperformed it in accuracy as well.
How YOLO Object Detection Works
Here are some of the key components involved in the YOLO object detection algorithm:
YOLO first divides the input image into an S x S grid. Each grid cell is then responsible for detecting objects whose centers fall within it.
Each bounding box is defined by its center coordinates, width, height, and a confidence score that indicates the likelihood of an object being present. The model also assigns class probabilities to each grid cell, allowing it to identify different objects in a single inference step.
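To make this concrete, here is a minimal sketch of the resulting output tensor shape, using the values from the original YOLOv1 paper (S=7 grid, B=2 boxes per cell, C=20 PASCAL VOC classes):

import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (YOLOv1 on PASCAL VOC)

# Each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities,
# so the network output is an S x S x (B * 5 + C) tensor -- 7 x 7 x 30 for YOLOv1.
output = np.zeros((S, S, B * 5 + C))
print(output.shape)  # (7, 7, 30)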
The algorithm often predicts multiple overlapping boxes for the same object. To eliminate duplicates, Non-Maximum Suppression (NMS) discards lower-confidence boxes that overlap heavily with higher-confidence ones. This ensures that detections are not redundant.
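Here is a minimal NumPy sketch of greedy NMS, as a simplified illustration rather than any particular YOLO implementation (boxes are assumed to be in (x1, y1, x2, y2) format):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box, drop boxes that overlap it."""
    order = scores.argsort()[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # keep only boxes whose overlap with the kept box is below the threshold
        order = order[1:][iou < iou_threshold]
    return keep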
This score represents the probability that a bounding box contains an object and how well the predicted box fits the object. The class-specific confidence score computed here is different from the objectness score assigned earlier to each bounding box, because it also folds in the class probability.
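In the original YOLOv1 formulation, the two scores are defined as:

\[
\text{confidence} = \Pr(\text{Object}) \cdot \text{IoU}^{\text{truth}}_{\text{pred}}
\]
\[
\text{class-specific confidence} = \Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IoU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IoU}^{\text{truth}}_{\text{pred}}
\]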
The earlier versions of YOLO struggled with small objects. YOLOv2 added a passthrough layer and multi-scale training, and YOLOv3 later introduced an FPN-style design to detect objects at different resolutions.
YOLO's single-shot architecture enables real-time performance, even on local machines. Its balance of speed, accuracy, and accessibility has made it widely adopted in various applications.
Since its introduction in 2015, YOLO has evolved through multiple versions, each iteration improving the architecture to enhance accuracy, efficiency, and adaptability. Here is an overview of each iteration:
YOLOv1 was the first model to unify object detection into a single neural network. It used a 24-layer CNN similar to GoogLeNet and predicted two bounding boxes per grid cell across PASCAL VOC's 20 classes.
Key Features
Performance
YOLOv1 achieved 63.4% mAP on PASCAL VOC 2007 at 45 FPS on a GPU. However, it had lower accuracy than region-based methods like Faster R-CNN, particularly for small and overlapping objects.
Impact
YOLOv1 proved real-time object detection was feasible on a single GPU.
YOLOv2, also called YOLO9000, improved upon v1 by incorporating anchor boxes, a new backbone (Darknet-19), and batch normalization. Joint training on detection and classification data (COCO and ImageNet) allowed it to detect over 9000 object categories.
Key Features
Performance
YOLOv2 runs at 67 FPS with 76.8 mAP on VOC 2007 (78.6 mAP at 40 FPS), and reaches 21.6% AP on COCO. It surpasses YOLOv1 in both speed and accuracy.
Impact
YOLOv2 bridged the performance gap with state-of-the-art detectors while maintaining real-time speed, making it practical for industry applications.
YOLOv3 introduced a deeper backbone (Darknet-53) compared to the Darknet-19 used in YOLOv2. The backbone uses residual connections, and a feature pyramid network (FPN) enables multi-scale object detection.
Key Features
Performance
YOLOv3 achieved 30 FPS with 33% AP on COCO while significantly improving detection accuracy, especially for small objects. It was a strong competitor to Faster R-CNN, SSD, and RetinaNet while being 3-4 times faster, which made it the preferred choice for practical applications.
Impact
YOLOv3 became a widely used real-time detector, balancing speed and accuracy. However, in 2020, Redmon ceased research on YOLO, leaving further development to the community.
Developed by Bochkovskiy, Wang, and Liao, YOLOv4 improved both speed and accuracy using CSPDarknet-53 as a backbone and numerous architectural optimizations.
Key Features
Performance
YOLOv4 reached 62 FPS on a Tesla V100, offering a superior speed-accuracy balance. It surpassed YOLOv3 in both mAP and efficiency.
Impact
YOLOv4 established itself as the top choice for real-time object detection in 2020, gaining widespread adoption in research and industry.
YOLOv5, released by Ultralytics, was the first major YOLO version implemented in PyTorch. While not an official research paper, it became extremely popular due to its ease of use and modular framework.
Key Features
Performance
It was faster than YOLOv4 in both training and inference, with competitive accuracy across benchmarks.
Impact
YOLOv5 became widely adopted in computer vision applications because of its ease of use and performance, and it was also optimized for mobile deployment.
YOLOv6 was developed by Meituan and optimized for industrial applications. It focused on efficiency and introduced an anchor-free architecture to improve detection accuracy and speed.
Key Features
Performance
It achieves a higher FPS than YOLOv5 while maintaining competitive accuracy.
Impact
It was optimized for edge deployment in industrial scenarios and became widely used in manufacturing and automation thanks to its low latency and high-speed inference.
YOLOv7 was developed by WongKinYiu and AlexeyAB as an independent research effort, focusing on balancing speed and accuracy. It introduced efficient reparameterization techniques.
Key Features
Performance
It is faster and more accurate than YOLOv5 and YOLOv6, achieving higher mAP at lower latency than previous YOLO versions.
Impact
It was used in real-time video analytics and robotics due to its high accuracy and efficiency.
YOLOv8 refined previous improvements with a more flexible architecture, optimized for various real-world applications.
Key Features
Performance
YOLOv8 achieved higher accuracy and better generalization while keeping real-time performance intact. It remains one of the most widely used single-shot object detection models today.
Impact
It became one of the most popular YOLO versions due to its ease of use and high accuracy, and is commonly used in autonomous vehicles, surveillance, and retail analytics.
YOLOv9 introduced Programmable Gradient Information (PGI) and the GELAN architecture, optimizing both speed and accuracy. It refined feature aggregation and backbone efficiency for improved small-object detection in real-time applications.
Key Features
Performance
YOLOv9 demonstrates improved mAP over YOLOv8, with reduced latency, making it suitable for applications requiring swift and accurate object detection. YOLOv9 is also capable of performing object detection, segmentation, and classification tasks.
Impact
Its accuracy and efficiency gains broadened YOLO's applicability across industries.
YOLOv10, developed at Tsinghua University, introduced NMS-free training with consistent dual assignments and added partial self-attention for feature extraction, boosting performance in complex real-world scenarios. It improved generalization across diverse datasets while reducing computational overhead.
Key Features
Performance
YOLOv10 variants exhibit significant improvements over previous versions, achieving up to 54.4% AP (val) with reduced latency. It is optimized for real-time and edge computing applications.
Impact
YOLOv10 offers a range of model sizes to accommodate different computational resources and accuracy needs. This efficiency-driven design set new benchmarks for real-time object detection, making it ideal for applications in resource-constrained environments.
YOLOv11 refines the CNN architecture with attention-augmented blocks and a dynamic head design, improving accuracy with fewer parameters. It supports tasks such as object detection, segmentation, classification, keypoint detection, and oriented bounding box detection.
Key Features
Performance
YOLOv11 outperforms previous versions in speed and accuracy on the COCO dataset, achieving higher mAP than YOLOv8 with fewer parameters, making it suitable for a wide range of applications.
Impact
It utilizes a better neck and backbone architecture, enhancing feature extraction capabilities for more precise object detection. YOLOv11 also expanded object detection use cases, particularly for dense scenes and complex environments.
YOLOv12 integrates attention mechanisms into the YOLO framework. This design combines CNN speed with transformer-based enhancements.
Key Features
Performance
It shows improved detection accuracy in poor lighting (reported gains of around 25%), and its multi-object tracking support makes it more robust in motion-heavy scenarios.
Impact
Sets a new benchmark in object detection with improved speed and accuracy. This makes YOLOv12 particularly effective in applications such as autonomous driving, security surveillance, and industrial automation.
Take a look at this comparison table.
To train, fine-tune, or run inference with a YOLO model, you will need the right tools. Here are the key libraries, frameworks, and deployment solutions:
PyTorch
PyTorch is a fan favorite for a good reason: it is flexible, easy to debug, and has great support for GPU acceleration. Most modern YOLO versions are built on PyTorch as well.
You can install it with:
pip install torch torchvision
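Once installed, you can quickly check that PyTorch sees your GPU, which is useful before starting a training run:

import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is usable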
TensorFlow and Keras
YOLOv3 and YOLOv4 have TensorFlow implementations, and TensorFlow Lite (TFLite) makes it easy to deploy on mobile devices.
To get started:
pip install tensorflow
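As a rough sketch of the TFLite deployment path, assuming you already have a YOLO model exported as a TensorFlow SavedModel (the directory and file names below are placeholders):

import tensorflow as tf

# Convert a SavedModel export of a YOLO model to TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_saved_model("yolo_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional post-training optimization
tflite_model = converter.convert()

with open("yolo.tflite", "wb") as f:
    f.write(tflite_model)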
Darknet is where YOLO started. It’s a C-based framework built for speed. While newer YOLO versions have moved to PyTorch, Darknet and its community forks still support YOLOv1 through YOLOv4 and YOLOv7.
You can start with:
git clone https://github.com/pjreddie/darknet
cd darknet
make
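After building, you can run a pretrained model from the command line; for example, the classic detection demo from the Darknet docs (the yolov3.weights file must be downloaded separately):

./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg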
Ultralytics
Ultralytics is a PyTorch-based implementation that simplifies training, fine-tuning, and deployment of YOLO models.
pip install ultralytics
To run inference on an image:
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # load a pretrained YOLOv8 nano model
results = model("image.jpg")  # run inference; returns a list of Results objects
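Fine-tuning on a custom dataset uses the same API; here is a minimal sketch, assuming your annotations are described by a YOLO-format dataset YAML (the file name below is a placeholder):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # start from pretrained COCO weights
model.train(data="custom_dataset.yaml", epochs=50, imgsz=640)  # placeholder dataset config
metrics = model.val()  # evaluate on the validation split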
MMDetection
MMDetection is a PyTorch-based object detection framework developed by OpenMMLab. If you want more customization, it is a solid choice: it’s modular and great for large-scale training.
pip install mmdet
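A minimal inference sketch with MMDetection's high-level API (exact behavior varies between MMDetection 2.x and 3.x; the config and checkpoint paths below are placeholders for files from the MMDetection model zoo):

from mmdet.apis import init_detector, inference_detector

# Placeholder config/checkpoint for a YOLO-family model from the model zoo
model = init_detector("yolox_config.py", "yolox_checkpoint.pth", device="cuda:0")
result = inference_detector(model, "image.jpg")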
ONNX Runtime
If you need to run YOLO models on different hardware, ONNX Runtime helps: you convert your model to ONNX so it can run on CPUs, GPUs, or even dedicated AI chips.
import onnxruntime as ort

# Load an ONNX export of a YOLO model (e.g., produced with Ultralytics' model.export(format="onnx"))
session = ort.InferenceSession("yolov8.onnx")
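To actually run the session, you feed a preprocessed image tensor under the model's input name. A minimal sketch with a dummy input (the 1x3x640x640 shape assumes a standard YOLOv8 export):

import numpy as np

input_name = session.get_inputs()[0].name
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)  # NCHW float32, placeholder for a real image
outputs = session.run(None, {input_name: dummy})  # raw predictions; still need decoding + NMS
print(outputs[0].shape)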
YOLO’s combination of speed and accuracy has led to its adoption in a wide range of fields. Here are some prominent use cases:
- Autonomous driving: detecting cars, pedestrians, and obstacles in real time
- Video surveillance: people and package detection on security cameras
- Robotics: vision for drones and industrial robots
- Medical imaging: detecting anomalies in scans
- Retail analytics
YOLO (You Only Look Once) has come a long way, evolving into one of the fastest and most efficient single-shot object detection models out there. From YOLOv1 to the latest YOLOv12, each version has pushed the boundaries of speed, accuracy, and efficiency.
While YOLO remains a top choice for real-time vision tasks, selecting the right version ensures optimal performance for specific use cases.