Semantic Segmentation: A Practical Guide

Semantic segmentation is a computer vision technique that labels each pixel in an image with a class, enabling detailed scene understanding. Unlike image classification or object detection, it provides pixel-level precision. This guide explains how it compares to instance and panoptic segmentation, covers key models like U-Net and DeepLab, and highlights applications in self-driving cars, medical imaging, and more.

Ideal for: ML Engineers and AI researchers
Reading time: 10 mins
Category: Models


TL;DR
  • What is semantic segmentation? 

It’s a computer vision task where each pixel in an image is classified into a category (e.g. road, car, person). In semantic segmentation (also called semantic image segmentation), all pixels belonging to the same class are labeled with that class, producing a segmentation mask that outlines those regions.

  • How is semantic segmentation different from image classification or object detection? 

Image classification assigns a single label to an entire image, while object detection draws bounding boxes around objects. Semantic segmentation goes further by providing pixel-level labeling: instead of boxes, it outputs a segmentation map with a class label for every pixel. This captures object boundaries much more precisely.

  • Semantic vs. instance vs. panoptic segmentation? 

Semantic segmentation groups all objects of the same class together (all “car” pixels share one label), whereas instance segmentation separates each object instance (each car is a separate mask). Panoptic segmentation combines both – it provides a unified mask where each pixel has a class label and, for countable objects, an instance ID.

  • What are common semantic segmentation models? 

Many segmentation models use deep learning. Early models like the Fully Convolutional Network (FCN) converted classification CNNs to produce pixel-wise outputs. Advanced architectures (e.g. U-Net, DeepLab, Pyramid Scene Parsing Network (PSPNet)) use an encoder-decoder architecture with techniques like skip connections, atrous spatial pyramid pooling, and multi-scale context to capture fine details and sharper object boundaries.

  • Where is semantic segmentation used? 

It’s used in self-driving cars (segmenting roads, pedestrians, etc.), medical imaging (medical image segmentation of tumors or organs), satellite image analysis (land cover segmentation), robotics vision, and any automated system that needs to understand an image at the pixel level. For example, in medical image processing, segmentation helps delineate tumors for diagnosis.

Over the last few years, modern computer vision has moved far beyond simple classification. Tasks like autonomous driving, robotics, and medical imaging require models that can parse visual input at a much finer granularity. In this article, we dive into semantic segmentation—a foundational technique for pixel-level understanding of images.

We’ll explore the following:

  1. What is Semantic Image Segmentation?
  2. How Semantic Segmentation Works
  3. Semantic Segmentation Models and Deep Learning
  4. Training Semantic Segmentation Models (Loss Functions)
  5. Real-world Applications of Semantic Segmentation

What is Semantic Image Segmentation?

Semantic image segmentation is the process of classifying each pixel in an image into a specific object category—such as “person,” “car,” or “tree.” The output is a segmentation map or mask, where each pixel is replaced with a class label and visualized using color coding.

This differs from instance and panoptic segmentation, which we will clarify shortly.

How Semantic Segmentation Works

Semantic segmentation assigns a class label to each pixel in an input image, producing a segmentation map that highlights different objects and regions. Segmentation relies on deep learning models, typically convolutional neural networks (CNNs), which analyze the image, extract features, and predict the class of each pixel based on its surrounding context.
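To make this concrete, here is a minimal inference sketch using PyTorch and torchvision's pretrained DeepLabV3 model (discussed later in this guide). The weights="DEFAULT" argument assumes a recent torchvision version, and the image path is a placeholder; the sketch is illustrative rather than prescriptive.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

# Load a pretrained DeepLabV3 model with a ResNet-50 backbone.
model = deeplabv3_resnet50(weights="DEFAULT")
model.eval()

# Standard ImageNet normalization expected by the pretrained backbone.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street_scene.jpg").convert("RGB")  # placeholder path
batch = preprocess(image).unsqueeze(0)                 # shape: [1, 3, H, W]

with torch.no_grad():
    logits = model(batch)["out"]         # shape: [1, num_classes, H, W]

# Per-pixel class prediction: the segmentation map.
segmentation_map = logits.argmax(dim=1)  # shape: [1, H, W], integer class IDs
```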

Figure 1: Semantic segmentation process using the SegNet model.

To train a semantic segmentation model, we need images and their corresponding segmentation masks (ground truth).  Human annotators create these masks manually by labeling every pixel using annotation tools. This process presents some challenges:

  • Time and Cost: Manually labeling each pixel in large datasets is slow and expensive, especially for detailed or high-resolution images.
  • Class Imbalance: Some classes may have many more pixels than others, leading to biased models.
  • Annotation Consistency: Different annotators may interpret object boundaries or ambiguous regions differently, leading to inconsistencies in the labels.
  • Class Ambiguity: Some pixels may belong to more than one class or have unclear boundaries.

Techniques like data augmentation expand the dataset to tackle these issues, while weighted loss functions help the model focus on rare classes. Using multiple annotators and combining their work can also improve mask quality.

💡 Pro Tip: Techniques like self-supervised learning pretrain models effectively with limited labeled data, reducing the need for manual annotation. Discover how self-supervised learning applies to computer vision.

Semantic Segmentation vs. Instance Segmentation vs. Panoptic Segmentation

There are three main types of image segmentation tasks in computer vision:

  • Semantic Segmentation: labels every pixel into a category, grouping all pixels of the same class together (no distinction between separate instances of that class).

  • Instance Segmentation: also produces pixel-wise masks, but distinguishes different instances of the same class (e.g., two people have separate masks).

  • Panoptic Segmentation: unifies the above, assigning every pixel a class label and ensuring that instances of “thing” classes (countable objects like people, cars) are separately labeled, while “stuff” classes (e.g. sky, road) are continuous regions.

Compared to object detection models, which output bounding boxes, segmentation provides finer localization. For instance, object detection might put a box around a dog, whereas semantic segmentation will outline the exact silhouette of the dog (all pixels belonging to the dog).

This means segmentation can capture fine details such as the exact shape of objects and sharper object boundaries, without the background pixels that a bounding box inevitably includes.

Similarly, unlike image classification which loses spatial information by collapsing an image into a single label, semantic segmentation preserves spatial context – we know which part of the image is which object. The table below summarizes these differences:

Table 1: Comparison of Various Computer Vision Tasks
| Task | Output | Granularity |
| --- | --- | --- |
| Image Classification | One label for the entire input image. Example: “This image contains a cat.” | Whole image (no spatial information). |
| Object Detection | Bounding box + label for each object. Example: “Cat at [x, y, width, height].” | Per object, coarse outline (box). |
| Semantic Segmentation | Pixel-wise class label map (all pixels classified). Example: segmentation mask highlighting all cat pixels. | Per pixel, grouped by class (no instance separation). |
| Instance Segmentation | Pixel-wise masks for each object instance + class labels. Example: a separate mask for each cat. | Per pixel, each object instance is separate. |
| Panoptic Segmentation | Unified pixel-wise map: every pixel has a class, with instances of “thing” classes uniquely labeled. Example: masks for each cat plus background classes like sky. | Per pixel, complete scene labeling (combines instances + stuff). |
| Image Captioning | A textual description of the image (no pixel or box output). | Whole image: a descriptive summary without localization or segmentation. |

Semantic Segmentation Models and Deep Learning

Modern semantic segmentation is dominated by deep learning algorithms – particularly CNN-based architectures that can learn rich feature representations. Below we outline the evolution of important segmentation models and how they address the challenges of pixel-wise prediction:

  • Fully Convolutional Network (FCN): FCNs replaced fully connected layers with convolutional ones for pixel-wise predictions, using an encoder-decoder pattern and skip connections.
  • Encoder-Decoder Architectures (U-Net and SegNet): U-Net is a symmetric encoder-decoder structure with skip connections, while SegNet uses pooling indices for efficient upsampling.
  • Dilated Convolutions and Multi-Scale Context (DeepLab series): DeepLab models utilize atrous convolutions and Atrous Spatial Pyramid Pooling (ASPP) for multi-scale context. DeepLabv3+ additionally includes a decoder for improved boundary segmentation.
  • Pyramid Scene Parsing Network (PSPNet): PSPNet uses a pyramid pooling module to aggregate global and local context.

Other Notable Architectures: HRNet keeps high-resolution representations for precise segmentation, while transformer-based models like MaskFormer use global attention mechanisms.
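To make the encoder-decoder idea concrete, below is a deliberately tiny U-Net-style sketch in PyTorch with two downsampling stages and skip connections via concatenation. It is an illustrative toy under assumed layer widths and a 21-class output, not a faithful reimplementation of any published architecture above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the basic encoder/decoder building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=21):
        super().__init__()
        self.enc1 = conv_block(in_channels, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)            # 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)             # 32 (upsampled) + 32 (skip)
        self.head = nn.Conv2d(32, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                    # [N, 32,  H,   W]
        e2 = self.enc2(self.pool(e1))        # [N, 64,  H/2, W/2]
        b = self.bottleneck(self.pool(e2))   # [N, 128, H/4, W/4]
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                 # [N, num_classes, H, W]

logits = TinyUNet()(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 21, 128, 128])
```

The skip connections carry fine spatial detail from the encoder directly to the decoder, which is what lets encoder-decoder models recover sharp object boundaries after downsampling.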

💡 Pro Tip: Feature representations are often captured as embeddings, which are crucial for understanding image content and similarity. Discover the importance of embeddings in deep learning models.

The table below summarizes semantic segmentation models and highlights their key architectural features.

Table 2: Comparison of Different Deep Learning Models
| Model & Year | Architecture Highlights | Key Features | Innovations |
| --- | --- | --- | --- |
| FCN (2015) | Fully convolutional version of VGG/AlexNet; upsampling via learned deconvolutions. | Pixel-wise prediction for any input image size; merges coarse and fine features. | First end-to-end segmentation CNN; introduced skip connections. |
| U-Net (2015) | Symmetric encoder-decoder (U-shape) with extensive skip connections. | Captures context and fine detail via contracting and expanding paths; high accuracy with limited data. | Excels in medical image segmentation, where annotated data is scarce. |
| SegNet (2015) | Encoder-decoder using pooling indices for upsampling. | Efficient upsampling with a lighter decoder; balances speed and accuracy. | Popular for road-scene segmentation despite fewer skip connections. |
| DeepLabv3+ (2018) | ResNet backbone with atrous convolutions; ASPP module; encoder-decoder. | Maintains resolution with dilated convolutions; captures multi-scale context. | ASPP plus a decoder yields sharper object boundaries in complex scenes. |
| PSPNet (2017) | ResNet-based encoder; pyramid pooling module on the final feature map. | Integrates global context; high accuracy for “stuff” segmentation. | Pyramid pooling module; achieves 85.4% mIoU on PASCAL VOC 2012. |
| MaskFormer (2021) | Transformer-based, query-driven mask prediction. | Global attention for long-range context; unifies semantic and instance segmentation. | Pushes state-of-the-art performance on major benchmarks. |


Training Semantic Segmentation Models (Loss Functions)

Training a semantic segmentation model requires teaching the network to predict the correct class for every pixel. This is typically done by minimizing a pixel-wise loss function over a dataset of images and their corresponding segmentation masks (the ground truth). It’s like training many small classifiers, one per pixel, that share the same CNN backbone for feature extraction.

The loss function measures the difference between the model's predictions and the ground truth, guiding the model to improve. Tools like Lightly Train can streamline this process by optimizing data selection and model training for efficient, high-quality results.

💡 Pro Tip: Advanced pre-training techniques like contrastive learning help models learn powerful representations that make downstream tasks like pixel classification more effective.

There are several loss functions and considerations unique to segmentation:

Pixel-wise Cross-Entropy Loss

This loss treats each pixel as a separate classification task: for every pixel, it compares the predicted class probabilities to the true label, takes the log loss of the correct class, and averages these per-pixel errors over the image. It works well in general but can falter under class imbalance.
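In PyTorch, this corresponds to applying nn.CrossEntropyLoss to logits of shape [N, C, H, W] against an integer label map of shape [N, H, W]. The sketch below uses random tensors; the class count and the ignore_index convention for unlabeled pixels are illustrative assumptions.

```python
import torch
import torch.nn as nn

num_classes = 21
logits = torch.randn(4, num_classes, 256, 256)           # model output: [N, C, H, W]
targets = torch.randint(0, num_classes, (4, 256, 256))   # ground-truth mask: [N, H, W]

# CrossEntropyLoss treats every pixel as an independent classification
# and averages the per-pixel log losses over the whole batch.
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 often marks "unlabeled" pixels
loss = criterion(logits, targets)
print(loss.item())
```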

Class Imbalance Remedies – Weighted and Focal Loss

Weighted cross-entropy assigns higher importance to rare classes to address class imbalance, while focal loss focuses training on hard-to-classify pixels by reducing the weight of easy examples.

Figure 2: Weighted and focal loss function.
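A hedged sketch of both remedies in PyTorch follows: weighted cross-entropy simply passes per-class weights to nn.CrossEntropyLoss, while the focal loss shown is a common minimal formulation (without the optional alpha term). The class weights, class count, and gamma value are illustrative, not taken from a real dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Weighted cross-entropy: per-class weights (e.g., inverse class frequency).
# These weight values are illustrative only.
class_weights = torch.tensor([0.2, 1.0, 5.0])   # rare classes get larger weights
weighted_ce = nn.CrossEntropyLoss(weight=class_weights)

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: down-weights well-classified pixels by (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-pixel log loss
    p_t = torch.exp(-ce)                                     # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()

logits = torch.randn(2, 3, 64, 64)          # [N, C, H, W]
targets = torch.randint(0, 3, (2, 64, 64))  # [N, H, W]
print(weighted_ce(logits, targets).item(), focal_loss(logits, targets).item())
```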

Dice Loss

Dice loss focuses on the overlap between the predicted and true masks. It uses the Dice coefficient, a value between 0 and 1 closely related to IoU, to measure how well the predicted label pixels overlap the ground truth. It works well for imbalanced datasets and improves performance on underrepresented classes.

Figure 3: Dice loss function.
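Below is a minimal soft Dice loss sketch for multi-class segmentation in PyTorch; the smoothing epsilon and the reduction over batch and spatial dimensions are common but not canonical choices.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, num_classes, eps=1e-6):
    """Soft multi-class Dice loss: 1 minus the mean Dice coefficient over classes."""
    probs = torch.softmax(logits, dim=1)                                    # [N, C, H, W]
    one_hot = F.one_hot(targets, num_classes).permute(0, 3, 1, 2).float()   # [N, C, H, W]
    dims = (0, 2, 3)                                      # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)  # per-class Dice in [0, 1]
    return 1.0 - dice.mean()

logits = torch.randn(2, 3, 64, 64)
targets = torch.randint(0, 3, (2, 64, 64))
print(dice_loss(logits, targets, num_classes=3).item())
```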

Other Losses

Advanced loss functions, such as Tversky loss and IoU-based losses, address specific challenges in segmentation tasks, including managing small object regions and emphasizing boundary accuracy.

Regularization and Augmentation

Techniques such as flipping, rotating, or scaling images create diverse training examples, improving robustness and helping the model handle variations in spatial layout. Regularization techniques such as dropout or weight decay prevent overfitting during training and help the model generalize to new data.
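A key practical detail is that geometric augmentations must be applied identically to the image and its mask so the labels stay aligned. The sketch below uses plain PyTorch tensor flips for illustration; dedicated libraries such as Albumentations or torchvision's v2 transforms offer richer, mask-aware pipelines.

```python
import torch

def augment(image, mask, p_flip=0.5):
    """Apply the same random flips to image and mask so pixel labels stay aligned."""
    # image: [3, H, W] float tensor, mask: [H, W] integer label map
    if torch.rand(1).item() < p_flip:
        image = torch.flip(image, dims=[-1])  # horizontal flip
        mask = torch.flip(mask, dims=[-1])
    if torch.rand(1).item() < p_flip:
        image = torch.flip(image, dims=[-2])  # vertical flip (if meaningful for the domain)
        mask = torch.flip(mask, dims=[-2])
    return image, mask

img, msk = augment(torch.rand(3, 256, 256), torch.randint(0, 21, (256, 256)))
```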

Applications of Semantic Segmentation

One of the reasons semantic segmentation is so actively researched is its wide range of real-world applications. Segmentation improves many systems by enabling an automated understanding of multiple objects and regions in images with pixel-level precision:

Autonomous Vehicles (Driving Scene Understanding)

Self-driving cars and advanced driver assistance systems use segmentation to parse road scenes. For example, segmentation models label pixels as "road," "pedestrian," or "vehicle" in real time, helping cars determine where the drivable road is, where lane markings are, and the exact shape of obstacles or pedestrians. This pixel-level spatial information supports safe maneuvering around obstacles and handles complex road shapes better than coarse object detection.

Medical Image Segmentation

Segmentation is used to outline anatomical structures or abnormalities in modalities like MRI, CT scans, ultrasound, or microscopy. For example, the U-Net architecture and encoder-decoder models are used to segment tumors, delineate organs, and perform segmentation of blood vessels and cells. Medical image segmentation can assist in diagnosis, treatment planning, and guiding surgeries, helping to identify structures to avoid.

Figure 4: Brain tumor segmentation in MRI scans.

Aerial and Satellite Imagery

Segmenting satellite images (remote sensing) is important for mapping and environmental monitoring. Tasks include land cover segmentation (water, forest, urban, and agricultural areas), road and building extraction, and damage assessment, such as segmenting buildings destroyed in a disaster. The resulting maps label what is on the ground and support monitoring of deforestation, urban growth, and more.

Figure 5: Geospatial building boundary detection from satellite imagery.

Robotics and Scene Understanding

Robots that operate in homes or warehouses use segmentation to understand their surroundings better. For instance, a domestic robot with a camera can use segmentation to distinguish between floors, walls, furniture, and people, which helps with navigation and object manipulation. In factories, robots can segment objects on a conveyor belt to determine where to pick them up. Segmentation can also be used in drone vision to identify landing zones or targets.

Figure 6: Semantic scene understanding.

Augmented Reality (AR) and Image/Post-processing

Semantic segmentation is used in many image manipulation and AR tasks. For example, consider the “portrait mode” on smartphone cameras, which blurs the background. It uses segmentation to separate the person (foreground) from the background. In video conferencing, virtual background effects use real-time segmentation to cut out the person from their background. 

Even in creative tools, segmentation models are working behind the scenes to provide a mask of the object the user wants to cut out. In AR, placing virtual objects requires understanding the surfaces and objects present. Some apps segment the user’s hand to enable realistic occlusion of virtual content behind it.

Figure 7: AR with deep learning for facility segmentation and depth prediction.

Agriculture

In agriculture, semantic segmentation differentiates crops from weeds, segments fruit on trees in drone imagery for yield estimation, and maps field regions for disease detection. Similarly, in document analysis, it separates text from backgrounds or layout regions, bringing pixel-wise accuracy to image processing.

Figure 8: Semantic image interpretation in agriculture.

LightlyOne in Action: Curating High-Impact Data for Semantic Segmentation

Even the best segmentation model will underperform if the training data isn’t representative, diverse, or properly balanced. In practice, datasets often suffer from class imbalance, annotation bottlenecks, or redundant samples that waste labeling effort and compute resources.

LightlyOne is designed to solve these challenges head-on. It applies intelligent data curation to identify the most informative and diverse subset of your unlabeled data—prioritizing samples that improve generalization and cover edge cases. For semantic segmentation, this means fewer irrelevant frames, better coverage of rare classes, and a reduced burden on manual labeling teams.

From class-aware sampling to embedding-based similarity filtering, LightlyOne integrates directly into vision pipelines to help teams:

  • Pre-select optimal images before annotation
  • Reduce dataset size without sacrificing performance
  • Continuously refine training data based on model feedback

Learn more here: https://www.lightly.ai/lightlyone
