LightlyTrain Introduces EoMT for Semantic Segmentation


EoMT (Encoder-only Mask Transformer) is now available in LightlyTrain, enabling efficient semantic segmentation with transformer encoders. Unlike traditional architectures, EoMT removes decoder modules and instead injects query tokens directly into the ViT encoder. This simplifies the model design and improves inference speed by up to 4× while maintaining strong accuracy.

Why EoMT

  • Encoder-only design with no pixel or transformer decoder; query tokens live inside the ViT itself (see the sketch after this list)

  • Supports semantic, instance, and panoptic segmentation under a single framework

  • Better throughput for real-world deployments

  • Compatible with multiple DINOv3 backbones (vits16, vitb16, vitl16, etc.)
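
To make the query-injection idea concrete, here is a toy sketch. It is not LightlyTrain's internal implementation, and names like TinyEoMT, queries_at, and num_queries are illustrative only: learnable query tokens are appended to the patch tokens partway through the encoder, the remaining self-attention blocks do the "decoding", and masks are read out as a dot product between query embeddings and patch embeddings.

import torch
import torch.nn as nn

class TinyEoMT(nn.Module):
    def __init__(self, dim=192, depth=6, queries_at=4, num_queries=16, num_classes=21):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.queries_at = queries_at  # encoder block where queries are injected
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens):  # (B, N, dim) patch tokens from a ViT
        x = patch_tokens
        for i, block in enumerate(self.blocks):
            if i == self.queries_at:
                # Queries join the token sequence inside the encoder itself,
                # so no separate transformer decoder is needed.
                x = torch.cat([self.queries.expand(x.size(0), -1, -1), x], dim=1)
            x = block(x)
        q = x[:, : self.queries.size(1)]   # query tokens
        p = x[:, self.queries.size(1) :]   # patch tokens
        class_logits = self.class_head(q)                    # (B, Q, C+1)
        mask_logits = self.mask_proj(q) @ p.transpose(1, 2)  # (B, Q, N)
        return class_logits, mask_logits

class_logits, mask_logits = TinyEoMT()(torch.randn(2, 196, 192))  # 14x14 patches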

You can train this model on your own dataset with just a few lines of code:

import lightly_train

if __name__ == "__main__":
    # Train a model.
    lightly_train.train_semantic_segmentation(
        out="out/my_experiment",
 model="dinov2/vitl14-eomt",
 data={...},
    )

    # Load the trained model.
    model = lightly_train.load_model_from_checkpoint(
        "out/my_experiment/checkpoints/last.ckpt"
    )

    # Run inference.
    masks = model.predict("path/to/image.jpg")
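
If you want to inspect the prediction, one possible continuation of the script is below. It assumes masks holds one class index per pixel as a tensor; check the LightlyTrain docs for the exact return type of predict.

    # Save the predicted mask as an image (assumption: `masks` is a
    # per-pixel class-index tensor; adjust to the actual return type).
    import numpy as np
    from PIL import Image

    Image.fromarray(masks.cpu().numpy().astype(np.uint8)).save("mask.png")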

Benchmarks & Performance Across Datasets

Here’s a snapshot of how EoMT with DINOv3 stacks up in LightlyTrain across standard segmentation datasets. Results were measured on an NVIDIA T4 GPU, with FPS reported at the listed input resolutions.

On ADE20K, for example, dinov3/vitl16-eomt reaches 59.1% mIoU, surpassing the DINOv2 EoMT baseline of 58.4%.

Why Use EoMT in LightlyTrain

  • Simplicity + efficiency: No decoders or complex adapters. Fewer modules mean faster inference and easier integration.
  • Unified architecture: Supports semantic, instance, and panoptic segmentation.
  • Performance + speed trade-off: With DINOv3 backbones, EoMT achieves competitive accuracy while maintaining strong throughput.
  • Seamless pipeline support: EoMT slots directly into the LightlyTrain workflow, so you can fine-tune your segmentation models with minimal boilerplate.

Try It Today

Ready to try it? Head over to the LightlyTrain docs to start experimenting:
👉 Semantic Segmentation — EoMT + DINOv3 docs

Dive deeper into how EoMT works and the architectural decisions behind it in our blog post: 👉 EoMT: Your ViT is Secretly an Image Segmentation Model
