LightlyTrain Introduces EoMT for Semantic Segmentation


EoMT (Encoder-only Mask Transformer) is now available in LightlyTrain, enabling efficient semantic segmentation with transformer encoders. Unlike traditional architectures, EoMT removes decoder modules and instead injects query tokens directly into the ViT encoder. This simplifies the model design and improves inference speed by up to 4× while maintaining strong accuracy.

Why EoMT

  • Encoder-only design with no pixel or transformer decoder; query tokens live inside the ViT itself (see the sketch after this list)

  • Supports semantic, instance, and panoptic segmentation under a single framework

  • Better throughput for real-world deployments

  • Compatible with multiple DINOv3 backbones (vits16, vitb16, vitl16, etc.)
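
To make the query-injection idea concrete, here is a toy sketch. It is not LightlyTrain's internal implementation, and names like TinyEoMT, queries_at, and num_queries are illustrative only: learnable query tokens are appended to the patch tokens partway through the encoder, the remaining self-attention blocks do the "decoding", and masks are read out as a dot product between query embeddings and patch embeddings.

import torch
import torch.nn as nn

class TinyEoMT(nn.Module):
    def __init__(self, dim=192, depth=6, queries_at=4, num_queries=16, num_classes=21):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.queries_at = queries_at  # encoder block where queries are injected
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens):  # (B, N, dim) patch tokens from a ViT
        x = patch_tokens
        for i, block in enumerate(self.blocks):
            if i == self.queries_at:
                # Queries join the token sequence inside the encoder itself,
                # so no separate transformer decoder is needed.
                x = torch.cat([self.queries.expand(x.size(0), -1, -1), x], dim=1)
            x = block(x)
        q = x[:, : self.queries.size(1)]   # query tokens
        p = x[:, self.queries.size(1) :]   # patch tokens
        class_logits = self.class_head(q)                    # (B, Q, C+1)
        mask_logits = self.mask_proj(q) @ p.transpose(1, 2)  # (B, Q, N)
        return class_logits, mask_logits

class_logits, mask_logits = TinyEoMT()(torch.randn(2, 196, 192))  # 14x14 patches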

You can train this model on your own dataset with just a few lines of code:

import lightly_train

if __name__ == "__main__":
    # Train a model.
    lightly_train.train_semantic_segmentation(
        out="out/my_experiment",
 model="dinov2/vitl14-eomt",
 data={...},
    )

    # Load the trained model.
    model = lightly_train.load_model_from_checkpoint(
        "out/my_experiment/checkpoints/last.ckpt"
    )

    # Run inference.
    masks = model.predict("path/to/image.jpg")
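
If you want to inspect the prediction, one possible continuation of the script is below. It assumes masks holds one class index per pixel as a tensor; check the LightlyTrain docs for the exact return type of predict.

    # Save the predicted mask as an image (assumption: `masks` is a
    # per-pixel class-index tensor; adjust to the actual return type).
    import numpy as np
    from PIL import Image

    Image.fromarray(masks.cpu().numpy().astype(np.uint8)).save("mask.png")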

Benchmarks & Performance Across Datasets

Here’s a snapshot of how EoMT with DINOv3 stacks up in LightlyTrain across standard segmentation datasets. Results were measured on an NVIDIA T4 GPU, with FPS reported at the listed input resolutions.

On ADE20K, for example, dinov3/vitl16-eomt reaches 59.1% mIoU, surpassing the DINOv2 EoMT baseline of 58.4%.

Why Use EoMT in LightlyTrain

  • Simplicity + efficiency: No decoders or complex adapters. Fewer modules mean faster inference and easier integration.
  • Unified architecture: Supports semantic, instance, and panoptic segmentation.
  • Performance + speed trade-off: With DINOv3 backbones, EoMT achieves competitive accuracy while maintaining strong throughput.
  • Seamless pipeline support: EoMT slots directly into the LightlyTrain workflow, so you can fine-tune your segmentation models with minimal boilerplate.

Try It Today

Ready to try it? Head over to the LightlyTrain docs to start experimenting:
👉 Semantic Segmentation — EoMT + DINOv3 docs

Dive deeper into how EoMT works and the architectural decisions behind it in our blog post: 👉 EoMT: Your ViT is Secretly an Image Segmentation Model
