Swin (Shifted Window Transformer) Explained: An Overview

Swin Transformer is a Vision Transformer with a shifted window attention mechanism, enabling efficient scaling to high-resolution images. It combines CNN-like multi-scale features with Transformer flexibility, powering state-of-the-art vision tasks.

Ideal For:
CV Engineers
Reading time:
10 mins
Category:
Models


Below we answer some common questions about Swin Transformers:

TL;DR
  • What is Swin Transformer?

Swin is a hierarchical vision transformer model that uses a shifted-window approach to compute self-attention in local windows for efficiency. It serves as a general-purpose backbone for image recognition tasks, achieving state-of-the-art results in image classification, object detection, and segmentation.

  • What does “Swin” stand for?

“Swin” stands for Shifted Window, referring to its core technique of shifting the attention windows between transformer layers. This shifted window scheme limits self-attention to non-overlapping local windows (for linear complexity) while still allowing cross-window connections in deeper layers.

  • Why was Swin Transformer developed?

It was created to overcome limitations of previous models. Traditional CNNs have a limited receptive field (local-only focus), and earlier vision transformers like ViT suffer from quadratic complexity on high-resolution images and lack multi-scale feature representation. Swin Transformer introduces a hierarchical multi-stage design and local windowed attention to handle high-resolution images and scale variation in vision tasks more efficiently.

  • Is Swin Transformer better than CNNs?

Swin Transformer has outperformed many CNN backbones on benchmark tasks by capturing long-range dependencies and multi-scale context. For example, it achieved 87.3% top-1 on ImageNet-1K and significantly improved object detection (e.g., +2.7 box AP over the previous state of the art on COCO). However, CNNs may still excel in low-data or low-compute scenarios due to their strong inductive biases.

  • How does Swin compare to other vision Transformers (like ViT)? 

Unlike ViT, which applies global self-attention (all patches attend to each other, O(n²) complexity), Swin uses window-based self-attention (limiting attention to local regions for linear scaling). Swin’s hierarchical architecture also produces multi-scale feature maps (like CNNs), whereas ViT yields a single-scale representation. This makes Swin more efficient and effective for tasks like detection and segmentation that require both fine and coarse details.

Introduction

Vision Transformers (ViTs), first introduced in 2020, extend the sequence-to-sequence modeling of the original Transformer architecture, originally developed for NLP tasks, to images.

ViTs operate in a similar fashion, feeding a sequence of embedded image patches into a standard Transformer encoder. Since then, ViTs have become the main competitor to CNN (Convolutional Neural Network)-based networks built on backbones such as ResNet or EfficientNet.

Figure 1: Swin-Transformer.

Training these ViTs, however, is a different ball game altogether.

Although they demonstrate performance close to or better than CNNs, they often require large-scale pretraining on image datasets such as ImageNet-21K or JFT-300M, each containing millions of images, to work well.

This drawback has primarily driven the rise of several ViT variants, such as Swin, DeiT, and many others, which try to improve efficiency and effectiveness from various angles.

Keeping these various limitations and aspects in mind, in this blog we will cover the following topics in detail:

1. What are Swin Transformers
2. Swin’s architecture and core features
3. Swin Transformer vs. CNN backbones
4. Real-world applications of Swin Transformers

While Swin and similar architectures reduce the compute burden of ViTs, pretraining still plays a critical role in unlocking their full potential. That’s where Lightly comes in:

  • LightlyTrain: Pretrain and fine-tune Swin Transformer models on curated, domain-specific data to accelerate training and improve downstream performance.

  • LightlyOne: Select the most diverse and informative samples to maximize learning efficiency with less data.

Together, they help you train powerful vision transformers, without relying on massive generic datasets like ImageNet-21K.

See Lightly in Action

Curate data, train foundation models, deploy on edge today.

Book a Demo

Background: The Evolution of Vision Transformers

Learning multi-scale feature representations has always been one of the major goals for improving performance in downstream vision tasks such as classification, detection, and segmentation.

So, not surprisingly, there have been attempts to learn such hierarchical representations using modified versions of CNNs (multi-branch architectures such as Big-Little-Net) or two-stream architectures such as SlowFast, where each branch encodes the same video at a different frame rate and spatial resolution. These simple but powerful feature-aggregation techniques are what powered the popularity of CNNs and explain their widespread applicability.

Figure 2: SlowFast’s Two-Stream Architecture.

Therefore, unsurprisingly, these concepts were soon transferred to ViTs and their variants as well. Before that, let’s do a quick recap of how the attention mechanism in ViTs works, taking in an image and producing an output.

Figure 3: Mathematical Formulation of ViT.

From Figure 3, we can see that ViTs inherently capture global interactions between all image tokens, but this leads to the massive compute cost of self-attention: the complexity is quadratic in the length of the input sequence.

This isotropic, single-scale structure further limits the ability of ViTs to perform tasks that require high-resolution detail, such as super-resolution, tracking, or object detection.

Enter the Swin Transformer, which addresses these two major pain points: (a) it bounds the quadratic cost of attention by restricting self-attention to non-overlapping windows, and (b) it builds a feature pyramid via patch merging, enabling effective coarse-to-fine reasoning and overcoming the absence of hierarchical features in ViTs.

Below is a simple step-by-step view of how Swin works (a minimal code sketch follows the list):

  • Start with the patch embeddings (e.g., a 4×4 stride) to form tokens.
  • Apply W-MSA within M×M windows, which enables local interactions.
  • The next block applies SW-MSA (cyclically shifting tokens), so that windows straddle the previous boundaries, creating cross-window links.
  • After a couple of blocks, merge the patches (downsample spatially, increase the channel dimension) and repeat at the next stage.
  • Alternate W-MSA and SW-MSA up the pyramid and add a downstream head (an MLP for classification, an FPN for detection, and so on).
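
To make the flow concrete, here is a minimal, runnable PyTorch walkthrough of these steps (an illustrative sketch, not the official implementation; the 96-channel width and 4×4 patch embedding follow the commonly cited Swin-T configuration, and the patch_merging helper is our own simplified stand-in):

```python
import torch
import torch.nn as nn

B, H, W = 1, 224, 224
x = torch.randn(B, 3, H, W)

# Step 1: patch embedding with a 4x4 stride -> (B, 56*56, 96) tokens
patch_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)
tokens = patch_embed(x).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 3136, 96])

# Steps 2-3: W-MSA / SW-MSA blocks keep the token count and channel width fixed;
# attention runs inside 7x7 windows, and the shifted block rolls the grid by 3.

# Step 4: patch merging = concatenate each 2x2 neighborhood, then project 4C -> 2C
def patch_merging(t, h, w, dim):
    t = t.view(-1, h, w, dim)
    t = torch.cat([t[:, 0::2, 0::2], t[:, 1::2, 0::2],
                   t[:, 0::2, 1::2], t[:, 1::2, 1::2]], dim=-1)   # (B, h/2, w/2, 4C)
    return nn.Linear(4 * dim, 2 * dim)(t).flatten(1, 2)           # (B, h*w/4, 2C)

tokens = patch_merging(tokens, 56, 56, 96)
print(tokens.shape)  # torch.Size([1, 784, 192])
```

Stacking four such stages yields the familiar 56 → 28 → 14 → 7 resolution pyramid with 96 → 192 → 384 → 768 channels in Swin-T.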

Next, we will get into the details of how these blocks inside the Swin architecture actually work.

Swin Transformer Architecture

We can think of Swin as having four stages in its architecture. 

It begins with a low channel width and the highest spatial resolution; each subsequent stage halves H×W via patch merging and doubles the number of channels.

Figure 4: Swin-Transformer Architecture.

Attention is then computed within fixed M×M windows (e.g., M = 7 for 224×224 inputs). Tokens do not attend outside their windows in this block, bounding the cost to O(HW × M² × d) and avoiding global quadratic growth. To avoid isolated windows, the next block shifts the window grid (by ⌊M/2⌋) using cyclic shifts. Attention is again computed locally, but because windows now straddle the previous boundaries, tokens exchange information across prior windows.

With a fixed window size M, the attention cost scales roughly linearly with the image area (H×W). Alternating W-MSA and SW-MSA blocks expands the effective receptive field; stacking a few such blocks gives nearly global context without going to O((H×W)²).
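
As a quick back-of-the-envelope check of that scaling (plain Python arithmetic for stage 1 of a 224×224 input):

```python
# Stage 1 of Swin on a 224x224 input: 4x4 patches give a 56x56 token grid.
H = W = 56
M = 7                                   # window size
tokens = H * W                          # 3136 tokens

global_pairs = tokens ** 2              # ViT-style global attention
windowed_pairs = tokens * M ** 2        # W-MSA: each token attends within its 7x7 window

print(global_pairs, windowed_pairs)     # 9834496 153664
print(global_pairs // windowed_pairs)   # 64x fewer attention pairs
```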

Swin also adds a learnable relative position bias per attention head, indexed by the relative displacement between tokens within a window. This improves translation handling and stabilizes learning across various resolutions.
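
As a sketch of how that bias can be wired up (it mirrors the indexing trick used in public Swin implementations, but is our own simplified illustration):

```python
import torch

M = 7                                                   # window size
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij"))   # (2, M, M) grid coordinates
coords = coords.flatten(1)                              # (2, M*M)

rel = coords[:, :, None] - coords[:, None, :]           # pairwise displacements
rel = rel.permute(1, 2, 0)                              # (M*M, M*M, 2)
rel += M - 1                                            # shift each axis to [0, 2M-2]
rel[:, :, 0] *= 2 * M - 1                               # unique id per (dy, dx) pair
index = rel.sum(-1)                                     # (M*M, M*M) lookup indices

# Each head owns a learnable table with (2M-1)^2 entries; bias = table[index]
# is added to the attention logits inside every window.
num_heads = 3
table = torch.zeros((2 * M - 1) ** 2, num_heads)
bias = table[index.view(-1)].view(M * M, M * M, num_heads)
print(bias.shape)  # torch.Size([49, 49, 3])
```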

Figure 5: Patching Operation in Images.

Now, let’s dive deep into the two main blocks of Swin, the W-MSA block and the SW-MSA block, and see how they leverage self-attention to aggregate global information in an efficient manner.

W-MSA

Figure 6: SWIN’s windowed self-attention mechanism.

We can break down W-MSA (Window-based Multi-head Self-Attention) broadly into these four concepts:

Local attention by construction
W-MSA splits the feature map into non-overlapping M×M windows and applies standard multi-head self-attention inside each window. Tokens only interact with others in the same window (a minimal code sketch follows this list).

Near-linear scaling with image size
With fixed M, the attention cost becomes O(HW·M²) instead of O((HW)²). Implementations batch windows by reshaping to (B · #windows, M², C), enabling high GPU utilization and cache locality.

Built-in spatial bias without absolutes
Each head uses a learnable relative position bias indexed by token displacement within the window. This preserves translation handling and transfers cleanly across input resolutions without absolute positional embeddings.

Receptive field behavior
A single W-MSA block models rich local context only. Global communication emerges when W-MSA is alternated with shifted windows in the next block (SW-MSA), expanding the effective receptive field while keeping the compute bounded.
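
Here is a minimal sketch of the first two points (our own illustration; it uses nn.MultiheadAttention as a stand-in for the real windowed-attention module, which additionally folds in the relative position bias):

```python
import torch
import torch.nn as nn

def window_partition(x, M):
    """(B, H, W, C) -> (B * num_windows, M*M, C) with non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def window_reverse(windows, M, B, H, W):
    """Inverse of window_partition: (B*num_windows, M*M, C) -> (B, H, W, C)."""
    C = windows.shape[-1]
    x = windows.view(B, H // M, W // M, M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

B, H, W, C, M = 2, 56, 56, 96, 7
feat = torch.randn(B, H, W, C)

windows = window_partition(feat, M)            # (2*64, 49, 96): 64 windows per image
attn = nn.MultiheadAttention(C, num_heads=3, batch_first=True)
out, _ = attn(windows, windows, windows)       # attention never crosses a window
feat = window_reverse(out, M, B, H, W)         # back to (2, 56, 56, 96)
print(windows.shape, feat.shape)
```

Batching all windows along the first dimension is what keeps GPU utilization high: the attention kernel only ever sees fixed-length sequences of M² = 49 tokens.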

SW-MSA 

The W-MSA block is responsible for modeling and preserving the local features present in the image. The SW-MSA block, on the other hand, works in tandem to capture the global context of the image by connecting the otherwise disjoint windows.

Figure 7: Intra and Inter self-attention for cross-window communication.

Put simply, it can be broken down into the following points:
1) Cross-window communication via cyclic shift
SW-MSA rolls the feature map by s (typically s = ⌊M/2⌋) before window partitioning, runs windowed attention inside those shifted windows, then rolls back. Windows now straddle the prior boundaries, letting tokens mix across neighboring windows (see the code sketch after this list).

2) Attention masking for correctness
The cyclic shift wraps tokens from opposite edges; a blockwise attention mask is added to the logits to prevent attending across these artificial wrap boundaries. This preserves local structure while enabling real cross-window interactions.

3) Expanded receptive field without quadratic cost
Alternating W-MSA and SW-MSA connects adjacent windows, so information propagates beyond a single window in just two blocks. With fixed M, complexity remains O(HW·M²), but the effective context grows rapidly across layers.
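
Below is a sketch of the cyclic shift and the corresponding attention mask (illustrative only; real implementations typically use a large negative constant such as -100 instead of -inf and fold the mask into the windowed attention call):

```python
import torch

def window_partition(x, M):                 # same helper as in the W-MSA sketch
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

M, s = 7, 3                                 # window size and shift (s = M // 2)
H, W = 56, 56
feat = torch.randn(1, H, W, 96)

# 1) Cyclic shift so that the new windows straddle the old window boundaries.
shifted = torch.roll(feat, shifts=(-s, -s), dims=(1, 2))

# 2) Label each position by the region it came from; tokens from different
#    regions that end up in the same window must not attend to each other.
region = torch.zeros(1, H, W, 1)
cnt = 0
for hs in (slice(0, -M), slice(-M, -s), slice(-s, None)):
    for ws in (slice(0, -M), slice(-M, -s), slice(-s, None)):
        region[:, hs, ws, :] = cnt
        cnt += 1

region_win = window_partition(region, M).squeeze(-1)              # (64, 49)
mask = region_win[:, None, :] - region_win[:, :, None]            # (64, 49, 49)
mask = mask.masked_fill(mask != 0, float("-inf"))                 # block wrap-around pairs

# 3) Run windowed attention on `shifted` with `mask` added to the logits,
#    then undo the shift with torch.roll(out, shifts=(s, s), dims=(1, 2)).
print(mask.shape)  # torch.Size([64, 49, 49])
```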

Efficient Computation Using These Two Alternating Blocks

When Swin alternates between W-MSA and SW-MSA, we get efficient local mixing plus fast cross-window propagation without ever paying the global O((H×W)²) attention cost. A W-MSA block restricts interaction to fixed M×M windows; the following SW-MSA block cyclically shifts by s = ⌊M/2⌋, partitions again, and attends with a mask, so tokens now exchange information across the previous window boundaries. Stacking a few such pairs quickly expands the effective receptive field to near-global, while complexity remains O(HW·M²) for fixed M.

In practice, the 4-stage pyramid (with patch merging between stages) ensures both fine features at high resolution and semantic features at low resolution, making Swin a drop-in backbone for dense tasks (FPN, UPerNet).

Now that we are familiar with the core architecture of the Swin Transformer, the significance of its W-MSA and SW-MSA blocks, and how they work in an alternating fashion, we can explore its benefits across a host of domains: downstream vision tasks, speed and efficiency, and generalizability.

Why Swin Transformer? Core Features and Benefits

The main strength of the Swin Transformer lies in the fact that it combines large effective receptive fields with efficiency and speed.

This means it can be adopted for a wide variety of downstream computer vision tasks that were previously solved using CNN-based backbone architectures.

We discuss some of the main applications where Swin demonstrably improves task performance while delivering state-of-the-art results. Apart from strong performance, Swin’s scalability and adaptability across tasks is another impressive feat compared to CNN-based architectures, where overfitting gradually becomes a problem.

Figure 8: Swin vs ViT for downstream tasks.

Adapting transformers for vision tasks: Coming to the core features of Swin, its windowing imposes a locality prior akin to convolutions while retaining global context through content-adaptive mixing between the two block types.

The hierarchical pyramid supplies the multi-scale features ViTs initially lacked, aligning with downstream tasks such as detection and segmentation, as shown in Figure 8. The pyramid-like feature extraction process, combined with a mechanism for applying attention across all non-overlapping windows, essentially combines the best of both worlds: the strong feature-invariance properties of CNNs and the sequence-attention modeling capabilities of Transformer-based backbones.

Figure 9: ImageNet-1K classification and COCO object detection performance.

State-of-the-art performance: Further, the large Swin variant shows strong ImageNet-1K top-1 accuracy and substantially boosts COCO AP and ADE20K mIoU when paired with standard heads, which demonstrates Swin’s ability to be adapted to existing pipelines in a post-hoc fashion.

It also shows strong performance on segmentation tasks (on the Cityscapes dataset, for example), underscoring what a generalizable backbone it is.

Figure 10: Performance using ViT backbone across multiple image resolutions.

High-resolution and variability of scales: Windowed attention keeps compute bounded as input resolution scales, making megapixel images feasible. The pyramid captures small and large objects by combining features across stages. 

This is illustrated in Figure 10, which, although not specifically targeted at Swin Transformers, makes the case that transformer-based architectures can easily handle images up to 1024×1024, which is enough for most modern encoders to process and perform downstream tasks.

Efficiency and Scalability: The shifted-window scheme proposed in Swin is quite hardware-friendly (fixed-size key sets per window) and avoids the high latency of techniques like sliding-window attention. Mixed precision and compile-time fusion (e.g., FlashAttention-style kernels) help reduce memory and latency in modern stacks.
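
As a hedged sketch of what that looks like in a modern PyTorch 2.x stack (assuming timm with pretrained weights and a CUDA device are available):

```python
import torch
import timm

model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True).cuda().eval()
model = torch.compile(model)                       # PyTorch 2.x graph capture / kernel fusion

x = torch.randn(8, 3, 224, 224, device="cuda")
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    logits = model(x)                              # (8, 1000) ImageNet-1K logits

print(logits.shape)
```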

Comparison: Swin Transformer vs. CNNs vs. Other Vision Transformers

Here’s the comparison of CNN vs ViT vs Swin Transformer.

Table 1: CNN vs ViT vs Swin Transformer.
Each aspect below compares Convolutional Neural Networks (CNNs), the Vision Transformer (ViT), and the Swin Transformer (a hierarchical ViT).

Architecture Type
  • CNN: Stacked convolution layers + pooling; strong spatial locality bias.
  • ViT: Transformer encoder layers applied to image patches; minimal inductive bias.
  • Swin: Transformer encoder layers with window-based attention + a patch-merging hierarchy; combines local attention with multi-scale structure.

Receptive Field and Context
  • CNN: Local receptive fields per layer; global context only after many layers (limited initial global understanding).
  • ViT: Global self-attention from the first layer (every patch attends to every other), providing immediate global context.
  • Swin: Local self-attention within windows (each layer has a limited view), but shifted windows ensure gradual global context accumulation; achieves a full-image receptive field after a few layers.

Computational Complexity
  • CNN: Approximately linear in image size per layer (each convolution covers a constant-size neighborhood). Scales well to larger images, though deep CNNs can be heavy.
  • ViT: Quadratic in the number of patches (global attention), so the cost is high for large images. ViT requires extensive computation or reduced resolution for big inputs.
  • Swin: Linear in image size due to windowed attention. Much more efficient on high-resolution images than ViT; similar order of complexity to CNNs (with small constant factors for attention within windows).

Feature Hierarchy
  • CNN: Yes: naturally multi-scale (via pooling/strides). Produces a pyramid of feature maps (e.g., at 1/2, 1/4, 1/8 resolution, etc.).
  • ViT: No inherent hierarchy: the original ViT maintains one sequence of patches at a single scale. Lacks multi-scale features unless added separately.
  • Swin: Yes: hierarchical (multi-stage patch merging). Produces multi-scale feature maps like CNNs, enabling detection/segmentation use cases out of the box.

Inductive Bias
  • CNN: High inductive bias: locality and translation invariance encoded by design. Good for generalization with less data, but may miss long-range dependencies unless explicitly engineered.
  • ViT: Low inductive bias: learns spatial relationships from data. Needs more data to train effectively, but in theory can learn any relationship, unconstrained by locality, at the cost of data efficiency.
  • Swin: Moderate inductive bias: windows encourage locality and the hierarchy imposes structure. Still flexible like a transformer, but with biases that help training on limited data and improve transfer across tasks.

Strengths
  • CNN: Efficient on smaller images or tasks with mostly local features. Well-established, with many variations (ResNet, EfficientNet) and optimized implementations (fast convolutions on hardware). Strong performance when large labeled datasets are not available (benefits from its bias).
  • ViT: Captures global context and long-range interactions naturally. Very high representational capacity (can outperform CNNs given enough data). Simpler, homogeneous architecture that scales to very large models.
  • Swin: Excellent all-round performance across vision tasks; state-of-the-art results in classification, detection, and segmentation (as of 2021). Scales to high-resolution inputs efficiently (windowing avoids the quadratic blowup). Provides multi-scale features for dense tasks, something ViT lacked. Retains some CNN-like properties (locality) while leveraging transformer flexibility.

Weaknesses
  • CNN: Limited ability to model long-range dependencies (requires deeper networks or added attention mechanisms). Architecture design can be complex (channels, layers, etc. need tuning per task). May be outperformed by transformers on very large-scale data.
  • ViT: Computationally heavy for large images due to global self-attention. Requires large datasets or pretraining to reach full potential (less effective in low-data regimes). Single-scale feature output; not directly ideal for detection/segmentation without modification (e.g., feature pyramid networks).
  • Swin: More complex to implement than standard CNNs (requires window partitioning and merging logic). Global interaction is slightly less straightforward; some global relationships are only captured after multiple layers (not instant like ViT’s global attention). Like all transformers, it can still be data-hungry (benefits from large-scale pretraining on ImageNet-22K or similar).

Applications of Swin Transformer

Finally, let’s take a look at some of the most common use cases for Swin Transformers.

Open-Vocabulary / Long-Tailed Detection 

Detic integrates image-level tags with a standard detector and uses Swin-B as the backbone to scale to long-tail vocabularies (e.g., LVIS). With 896×896 inputs and a Swin-B backbone, Detic reports strong mask mAP across all classes, and explicitly breaks out rare/common/frequent categories, demonstrating that hierarchical attention + large-vocab supervision mitigates tail underperformance without quadratic attention costs. 

This shows how Swin is used in a long-tailed regime, beyond the “generic” detection tasks in which it already performs strongly.

Figure 11: Long-Tail Detection using Swin as Backbone

Scene Graph Generation

Learning to Generate Language-Supervised and Open-Vocabulary Scene Graph (VS3) tackles SGG under language supervision and open-vocabulary constraints. 

Figure 12: Scene Graph generation with Swin.

The model uses Swin backbones (Swin-T → Swin-L) to produce multi-scale visual tokens that fuse with text via cross-attention; simply upgrading from Swin-T to Swin-L yields notable recall gains over prior art, highlighting the benefit of Swin’s hierarchical receptive fields for relation understanding beyond box-level semantics.

Image Restoration/Super-Resolution

SwinIR: Image Restoration Using Swin Transformer refactors Swin into residual Swin Transformer blocks (RSTB) for SR, denoising, and JPEG artifact removal. The architecture keeps windowed + shifted attention for efficiency while stacking deep Swin layers for long-range dependency, reporting up to ~0.14–0.45 dB PSNR improvements over prior SOTA with fewer parameters in several tracks—an illustrative low-level vision use where hierarchical attention shines without CNN inductive biases. 

Figure 13: Image Super-resolution with Swin.

Challenges and Future Directions

Although one of Swin’s strongest points is reducing the computational cost to roughly linear, memory can still grow with many windows and stages. Further, global interactions are not instantaneous (unlike ViT’s global MSA) and require a few alternating blocks. Window size and shift also require significant tuning in some cases.

SwinV2 addresses some of these challenges by scaling up parameters and introducing stabilizing changes (e.g., post-norm residuals with cosine attention) and improved pretraining objectives (masked image modeling, MIM). Multi-modal extensions (vision + text) and spatiotemporal windows for video are active areas.

Its compatibility with large vision-language models is also an open area for exploration.

Implementation and Resources

  • Official implementation: Microsoft has made the official implementation of Swin publicly available: https://github.com/microsoft/Swin-Transformer 
  • Hugging Face and other libraries: Pretrained weights are widely available. timm exposes many Swin variants; Hugging Face transformers provide SwinForImageClassification/SwinModel APIs, plus Swinv2* models.
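
For example, here is a quick way to browse and load Swin variants from timm (flag support such as features_only can vary with the timm version, so treat this as a sketch):

```python
import torch
import timm

print(timm.list_models("swin*")[:5])               # available Swin / SwinV2 variants

# features_only exposes the multi-scale stage outputs, handy for detection heads.
backbone = timm.create_model(
    "swin_tiny_patch4_window7_224", pretrained=True, features_only=True
)
with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))
for f in feats:
    print(f.shape)                                  # one feature map per pyramid stage
```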

We also provide some simple code implementations for the concepts discussed above.

Simple Window Partitioning 

Below is a code snippet that loads the Swin-Tiny transformer, implements simple window partitioning on an input image of size H×W with M as the default window size, and then applies the cyclic-shift logic across these windows.

Figure 14: Window and cyclic shift implementation.
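
Since the snippet above is embedded as an image, here is a compact, runnable approximation of the same windowing and cyclic-shift logic (our own sketch, not necessarily identical to the figure; M = 7 matches Swin-Tiny’s default window size):

```python
import torch

def to_windows(x, M):
    """Partition (B, H, W, C) into non-overlapping MxM windows: (B*nW, M*M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

M = 7
s = M // 2                                           # cyclic shift of 3
feat = torch.randn(1, 56, 56, 96)                    # stage-1 feature map

plain = to_windows(feat, M)                          # regular W-MSA windows
shifted = to_windows(torch.roll(feat, shifts=(-s, -s), dims=(1, 2)), M)

print(plain.shape, shifted.shape)                    # both torch.Size([64, 49, 96])
```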

Image Classification Code with Swin-Tiny model

Below is a simple code snippet that loads the Swin-Tiny model using the Hugging Face transformers package, runs a forward pass over a pre-loaded image, and returns the output predictions.

Figure 15: Classification example with swin-tiny.
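
A sketch of what such a snippet can look like with the Hugging Face transformers API (the checkpoint name and local image path are assumptions):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SwinForImageClassification

ckpt = "microsoft/swin-tiny-patch4-window7-224"      # assumed public checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
model = SwinForImageClassification.from_pretrained(ckpt).eval()

image = Image.open("example.jpg").convert("RGB")     # any local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                  # (1, 1000) ImageNet-1K logits

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])
```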

Conclusion

Swin Transformer marries transformer flexibility with CNN-like hierarchy via efficient windowed attention and shifted windows. 

In practice it serves as a strong, general-purpose backbone that scales to high resolutions and dense tasks while remaining compatible with modern training stacks.

Get Started with Lightly

Talk to Lightly’s computer vision team about your use case.
Book a Demo

