Zero-Shot Learning: A Practical Guide
What is Zero-Shot Learning?
Zero-shot learning (ZSL) is a machine learning technique that enables models to recognize and classify objects, concepts, or categories they have never encountered during training, without needing any labeled examples of those unseen classes.
Zero-shot learning relies on knowledge transfer from seen to previously unseen classes, allowing the model to generalize beyond its training data.
Unlike traditional supervised learning, which requires extensive labeled data for every category, zero-shot learning leverages auxiliary information, such as semantic attributes, text descriptions, or relationships between classes, to bridge the gap between known (seen) and unknown (unseen) categories.

For example, if a model is trained to recognize cats and dogs but has never seen a zebra, it can still identify a zebra by using a description like "a horse-like animal with black and white stripes". The model connects this new description to its existing knowledge of animals, allowing it to infer the correct label even without direct examples.
Zero-Shot vs. One-Shot vs. Few-Shot: Comparison
Before diving deeper, it’s helpful to contrast zero-shot learning with its close cousins: one-shot learning and few-shot learning, as well as standard supervised learning.
These all fall under the umbrella of “n-shot learning,” where n refers to the number of examples of each new class provided during training.
- Supervised Learning (Traditional): Models need many labeled examples per class and can't recognize classes they haven't seen, e.g., a model trained on dogs and cats won’t recognize a bird.
- Few-Shot Learning (FSL): Models learn from a handful of labeled examples (like 5–10) per new class, using meta-learning or fine-tuning to adapt quickly.
- One-Shot Learning (OSL): A special case of few-shot learning where the model is given exactly one labeled example of the new class to learn from. For example, given one photo of a new species of flower, a one-shot model tries to recognize that flower in future images.
- Zero-Shot Learning (ZSL): No labeled examples of the target classes are given. The model must rely on external knowledge (attributes, textual descriptions, etc.) and what it learned from other classes to make the leap to unseen classes. It’s the most extreme case – analogous to identifying an animal species you’ve never seen based purely on reading a description of it.
This spectrum reflects a key tradeoff: more labeled examples per class generally improve accuracy, but they also raise the cost of adding new classes and offer no help when a category has no labeled data at all.
How Do Zero-Shot Learning Models Work?
First, here are some key terms you need to know:
- Seen Classes: The set of categories the model is trained on, with labeled examples available during training (e.g., cats, dogs, horses).
- Unseen Classes: Categories that are absent from the training data; the model must generalize to these at test time using indirect evidence.
- Auxiliary Information: Additional data that describes both seen and unseen classes, such as semantic attributes, textual descriptions, or word embeddings. This information enables the model to connect visual features to new categories it has never encountered.

At the heart of zero-shot learning is the idea of using indirect evidence to make predictions about unseen classes. Since the model has never been trained on the target class, we must answer: how can it recognize something it's never seen?
The solution is to represent classes in terms of a shared semantic space that connects unseen and seen classes through auxiliary information. This typically involves a two-step process:
Training (Learning General Representations)
During the training phase, the model learns to map visual features from seen classes into a shared embedding space where they can be associated with semantic representations.
As shown in the diagram, a pretrained embedding network (θ) processes labeled images from seen classes (cats, horses, dogs) through supervised learning. The key insight is that this embedding must capture transferable visual features rather than class-specific patterns.

The model simultaneously learns to encode auxiliary information (attributes, textual descriptions, or class prototypes) into the same semantic space using techniques like:
- Semantic embeddings that map both visual and textual modalities to a common representation, as done by this CVPR 2022 Workshop paper.

- Variational autoencoders that generate latent representations bridging visual and semantic domains, as done in this CVPR 2022 paper.

- Multi-attention mechanisms that discover discriminative visual parts guided by semantic descriptions, as done by the SGMA model.

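To make this concrete, below is a minimal sketch of the training step, assuming precomputed visual features from a frozen backbone and per-class attribute vectors. All names, dimensions, and the simple cross-entropy compatibility loss are illustrative choices, not any specific paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions: 2048-d visual features (e.g., from a frozen CNN)
# and 85-d class attribute vectors (AWA-style benchmarks use 85 attributes).
VIS_DIM, ATTR_DIM, EMB_DIM = 2048, 85, 512

class JointEmbedding(nn.Module):
    """Projects visual features and class semantics into one shared space."""
    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(VIS_DIM, EMB_DIM)     # theta: image features -> shared space
        self.semantic_proj = nn.Linear(ATTR_DIM, EMB_DIM)  # phi: class attributes -> shared space

    def forward(self, visual_feats, class_attrs):
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)   # (batch, EMB_DIM)
        s = F.normalize(self.semantic_proj(class_attrs), dim=-1)  # (num_classes, EMB_DIM)
        return v @ s.T  # compatibility scores: (batch, num_classes)

model = JointEmbedding()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(visual_feats, labels, seen_class_attrs):
    # Cross-entropy over compatibility scores pushes each image toward the
    # semantic embedding of its own (seen) class. In practice a temperature
    # scaling on the scores is often added to sharpen the softmax.
    scores = model(visual_feats, seen_class_attrs)
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The essential property is that both modalities land in the same space, so a class only needs a semantic vector, not labeled images, to be scored later.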
Inference (Transferring to Unseen Classes)
During inference, the magic happens through cross-modal alignment. When presented with an unseen zebra image, the pretrained embedding network extracts visual features without updating its parameters. Simultaneously, auxiliary information ("a horse-like animal with black and white stripes") gets encoded through a text encoder into the same semantic space.
A projection network then performs compatibility matching between these representations. Recent generalized zero-shot learning (GZSL) approaches use sophisticated techniques to:
- Distinguish between seen and unseen domains using Wasserstein distance and dual variational autoencoders.
- Generate synthetic examples of unseen classes using class-conditioned generative models, like the U-SAM model, which uses Stable Diffusion to generate samples for zero-shot segmentation.

- Calibrate confidence scores to prevent bias toward seen classes during classification, like the Deep Calibration Network (DCN).

The final prediction emerges from finding the unseen class with maximum compatibility in this shared semantic space, effectively enabling the model to "imagine" what a zebra looks like based on its learned understanding of horse-like features combined with stripe patterns.
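Continuing the illustrative sketch from the training section, generalized zero-shot inference can look like the following. The fixed penalty subtracted from seen-class scores is a common GZSL heuristic (often called calibrated stacking), shown here only as a simple stand-in for the calibration idea, not as the mechanism of DCN or the other papers above.

```python
import torch

@torch.no_grad()
def predict_gzsl(model, visual_feats, all_class_attrs, seen_mask, gamma=0.3):
    """
    visual_feats:    (batch, VIS_DIM) features of test images
    all_class_attrs: (num_seen + num_unseen, ATTR_DIM) attributes for every candidate class
    seen_mask:       boolean tensor marking which rows of all_class_attrs are seen classes
    gamma:           hand-tuned calibration factor penalizing seen-class scores
    """
    scores = model(visual_feats, all_class_attrs)        # compatibility in the shared space
    scores[:, seen_mask] = scores[:, seen_mask] - gamma  # reduce bias toward seen classes
    return scores.argmax(dim=1)                          # class with maximum compatibility
```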
How to Choose a Zero-Shot Learning Method for Your Use Case
The choice between zero-shot learning approaches depends on your specific requirements and constraints. Consider your auxiliary information type (attributes vs. textual descriptions), computational constraints, and whether you need generalized zero-shot learning (recognizing both seen and unseen classes simultaneously) to guide your method selection.
The broad classes of ZSL methods that are popular in the literature include:
1. Classifier-based
Classifier-based methods work best when you have rich semantic descriptions and need interpretable decision boundaries. These methods construct explicit classifiers for unseen classes using correspondence between visual features and semantic prototypes.

The U-SAM framework (CVPR 2025) is a strong example of a classifier-based ZSL method. It extends the Segment Anything Model (SAM) by integrating a semantic classifier that allows the model to recognize and segment novel object categories using only class names as input.
During training, U-SAM leverages synthetic images generated from text prompts or web-crawled images to provide semantic grounding for each class, bypassing the need for real labeled data. The model employs a multi-layer perceptron (MLP) classifier head that learns to map SAM’s generic mask embeddings to specific object categories, guided by these semantic cues.
What sets U-SAM apart is its ability to perform zero-shot segmentation: at inference, it requires only the class names to predict semantic masks for previously unseen categories and datasets. This approach demonstrates the power and flexibility of classifier-based ZSL, as U-SAM constructs explicit classifiers for new classes based on semantic information rather than visual examples.
The result is a system that achieves significant improvements over baselines in open-world segmentation tasks, highlighting the effectiveness of combining foundation models with semantic classifier heads for scalable zero-shot recognition.
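As an illustration of the classifier-based recipe (not U-SAM's actual implementation), the sketch below shows a head that turns class-name text embeddings into classifier weights, so adding an unseen category only requires encoding its name. The dimensions and the MLP shape are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassifierHead(nn.Module):
    """Builds per-class classifier weights from class-name embeddings, so new
    classes need only a name or description, never labeled images."""
    def __init__(self, feat_dim=256, text_dim=512):
        super().__init__()
        # MLP mapping a class's text embedding to a weight vector that
        # operates on the backbone's feature (e.g., mask) embeddings.
        self.weight_generator = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim)
        )

    def forward(self, feature_embeddings, class_text_embeddings):
        # feature_embeddings:    (num_regions, feat_dim), e.g., mask embeddings
        # class_text_embeddings: (num_classes, text_dim), e.g., encoded class names
        class_weights = F.normalize(self.weight_generator(class_text_embeddings), dim=-1)
        feats = F.normalize(feature_embeddings, dim=-1)
        return feats @ class_weights.T  # logits over classes, including unseen ones

# Usage idea: logits = head(region_embeddings, some_text_encoder(["zebra", "traffic cone"]))
# where some_text_encoder is any model producing 512-d class-name embeddings.
```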

2. Instance-based
Instance-based methods excel when you have strong visual similarities between seen and unseen classes, as they focus on matching individual instances rather than building explicit classifiers.
For example, SDGZSL introduces a novel method called Semantic Neighborhood Borrowing (SNB). Unlike projection-based or segmentation-focused methods, SNB operates purely at the semantic level by identifying and leveraging implicit relationships between seen and unseen class attributes.

The method constructs a semantic graph where nodes represent classes (both seen and unseen) and edges encode attribute similarity. For each unseen class, SNB identifies its k-nearest semantic neighbors among seen classes using attribute cosine similarity.
During training, the model borrows instance features from these neighboring seen classes to build a prototype for the unseen class, weighted by their semantic proximity. This allows the model to create pseudo-instances for unseen classes without synthetic feature generation or visual comparisons.

On the AWA2 benchmark, SNB achieved 72.1% accuracy on unseen classes (+8.3% over prior instance-borrowing methods) while maintaining 64.9% accuracy on seen classes in generalized ZSL settings. This demonstrates how instance-based methods can effectively leverage semantic relationships without requiring visual similarity assumptions or synthetic data pipelines.
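A minimal sketch of the borrowing step described above, assuming per-class attribute vectors and mean visual features for the seen classes are already available; the function name, the value of k, and the softmax weighting are illustrative choices.

```python
import torch
import torch.nn.functional as F

def borrow_prototype(unseen_attr, seen_attrs, seen_feature_means, k=5):
    """
    unseen_attr:        (ATTR_DIM,) attribute vector of one unseen class
    seen_attrs:         (num_seen, ATTR_DIM) attribute vectors of the seen classes
    seen_feature_means: (num_seen, FEAT_DIM) mean visual features per seen class
    Returns a pseudo-prototype for the unseen class: a similarity-weighted
    average of its k nearest seen neighbors in attribute space.
    """
    sims = F.cosine_similarity(unseen_attr.unsqueeze(0), seen_attrs, dim=1)  # (num_seen,)
    topk = sims.topk(k)
    weights = torch.softmax(topk.values, dim=0)  # weights from semantic proximity
    return (weights.unsqueeze(1) * seen_feature_means[topk.indices]).sum(dim=0)
```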
3. Generative Approaches
Generative approaches using GANs or VAEs are ideal when you need robust performance across diverse unseen classes, as they can synthesize training examples for better generalization.
For example, ZeroDiff (ICLR 2025) introduces a generative zero-shot learning framework designed to overcome the problem of spurious visual-semantic correlations, which become especially pronounced when training data is scarce. ZeroDiff employs a diffusion-based generative model that augments limited seen-class data by generating a wide variety of noised samples.
This diffusion augmentation helps prevent overfitting and enables the generative model to better capture the underlying distribution of visual features, even with just a fraction of the original training data.

A key innovation in ZeroDiff is its use of supervised-contrastive (SC) representations, which dynamically encode the distinctive characteristics of each training sample. These representations guide the generation of visual features, ensuring that the synthesized data remains semantically meaningful and diverse.

Additionally, ZeroDiff employs multiple discriminators, each evaluating generated features from a different perspective - predefined semantics, SC-based representations, and the diffusion process itself. These discriminators are trained using a Wasserstein-distance-based mutual learning approach, which further solidifies the alignment between visual and semantic information. As a result, ZeroDiff achieves robust zero-shot learning performance, outperforming previous generative methods and maintaining high accuracy even when only 10% of the seen-class training data is available.
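To show the general shape shared by generative ZSL methods (not ZeroDiff's diffusion pipeline specifically), the sketch below follows the common recipe: train a feature generator conditioned on class semantics on seen classes, then synthesize features for unseen classes and fit an ordinary classifier on them. The architecture and sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Generates synthetic visual features conditioned on class semantics."""
    def __init__(self, attr_dim=85, noise_dim=128, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim), nn.ReLU(),
        )

    def forward(self, class_attrs, noise):
        return self.net(torch.cat([class_attrs, noise], dim=1))

# After training the generator on seen classes (adversarially or otherwise),
# synthesize features for unseen classes and train a standard softmax classifier.
generator = ConditionalFeatureGenerator()
unseen_attrs = torch.rand(10, 85)                   # placeholder attribute vectors for 10 unseen classes
attrs = unseen_attrs.repeat_interleave(200, dim=0)  # 200 synthetic samples per unseen class
noise = torch.randn(attrs.size(0), 128)
synthetic_feats = generator(attrs, noise)           # features to train the final classifier on
```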
