Zero-Shot Learning: A Practical Guide
What is Zero-Shot Learning?
Zero-shot learning (ZSL) is a machine learning technique that enables models to recognize and classify objects, concepts, or categories they have never encountered during training, without needing any labeled examples of those unseen classes.
Zero-shot learning relies on knowledge transfer from seen to previously unseen classes, allowing the model to generalize beyond its training data.
Unlike traditional supervised learning, which requires extensive labeled data for every category, zero-shot learning leverages auxiliary information, such as semantic attributes, text descriptions, or relationships between classes, to bridge the gap between known (seen) and unknown (unseen) categories.

For example, if a model is trained to recognize cats and dogs but has never seen a zebra, it can still identify a zebra by using a description like "a horse-like animal with black and white stripes". The model connects this new description to its existing knowledge of animals, allowing it to infer the correct label even without direct examples.
Zero-Shot vs. One-Shot vs. Few-Shot: Comparison
Before diving deeper, it’s helpful to contrast zero-shot learning with its close cousins: one-shot learning and few-shot learning, as well as standard supervised learning.
These all fall under the umbrella of “n-shot learning,” where n refers to the number of examples of each new class provided during training.
- Supervised Learning (Traditional): Models need many labeled examples per class and can't recognize classes they haven't seen, e.g., a model trained on dogs and cats won’t recognize a bird.
- Few-Shot Learning (FSL): Models learn from a handful of labeled examples (like 5–10) per new class, using meta-learning or fine-tuning to adapt quickly.
- One-Shot Learning (OSL): A special case of few-shot learning where the model is given exactly one labeled example of the new class to learn from. For example, given one photo of a new species of flower, a one-shot model tries to recognize that flower in future images.
- Zero-Shot Learning (ZSL): No labeled examples of the target classes are given. The model must rely on external knowledge (attributes, textual descriptions, etc.) and what it learned from other classes to make the leap to unseen classes. It’s the most extreme case – analogous to identifying an animal species you’ve never seen based purely on reading a description of it.
This spectrum reflects a key tradeoff: more labeled examples per class generally improve accuracy, but they also raise the cost of adding new classes and offer no help when a category has no labeled data at all.
How Do Zero-Shot Learning Models Work?
First, here are some key terms you need to know:
- Seen Classes: The set of categories the model is trained on, with labeled examples available during training (e.g., cats, dogs, horses).
- Unseen Classes: Categories that are absent from the training data; the model must generalize to these at test time using indirect evidence.
- Auxiliary Information: Additional data that describes both seen and unseen classes, such as semantic attributes, textual descriptions, or word embeddings. This information enables the model to connect visual features to new categories it has never encountered.

At the heart of zero-shot learning is the idea of using indirect evidence to make predictions about unseen classes. Since the model has never been trained on the target class, we must answer: how can it recognize something it's never seen?
The solution is to represent classes in terms of a shared semantic space that connects unseen and seen classes through auxiliary information. This typically involves a two-step process:
Training (Learning General Representations)
During the training phase, the model learns to map visual features from seen classes into a shared embedding space where they can be associated with semantic representations.
As shown in the diagram, a pretrained embedding network (θ) processes labeled images from seen classes (cats, horses, dogs) through supervised learning. The key insight is that this embedding must capture transferable visual features rather than class-specific patterns.

The model simultaneously learns to encode auxiliary information (attributes, textual descriptions, or class prototypes) into the same semantic space using techniques like:
- Semantic embeddings that map both visual and textual modalities to a common representation, as done by this CVPR 2022 Workshop paper.

- Variational autoencoders that generate latent representations bridging visual and semantic domains, as done in this CVPR 2022 paper.

- Multi-attention mechanisms that discover discriminative visual parts guided by semantic descriptions, as done by the SGMA model.

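To make this concrete, below is a minimal sketch of the training step, assuming precomputed visual features from a frozen backbone and per-class attribute vectors. All names, dimensions, and the simple cross-entropy compatibility loss are illustrative choices, not any specific paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions: 2048-d visual features (e.g., from a frozen CNN)
# and 85-d class attribute vectors (AWA-style benchmarks use 85 attributes).
VIS_DIM, ATTR_DIM, EMB_DIM = 2048, 85, 512

class JointEmbedding(nn.Module):
    """Projects visual features and class semantics into one shared space."""
    def __init__(self):
        super().__init__()
        self.visual_proj = nn.Linear(VIS_DIM, EMB_DIM)     # theta: image features -> shared space
        self.semantic_proj = nn.Linear(ATTR_DIM, EMB_DIM)  # phi: class attributes -> shared space

    def forward(self, visual_feats, class_attrs):
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)   # (batch, EMB_DIM)
        s = F.normalize(self.semantic_proj(class_attrs), dim=-1)  # (num_classes, EMB_DIM)
        return v @ s.T  # compatibility scores: (batch, num_classes)

model = JointEmbedding()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(visual_feats, labels, seen_class_attrs):
    # Cross-entropy over compatibility scores pushes each image toward the
    # semantic embedding of its own (seen) class. In practice a temperature
    # scaling on the scores is often added to sharpen the softmax.
    scores = model(visual_feats, seen_class_attrs)
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The essential property is that both modalities land in the same space, so a class only needs a semantic vector, not labeled images, to be scored later.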
Inference (Transferring to Unseen Classes)
During inference, the magic happens through cross-modal alignment. When presented with an unseen zebra image, the pretrained embedding network extracts visual features without updating its parameters. Simultaneously, auxiliary information ("a horse-like animal with black and white stripes") gets encoded through a text encoder into the same semantic space.
A projection network then performs compatibility matching between these representations. Recent generalized zero-shot learning (GZSL) approaches use sophisticated techniques to:
- Distinguish between seen and unseen domains using Wasserstein distance and dual variational autoencoders.
- Generate synthetic examples of unseen classes using class-conditioned generative models, like the U-SAM model, which uses Stable Diffusion to generate samples for zero-shot segmentation.

- Calibrate confidence scores to prevent bias toward seen classes during classification, like the Deep Calibration Network (DCN).

The final prediction emerges from finding the unseen class with maximum compatibility in this shared semantic space, effectively enabling the model to "imagine" what a zebra looks like based on its learned understanding of horse-like features combined with stripe patterns.
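Continuing the illustrative sketch from the training section, generalized zero-shot inference can look like the following. The fixed penalty subtracted from seen-class scores is a common GZSL heuristic (often called calibrated stacking), shown here only as a simple stand-in for the calibration idea, not as the mechanism of DCN or the other papers above.

```python
import torch

@torch.no_grad()
def predict_gzsl(model, visual_feats, all_class_attrs, seen_mask, gamma=0.3):
    """
    visual_feats:    (batch, VIS_DIM) features of test images
    all_class_attrs: (num_seen + num_unseen, ATTR_DIM) attributes for every candidate class
    seen_mask:       boolean tensor marking which rows of all_class_attrs are seen classes
    gamma:           hand-tuned calibration factor penalizing seen-class scores
    """
    scores = model(visual_feats, all_class_attrs)        # compatibility in the shared space
    scores[:, seen_mask] = scores[:, seen_mask] - gamma  # reduce bias toward seen classes
    return scores.argmax(dim=1)                          # class with maximum compatibility
```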
How to Choose a Zero-Shot Learning Method for Your Use Case
The choice between zero-shot learning approaches depends on your specific requirements and constraints. Consider your auxiliary information type (attributes vs. textual descriptions), computational constraints, and whether you need generalized zero-shot learning (recognizing both seen and unseen classes simultaneously) to guide your method selection.
The broad classes of ZSL methods that are popular in the literature include:
1. Classifier-based
Classifier-based methods work best when you have rich semantic descriptions and need interpretable decision boundaries. These methods construct explicit classifiers for unseen classes using correspondence between visual features and semantic prototypes.

The U-SAM framework (CVPR 2025) is a strong example of a classifier-based ZSL method. It extends the Segment Anything Model (SAM) by integrating a semantic classifier that allows the model to recognize and segment novel object categories using only class names as input.
During training, U-SAM leverages synthetic images generated from text prompts or web-crawled images to provide semantic grounding for each class, bypassing the need for real labeled data. The model employs a multi-layer perceptron (MLP) classifier head that learns to map SAM’s generic mask embeddings to specific object categories, guided by these semantic cues.
What sets U-SAM apart is its ability to perform zero-shot segmentation: at inference, it requires only the class names to predict semantic masks for previously unseen categories and datasets. This approach demonstrates the power and flexibility of classifier-based ZSL, as U-SAM constructs explicit classifiers for new classes based on semantic information rather than visual examples.
The result is a system that achieves significant improvements over baselines in open-world segmentation tasks, highlighting the effectiveness of combining foundation models with semantic classifier heads for scalable zero-shot recognition.
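As an illustration of the classifier-based recipe (not U-SAM's actual implementation), the sketch below shows a head that turns class-name text embeddings into classifier weights, so adding an unseen category only requires encoding its name. The dimensions and the MLP shape are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassifierHead(nn.Module):
    """Builds per-class classifier weights from class-name embeddings, so new
    classes need only a name or description, never labeled images."""
    def __init__(self, feat_dim=256, text_dim=512):
        super().__init__()
        # MLP mapping a class's text embedding to a weight vector that
        # operates on the backbone's feature (e.g., mask) embeddings.
        self.weight_generator = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim)
        )

    def forward(self, feature_embeddings, class_text_embeddings):
        # feature_embeddings:    (num_regions, feat_dim), e.g., mask embeddings
        # class_text_embeddings: (num_classes, text_dim), e.g., encoded class names
        class_weights = F.normalize(self.weight_generator(class_text_embeddings), dim=-1)
        feats = F.normalize(feature_embeddings, dim=-1)
        return feats @ class_weights.T  # logits over classes, including unseen ones

# Usage idea: logits = head(region_embeddings, some_text_encoder(["zebra", "traffic cone"]))
# where some_text_encoder is any model producing 512-d class-name embeddings.
```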

2. Instance-based
Instance-based methods excel when you have strong visual similarities between seen and unseen classes, as they focus on matching individual instances rather than building explicit classifiers.
For example, SDGZSL introduces a novel method called Semantic Neighborhood Borrowing (SNB). Unlike projection-based or segmentation-focused methods, SNB operates purely at the semantic level by identifying and leveraging implicit relationships between seen and unseen class attributes.

The method constructs a semantic graph where nodes represent classes (both seen and unseen) and edges encode attribute similarity. For each unseen class, SNB identifies its k-nearest semantic neighbors among seen classes using attribute cosine similarity.
During training, the model borrows instance features from these neighboring seen classes to build a prototype for the unseen class, weighted by their semantic proximity. This allows the model to create pseudo-instances for unseen classes without synthetic feature generation or visual comparisons.

On the AWA2 benchmark, SNB achieved 72.1% accuracy on unseen classes (+8.3% over prior instance-borrowing methods) while maintaining 64.9% accuracy on seen classes in generalized ZSL settings. This demonstrates how instance-based methods can effectively leverage semantic relationships without requiring visual similarity assumptions or synthetic data pipelines.
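A minimal sketch of the borrowing step described above, assuming per-class attribute vectors and mean visual features for the seen classes are already available; the function name, the value of k, and the softmax weighting are illustrative choices.

```python
import torch
import torch.nn.functional as F

def borrow_prototype(unseen_attr, seen_attrs, seen_feature_means, k=5):
    """
    unseen_attr:        (ATTR_DIM,) attribute vector of one unseen class
    seen_attrs:         (num_seen, ATTR_DIM) attribute vectors of the seen classes
    seen_feature_means: (num_seen, FEAT_DIM) mean visual features per seen class
    Returns a pseudo-prototype for the unseen class: a similarity-weighted
    average of its k nearest seen neighbors in attribute space.
    """
    sims = F.cosine_similarity(unseen_attr.unsqueeze(0), seen_attrs, dim=1)  # (num_seen,)
    topk = sims.topk(k)
    weights = torch.softmax(topk.values, dim=0)  # weights from semantic proximity
    return (weights.unsqueeze(1) * seen_feature_means[topk.indices]).sum(dim=0)
```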
3. Generative Approaches
Generative approaches using GANs or VAEs are ideal when you need robust performance across diverse unseen classes, as they can synthesize training examples for better generalization.
For example, ZeroDiff (ICLR 2025) introduces a generative zero-shot learning framework designed to overcome the problem of spurious visual-semantic correlations, which become especially pronounced when training data is scarce. ZeroDiff employs a diffusion-based generative model that augments limited seen-class data by generating a wide variety of noised samples.
This diffusion augmentation helps prevent overfitting and enables the generative model to better capture the underlying distribution of visual features, even with just a fraction of the original training data.

A key innovation in ZeroDiff is its use of supervised-contrastive (SC) representations, which dynamically encode the distinctive characteristics of each training sample. These representations guide the generation of visual features, ensuring that the synthesized data remains semantically meaningful and diverse.

Additionally, ZeroDiff employs multiple discriminators, each evaluating generated features from a different perspective - predefined semantics, SC-based representations, and the diffusion process itself. These discriminators are trained using a Wasserstein-distance-based mutual learning approach, which further solidifies the alignment between visual and semantic information. As a result, ZeroDiff achieves robust zero-shot learning performance, outperforming previous generative methods and maintaining high accuracy even when only 10% of the seen-class training data is available.
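To show the general shape shared by generative ZSL methods (not ZeroDiff's diffusion pipeline specifically), the sketch below follows the common recipe: train a feature generator conditioned on class semantics on seen classes, then synthesize features for unseen classes and fit an ordinary classifier on them. The architecture and sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Generates synthetic visual features conditioned on class semantics."""
    def __init__(self, attr_dim=85, noise_dim=128, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim), nn.ReLU(),
        )

    def forward(self, class_attrs, noise):
        return self.net(torch.cat([class_attrs, noise], dim=1))

# After training the generator on seen classes (adversarially or otherwise),
# synthesize features for unseen classes and train a standard softmax classifier.
generator = ConditionalFeatureGenerator()
unseen_attrs = torch.rand(10, 85)                   # placeholder attribute vectors for 10 unseen classes
attrs = unseen_attrs.repeat_interleave(200, dim=0)  # 200 synthetic samples per unseen class
noise = torch.randn(attrs.size(0), 128)
synthetic_feats = generator(attrs, noise)           # features to train the final classifier on
```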
