Zero-shot learning allows models to recognize classes they were never trained on by using auxiliary knowledge like semantic descriptions or embeddings. It transfers understanding from known to unknown concepts, enabling predictions without labeled examples. Common in NLP and computer vision, it supports tasks like text classification or object recognition using only high-level descriptions of the new categories.
Here is some key information on what zero-shot learning is and how it works.
Zero-shot learning (ZSL) is a machine learning approach where a model can recognize or classify previously unseen classes without any direct labeled examples of those classes in its training data.
Instead of learning from task-specific data, the model leverages auxiliary knowledge (like semantic descriptions or embeddings) to make predictions about new concepts.
ZSL works by transferring knowledge from seen classes to unseen classes using shared information. During training, the model learns a general representation of data (often through pre-training on large datasets).
At inference, it is provided with additional context about the new classes (e.g. textual descriptions or attribute vectors) which situate those classes in a semantic space. The model then matches inputs to the closest class description or embedding, outputting a prediction even though it never saw examples of that class during training.
For instance, imagine a vision model trained on images of horses but never on zebras. If told that a “zebra looks like a horse with stripes,” the model can identify zebras by combining its learned concept of a horse with the new “striped” attribute. In NLP, a language model might classify movie reviews as Positive or Negative sentiment without explicit training for sentiment, by leveraging its general language understanding and a prompt that describes the task.
In natural language processing, ZSL enables models to perform new linguistic tasks without task-specific training data. A common example is zero-shot text classification: using a pre-trained NLI (Natural Language Inference) model or large language model to categorize text into new labels by feeding the label names or descriptions as hypotheses or prompts.
For example, Hugging Face’s zero-shot classifier can take a news headline and label it with topics (politics, sports, etc.) by assessing how well the headline “entails” each topic description. This allows text classification, sentiment analysis, topic labeling, etc., on-the-fly for categories the model never explicitly saw during training.
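To make this concrete, here is a minimal sketch using the Hugging Face `transformers` zero-shot classification pipeline. The model name, headline, and candidate labels below are illustrative choices, not requirements.

```python
# Minimal sketch: zero-shot text classification with an NLI-based model.
# Assumes the `transformers` library is installed; facebook/bart-large-mnli
# is one commonly used model, not the only option.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # NLI model repurposed for arbitrary labels
)

headline = "Government announces new tax reform ahead of elections"
candidate_labels = ["politics", "sports", "technology", "business"]

result = classifier(headline, candidate_labels=candidate_labels)
# `result["labels"]` is sorted by score; the top entry is the predicted topic.
print(list(zip(result["labels"], [round(s, 3) for s in result["scores"]])))
```

Under the hood, the pipeline turns each candidate label into an entailment hypothesis (roughly "This example is about politics.") and scores how strongly the input entails it, which is why no sentiment- or topic-specific training is needed.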
Zero-shot learning uses no labeled examples of a target class, relying on auxiliary information and knowledge transfer alone. Few-shot learning (FSL) provides the model with a small number of examples (e.g. 5 or 10) of the new class to learn from. In zero-shot, the model must generalize purely from what it knows about related classes or descriptions, whereas in few-shot the model gets a “hint” via a few training samples.
Few-shot and one-shot approaches still require some labeled data for new classes, making them easier in practice than zero-shot but less flexible in truly novel scenarios. (In short: zero-shot = 0 examples, one-shot = 1 example, few-shot ≈ a handful of examples.)
Zero-shot learning (ZSL) is a machine learning technique that enables models to recognize and classify objects, concepts, or categories they have never encountered during training, without needing any labeled examples of those unseen classes.
Zero-shot learning relies on knowledge transfer from seen to previously unseen classes, allowing the model to generalize beyond its training data.
Unlike traditional supervised learning, which requires extensive labeled data for every category, zero-shot learning leverages auxiliary information, such as semantic attributes, text descriptions, or relationships between classes, to bridge the gap between known (seen) and unknown (unseen) categories.
For example, if a model is trained to recognize cats and dogs but has never seen a zebra, it can still identify a zebra by using a description like "a horse-like animal with black and white stripes". The model connects this new description to its existing knowledge of animals, allowing it to infer the correct label even without direct examples.
Before diving deeper, it’s helpful to contrast zero-shot learning with its close cousins: one-shot learning and few-shot learning, as well as standard supervised learning.
These all fall under the umbrella of “n-shot learning,” where n refers to the number of examples of each new class provided during training.
This spectrum shows a key tradeoff: more examples improve accuracy but reduce flexibility.
First, here are some key terms you need to know:

- Seen classes: categories with labeled examples available during training.
- Unseen classes: categories the model must recognize at test time without any labeled training examples.
- Auxiliary information: attributes, text descriptions, or embeddings that describe classes and connect seen to unseen categories.
- Semantic space: a shared embedding space where inputs and class descriptions can be compared.
- Generalized zero-shot learning (GZSL): the setting where the model must recognize both seen and unseen classes at test time.
At the heart of zero-shot learning is the idea of using indirect evidence to make predictions about unseen classes. Since the model has never been trained on the target class, we must answer: how can it recognize something it's never seen?
The solution is to represent classes in terms of a shared semantic space that connects unseen and seen classes through auxiliary information. This typically involves a two-step process:
During the training phase, the model learns to map visual features from seen classes into a shared embedding space where they can be associated with semantic representations.
As shown in the diagram, a pretrained embedding network (θ) processes labeled images from seen classes (cats, horses, dogs) through supervised learning. The key insight is that this embedding must capture transferable visual features rather than class-specific patterns.
The model simultaneously learns to encode auxiliary information (attributes, textual descriptions, or class prototypes) into the same semantic space, typically using attribute embeddings, word vectors, or pretrained text encoders.
During inference, the magic happens through cross-modal alignment. When presented with an unseen zebra image, the pretrained embedding network extracts visual features without updating its parameters. Simultaneously, auxiliary information ("a horse-like animal with black and white stripes") gets encoded through a text encoder into the same semantic space.
A projection network then performs compatibility matching between these representations. Recent generalized zero-shot learning (GZSL) approaches add calibration and gating techniques to counter the model's natural bias toward seen classes, so that both seen and unseen categories remain recognizable at test time.
The final prediction emerges from finding the unseen class with maximum compatibility in this shared semantic space, effectively enabling the model to "imagine" what a zebra looks like based on its learned understanding of horse-like features combined with stripe patterns.
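As a rough illustration of this two-step recipe, the sketch below trains a simple linear projection from placeholder visual features into a semantic space using seen classes only, then classifies a new input by its nearest unseen-class embedding. All dimensions, random data, and the single-layer projection are assumptions for illustration; real systems use pretrained encoders and richer compatibility functions.

```python
# Hedged sketch of the two-step ZSL recipe:
# (1) learn a projection from visual features into a semantic space on seen classes,
# (2) classify an unseen-class input by its nearest class embedding.
# All tensors are random placeholders for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

vis_dim, sem_dim = 512, 300            # e.g., CNN feature dim and word-vector dim
num_seen, num_unseen = 10, 5

seen_feats = torch.randn(1000, vis_dim)            # features of seen-class images
seen_labels = torch.randint(0, num_seen, (1000,))  # their class indices
seen_class_emb = F.normalize(torch.randn(num_seen, sem_dim), dim=-1)
unseen_class_emb = F.normalize(torch.randn(num_unseen, sem_dim), dim=-1)

# Step 1: train a projection so image features land near their class embedding.
proj = nn.Linear(vis_dim, sem_dim)
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
for _ in range(100):
    z = F.normalize(proj(seen_feats), dim=-1)
    logits = z @ seen_class_emb.T              # cosine compatibility scores
    loss = F.cross_entropy(logits * 10.0, seen_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: at inference, match a new image's features to *unseen* class embeddings.
test_feat = torch.randn(1, vis_dim)            # e.g., features of a zebra image
scores = F.normalize(proj(test_feat), dim=-1) @ unseen_class_emb.T
pred = scores.argmax(dim=-1)                   # index of the best-matching unseen class
```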
The choice between zero-shot learning approaches depends on your specific requirements and constraints. Consider your auxiliary information type (attributes vs. textual descriptions), computational constraints, and whether you need generalized zero-shot learning (recognizing both seen and unseen classes simultaneously) to guide your method selection.
The broad classes of ZSL methods that are popular in the literature include classifier-based, instance-based, and generative approaches.
Classifier-based methods work best when you have rich semantic descriptions and need interpretable decision boundaries. These methods construct explicit classifiers for unseen classes using correspondence between visual features and semantic prototypes.
The U-SAM framework (CVPR 2025) is a strong example of a classifier-based ZSL method. It extends the Segment Anything Model (SAM) by integrating a semantic classifier that allows the model to recognize and segment novel object categories using only class names as input.
During training, U-SAM leverages synthetic images generated from text prompts or web-crawled images to provide semantic grounding for each class, bypassing the need for real labeled data. The model employs a multi-layer perceptron (MLP) classifier head that learns to map SAM’s generic mask embeddings to specific object categories, guided by these semantic cues.
What sets U-SAM apart is its ability to perform zero-shot segmentation: at inference, it requires only the class names to predict semantic masks for previously unseen categories and datasets. This approach demonstrates the power and flexibility of classifier-based ZSL, as U-SAM constructs explicit classifiers for new classes based on semantic information rather than visual examples.
The result is a system that achieves significant improvements over baselines in open-world segmentation tasks, highlighting the effectiveness of combining foundation models with semantic classifier heads for scalable zero-shot recognition.
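Leaving U-SAM's specifics aside, the core classifier-based idea can be sketched generically: learn a small network that turns a class's semantic vector into classifier weights on seen classes, then synthesize classifiers for unseen classes from their semantic vectors alone. Everything below (dimensions, random data, architecture) is a hypothetical illustration, not U-SAM's implementation.

```python
# Generic, hedged sketch of classifier-based ZSL: a "weight generator" maps a
# class's semantic vector to linear classifier weights. Trained on seen classes,
# it can then produce classifiers for unseen classes with no visual examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, sem_dim, num_seen, num_unseen = 512, 300, 10, 5
seen_feats = torch.randn(2000, feat_dim)
seen_labels = torch.randint(0, num_seen, (2000,))
seen_sem = torch.randn(num_seen, sem_dim)      # semantic prototypes of seen classes
unseen_sem = torch.randn(num_unseen, sem_dim)  # ... and of unseen classes

weight_gen = nn.Sequential(nn.Linear(sem_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
opt = torch.optim.Adam(weight_gen.parameters(), lr=1e-3)

for _ in range(200):
    W = weight_gen(seen_sem)                   # one weight vector per seen class
    logits = seen_feats @ W.T                  # linear classifiers built from semantics
    loss = F.cross_entropy(logits, seen_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Unseen classes get classifiers directly from their semantic descriptions.
W_unseen = weight_gen(unseen_sem)
test_feat = torch.randn(1, feat_dim)
pred = (test_feat @ W_unseen.T).argmax(dim=-1)
```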
Instance-based methods excel when you have strong visual similarities between seen and unseen classes, as they focus on matching individual instances rather than building explicit classifiers.
For example, SDGZSL introduces a novel method called Semantic Neighborhood Borrowing (SNB). Unlike projection-based or segmentation-focused methods, SNB operates purely at the semantic level by identifying and leveraging implicit relationships between seen and unseen class attributes.
The method constructs a semantic graph where nodes represent classes (both seen and unseen) and edges encode attribute similarity. For each unseen class, SNB identifies its k-nearest semantic neighbors among seen classes using attribute cosine similarity.
During training, the model borrows instance features from these neighboring seen classes to build a prototype for the unseen class, weighted by their semantic proximity. This allows the model to create pseudo-instances for unseen classes without synthetic feature generation or visual comparisons.
On the AWA2 benchmark, SNB achieved 72.1% accuracy on unseen classes (+8.3% over prior instance-borrowing methods) while maintaining 64.9% accuracy on seen classes in generalized ZSL settings. This demonstrates how instance-based methods can effectively leverage semantic relationships without requiring visual similarity assumptions or synthetic data pipelines.
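The borrowing mechanism itself is simple to sketch. The code below is an illustrative approximation (not the exact SNB algorithm): it finds each unseen class's k most similar seen classes by attribute cosine similarity and builds its prototype as a similarity-weighted average of those seen-class feature prototypes.

```python
# Hedged sketch of instance/prototype borrowing for ZSL. All arrays are random
# placeholders; real systems use extracted features and curated attributes.
import numpy as np

num_seen, num_unseen, attr_dim, feat_dim, k = 10, 5, 85, 512, 3
seen_attrs = np.random.rand(num_seen, attr_dim)
unseen_attrs = np.random.rand(num_unseen, attr_dim)
seen_prototypes = np.random.randn(num_seen, feat_dim)   # mean features per seen class

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

sim = cosine(unseen_attrs, seen_attrs)                  # (num_unseen, num_seen)
unseen_prototypes = np.zeros((num_unseen, feat_dim))
for c in range(num_unseen):
    nn_idx = np.argsort(-sim[c])[:k]                    # k most similar seen classes
    w = sim[c, nn_idx] / sim[c, nn_idx].sum()           # similarity weights
    unseen_prototypes[c] = w @ seen_prototypes[nn_idx]  # borrowed prototype

# A test image is then assigned to the nearest prototype (seen or unseen).
test_feat = np.random.randn(feat_dim)
all_protos = np.vstack([seen_prototypes, unseen_prototypes])
pred = int(np.argmax(cosine(test_feat[None, :], all_protos)))
```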
Generative approaches using GANs or VAEs are ideal when you need robust performance across diverse unseen classes, as they can synthesize training examples for better generalization.
For example, ZeroDiff (ICLR 2025) introduces a generative zero-shot learning framework designed to overcome the problem of spurious visual-semantic correlations, which become especially pronounced when training data is scarce. ZeroDiff employs a diffusion-based generative model that augments limited seen-class data by generating a wide variety of noised samples.
This diffusion augmentation helps prevent overfitting and enables the generative model to better capture the underlying distribution of visual features, even with just a fraction of the original training data.
A key innovation in ZeroDiff is its use of supervised-contrastive (SC) representations, which dynamically encode the distinctive characteristics of each training sample. These representations guide the generation of visual features, ensuring that the synthesized data remains semantically meaningful and diverse.
Additionally, ZeroDiff employs multiple discriminators, each evaluating generated features from a different perspective: predefined semantics, SC-based representations, and the diffusion process itself. These discriminators are trained using a Wasserstein-distance-based mutual learning approach, which further solidifies the alignment between visual and semantic information. As a result, ZeroDiff achieves robust zero-shot learning performance, outperforming previous generative methods and maintaining high accuracy even when only 10% of the seen-class training data is available.
Zero-shot learning is a problem setup, not a single algorithm.
Over the years, researchers have devised various strategies to enable models to make zero-shot predictions. Here we outline the major approaches, each with its strengths and typical use cases.
Often, modern ZSL systems combine elements of these methods.
Attribute-based zero-shot learning (AZSL) represents one of the earliest and most interpretable approaches to ZSL. It relies on human-defined semantic attributes (mid-level features like "striped," "winged," or "aquatic") to bridge seen and unseen classes.
These attributes act as a shared vocabulary that connects known and unknown categories through logical combinations.
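A minimal sketch of this idea, in the spirit of Direct Attribute Prediction, is shown below: an attribute classifier trained only on seen classes scores each attribute for a test image, and the predicted class is the one whose attribute signature best matches. The attribute values and probabilities here are invented placeholders.

```python
# Hedged sketch of attribute-based ZSL: match predicted per-attribute scores
# against hand-defined class attribute signatures. Values are illustrative.
import numpy as np

attributes = ["striped", "hooved", "aquatic", "winged"]

# Binary attribute signatures per class, including the unseen "zebra".
class_signatures = {
    "horse": np.array([0, 1, 0, 0]),
    "fish":  np.array([0, 0, 1, 0]),
    "zebra": np.array([1, 1, 0, 0]),   # unseen: "a horse-like animal with stripes"
}

# Pretend these are the per-attribute probabilities an attribute classifier
# (trained only on seen classes) produced for a test image.
predicted_attr_probs = np.array([0.9, 0.8, 0.1, 0.05])

def score(signature, probs):
    # Likelihood-style score: p if the attribute is present, (1 - p) otherwise.
    return np.prod(np.where(signature == 1, probs, 1.0 - probs))

best = max(class_signatures, key=lambda c: score(class_signatures[c], predicted_attr_probs))
print(best)  # "zebra" wins even though no zebra image was used in training
```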
Attribute-based ZSL has clear limitations, however: attributes must be defined and annotated by hand, and a fixed attribute vocabulary scales poorly to large or fine-grained class sets. To address these limitations, recent work combines attribute-based reasoning with automated techniques such as learned semantic embeddings.
Semantic embedding methods have revolutionized zero-shot learning (ZSL) by automating the alignment of visual and textual modalities. Instead of relying on handcrafted attributes, these approaches project images and class descriptions into a shared high-dimensional space where similarity metrics drive predictions.
1. CLIP (Contrastive Language-Image Pretraining):
Trained on 400M image-text pairs, CLIP achieves 98.7% zero-shot accuracy on tasks like Imagenette by reformatting class labels into prompts (e.g., "a photo of a {label}"). Its vision-text alignment enables open-world recognition without fine-tuning; a minimal usage sketch follows this list.
2. DeViSE (Deep Visual-Semantic Embedding):
Pioneered cross-modal alignment by fine-tuning CNNs to predict Word2Vec vectors of class names, enabling 4.5× better generalization to unseen classes compared to traditional classifiers.
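For reference, here is a minimal sketch of CLIP-style zero-shot image classification using the Hugging Face `transformers` CLIP implementation; the image path and label set are placeholders.

```python
# Hedged sketch: CLIP zero-shot image classification via prompt templates.
# Assumes `transformers`, `torch`, and `Pillow` are installed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["zebra", "horse", "dog"]
prompts = [f"a photo of a {label}" for label in labels]   # prompt template trick
image = Image.open("example.jpg")                          # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores turned into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```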
Generative methods address zero-shot learning (ZSL) by synthesizing artificial examples of unseen classes, effectively converting the problem into a supervised learning task.
These approaches use auxiliary information, such as textual descriptions, class attributes, or semantic embeddings, to guide the generation of plausible visual features or images for unseen categories.
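The overall pipeline can be sketched as follows. This is a heavily simplified stand-in: the generator below is trained with a plain reconstruction loss, whereas real methods use GAN, VAE, or diffusion objectives, and all data here is random placeholder content.

```python
# Hedged sketch of generative ZSL: a conditional generator maps
# (class attributes + noise) -> synthetic visual features, synthetic features
# are sampled for unseen classes, and a plain softmax classifier is trained on them.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, attr_dim, noise_dim = 512, 85, 64
num_seen, num_unseen = 10, 5

seen_feats = torch.randn(2000, feat_dim)
seen_labels = torch.randint(0, num_seen, (2000,))
seen_attrs = torch.randn(num_seen, attr_dim)
unseen_attrs = torch.randn(num_unseen, attr_dim)

gen = nn.Sequential(nn.Linear(attr_dim + noise_dim, 1024), nn.ReLU(), nn.Linear(1024, feat_dim))
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

# Train the generator on seen classes (simplified objective, not adversarial).
for _ in range(200):
    noise = torch.randn(len(seen_feats), noise_dim)
    fake = gen(torch.cat([seen_attrs[seen_labels], noise], dim=-1))
    loss = F.mse_loss(fake, seen_feats)
    opt.zero_grad(); loss.backward(); opt.step()

# Synthesize features for unseen classes, then train an ordinary classifier on them.
n_per_class = 100
syn_labels = torch.arange(num_unseen).repeat_interleave(n_per_class)
noise = torch.randn(num_unseen * n_per_class, noise_dim)
syn_feats = gen(torch.cat([unseen_attrs[syn_labels], noise], dim=-1)).detach()

clf = nn.Linear(feat_dim, num_unseen)
clf_opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
for _ in range(200):
    loss = F.cross_entropy(clf(syn_feats), syn_labels)
    clf_opt.zero_grad(); loss.backward(); clf_opt.step()
```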
Evaluating ZSL models requires metrics that go beyond standard accuracy to reflect performance on unseen classes.
No single metric captures all aspects of ZSL performance. Researchers often report multiple metrics to provide a comprehensive view, with Top-1 accuracy and harmonic mean being the most widely adopted benchmarks.
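The harmonic mean used in generalized ZSL is easy to state directly: it combines seen-class and unseen-class accuracy so that a model which neglects unseen classes cannot score well.

```python
# Harmonic mean of seen- and unseen-class accuracy, as used in GZSL evaluation.
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Example: strong seen-class accuracy but weak unseen-class accuracy is penalized.
print(harmonic_mean(0.80, 0.20))  # 0.32
```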
Below, we highlight some of the most impactful applications of ZSL in Computer Vision.
Image Classification with Unseen Categories: ZSL enables models to recognize new categories without labeled images by relying on semantic descriptions (e.g., attributes or text). This is useful in fields like wildlife monitoring or medical imaging, where new or rare classes may not have training data.
Pro tip: Read our Introduction to ViT (Vision Transformers): Everything You Need to Know.
Pro tip: Check out The Engineer's Guide to Large Vision Models.
In this section, we briefly note some landmark papers and what they contributed, as well as where the field is heading.
Zero-shot learning has moved from theory to real-world impact, enabling models to handle new tasks without labeled data by leveraging semantic understanding. It’s already transforming fields like vision, language, and multimodal AI.
While challenges like bias, semantic gaps, and reliance on quality descriptions remain, advances in foundation models and alignment techniques are rapidly improving ZSL’s reliability.
As AI systems grow more adaptable and general, zero-shot learning will be key to building flexible, data-efficient solutions for the future.