Zero-Shot Learning: A Practical Guide


Zero-shot learning allows models to recognize classes they were never trained on by using auxiliary knowledge like semantic descriptions or embeddings. It transfers understanding from known to unknown concepts, enabling predictions without labeled examples. Common in NLP and computer vision, it supports tasks like text classification or object recognition using only high-level descriptions of the new categories.


Here is some key information on what zero-shot learning is and how it works.

TL;DR
  • What is zero-shot learning? 

Zero-shot learning (ZSL) is a machine learning approach where a model can recognize or classify previously unseen classes without any direct labeled examples of those classes in its training data​. 

Instead of learning from task-specific data, the model leverages auxiliary knowledge (like semantic descriptions or embeddings) to make predictions about new concepts.

  • How does zero-shot learning work? 

ZSL works by transferring knowledge from seen classes to unseen classes using shared information. During training, the model learns a general representation of data (often through pre-training on large datasets​). 

At inference, it is provided with additional context about the new classes (e.g. textual descriptions or attribute vectors) which situate those classes in a semantic space. The model then matches inputs to the closest class description or embedding, outputting a prediction even though it never saw examples of that class during training​.

  • What is an example of zero-shot learning? 

For instance, imagine a vision model trained on images of horses but never on zebras. If told that a “zebra looks like a horse with stripes,” the model can identify zebras by combining its learned concept of a horse with the new “striped” attribute​. In NLP, a language model might classify movie reviews as Positive or Negative sentiment without explicit training for sentiment, by leveraging its general language understanding and a prompt that describes the task.

  • What is zero-shot learning in NLP?

In natural language processing, ZSL enables models to perform new linguistic tasks without task-specific training data. A common example is zero-shot text classification: using a pre-trained NLI (Natural Language Inference) model or large language model to categorize text into new labels by feeding the label names or descriptions as hypotheses or prompts​. 

For example, Hugging Face’s zero-shot classifier can take a news headline and label it with topics (politics, sports, etc.) by assessing how well the headline “entails” each topic description​. This allows text classification, sentiment analysis, topic labeling, etc., on-the-fly for categories the model never explicitly saw during training.
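As an illustration, here is a minimal sketch of zero-shot topic classification with the Hugging Face transformers pipeline; the facebook/bart-large-mnli checkpoint, the example headline, and the candidate labels are just illustrative choices, not part of the original setup.

```python
from transformers import pipeline

# Zero-shot text classification via an NLI model: each candidate label is turned
# into a hypothesis and scored by how well the input "entails" it.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

headline = "Government announces new tax reform ahead of elections"
candidate_labels = ["politics", "sports", "technology", "entertainment"]

result = classifier(headline, candidate_labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```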

  • What is the difference between zero-shot and few-shot learning?

Zero-shot learning uses no labeled examples of a target class, relying on auxiliary information and knowledge transfer alone​. Few-shot learning (FSL) provides the model with a small number of examples (e.g. 5 or 10) of the new class to learn from​. In zero-shot, the model must generalize purely from what it knows about related classes or descriptions, whereas in few-shot the model gets a “hint” via a few training samples. 

Few-shot and one-shot approaches still require some labeled data for new classes, making them easier in practice than zero-shot but less flexible in truly novel scenarios​. (In short: zero-shot = 0 examples, one-shot = 1 example, few-shot ≈ a handful of examples.)

What is Zero-Shot Learning?

Zero-shot learning (ZSL) is a machine learning technique that enables models to recognize and classify objects, concepts, or categories they have never encountered during training, without needing any labeled examples of those unseen classes.

Zero-shot learning relies on knowledge transfer from seen to previously unseen classes, allowing the model to generalize beyond its training data.

Unlike traditional supervised learning, which requires extensive labeled data for every category, zero-shot learning leverages auxiliary information, such as semantic attributes, text descriptions, or relationships between classes, to bridge the gap between known (seen) and unknown (unseen) categories.

Figure 1: Context Harvesting.

For example, if a model is trained to recognize cats and dogs but has never seen a zebra, it can still identify a zebra by using a description like "a horse-like animal with black and white stripes". The model connects this new description to its existing knowledge of animals, allowing it to infer the correct label even without direct examples.

Zero-Shot vs. One-Shot vs. Few-Shot: Comparison

Before diving deeper, it’s helpful to contrast zero-shot learning with its close cousins: one-shot learning and few-shot learning, as well as standard supervised learning. 

These all fall under the umbrella of “n-shot learning,” where n refers to the number of examples of each new class provided during training​.

  • Supervised Learning (Traditional): Models need many labeled examples per class and can't recognize classes they haven't seen, e.g., a model trained on dogs and cats won’t recognize a bird.
  • Few-Shot Learning (FSL): Models learn from a handful of labeled examples (like 5–10) per new class, using meta-learning or fine-tuning to adapt quickly.
  • One-Shot Learning (OSL): A special case of few-shot learning where the model is given exactly one labeled example of the new class to learn from. For example, given one photo of a new species of flower, a one-shot model tries to recognize that flower in future images. 
  • Zero-Shot Learning (ZSL): No labeled examples of the target classes are given. The model must rely on external knowledge (attributes, textual descriptions, etc.) and what it learned from other classes to make the leap to unseen classes​. It’s the most extreme case – analogous to identifying an animal species you’ve never seen based purely on reading a description of it. 
Table 1: Supervised vs. One-Shot vs. Few-Shot vs. Zero-Shot Learning.

| Aspect | Supervised Learning | One-Shot Learning | Few-Shot Learning | Zero-Shot Learning |
|---|---|---|---|---|
| Examples per class | 100s-1M+ | 1 | Typically 5-100 | 0 |
| Data dependency | High | Moderate | Low | None |
| Knowledge source for new class | All knowledge learned from the labeled training data (no explicit external info) | That single example + prior knowledge (often transfer learning or similarity to known classes) | Those few examples + meta-learning or fine-tuning from a pre-trained model | Auxiliary info (attributes, descriptions, embeddings); relies on pre-trained models/knowledge |
| Best use case | Stable environments | Rare objects | Evolving domains | Unseen categories |

This spectrum shows a key tradeoff: more examples improve accuracy but reduce flexibility. 

How Do Zero-Shot Learning Models Work?

First, here are some key terms you need to know:

  • Seen Classes: The set of categories the model is trained on, with labeled examples available during training (e.g., cats, dogs, horses).
  • Unseen Classes: Categories that are absent from the training data; the model must generalize to these at test time using indirect evidence.
  • Auxiliary Information: Additional data that describes both seen and unseen classes, such as semantic attributes, textual descriptions, or word embeddings. This information enables the model to connect visual features to new categories it has never encountered.
Figure 2: An overview of how ZSL works. Image by the author.

At the heart of zero-shot learning is the idea of using indirect evidence to make predictions about unseen classes. Since the model has never been trained on the target class, we must answer: how can it recognize something it's never seen?

The solution is to represent classes in terms of a shared semantic space that connects unseen and seen classes through auxiliary information. This typically involves a two-step process:

Training (Learning General Representations)

During the training phase, the model learns to map visual features from seen classes into a shared embedding space where they can be associated with semantic representations. 

As shown in the diagram, a pretrained embedding network (θ) processes labeled images from seen classes (cats, horses, dogs) through supervised learning. The key insight is that this embedding must capture transferable visual features rather than class-specific patterns.
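As a toy illustration of this training step, the sketch below maps (placeholder) visual features of seen classes into the space of their class-attribute vectors with a simple linear projection; the feature and attribute tensors are random stand-ins, not a real dataset or the exact objective of any specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_seen, feat_dim, sem_dim = 3, 512, 85           # e.g. cats/dogs/horses, CNN feature size, attribute size
seen_attributes = torch.randn(num_seen, sem_dim)   # one semantic vector per seen class (placeholder)

projector = nn.Linear(feat_dim, sem_dim)           # visual -> semantic projection
optimizer = torch.optim.Adam(projector.parameters(), lr=1e-3)

for step in range(100):
    feats = torch.randn(32, feat_dim)              # batch of frozen backbone features (placeholder)
    labels = torch.randint(0, num_seen, (32,))     # seen-class labels

    projected = F.normalize(projector(feats), dim=-1)
    prototypes = F.normalize(seen_attributes, dim=-1)
    logits = projected @ prototypes.T              # cosine compatibility with each seen class
    loss = F.cross_entropy(logits / 0.1, labels)   # temperature-scaled classification loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The important design choice is that only the projection into the shared space is learned; the class prototypes come from auxiliary information, so new prototypes can be added later without retraining on images.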

Figure 3: Embedding illustration.

The model simultaneously learns to encode auxiliary information (attributes, textual descriptions, or class prototypes) into the same semantic space using techniques like:

  • Semantic embeddings that map both visual and textual modalities to a common representation, as done by this CVPR 2022 Workshop paper.
Figure 4: Overview of Joint Embeddings for Zero-Shot Learning.
  • Variational autoencoders that generate latent representations bridging visual and semantic domains, as done in this CVPR 2022 paper.
Figure 5: Variational Autoencoders.
  • Multi-attention mechanisms that discover discriminative visual parts guided by semantic descriptions, as done by the SGMA model.
Figure 6: SGMA Framework.

Inference (Transferring to Unseen Classes)

During inference, the magic happens through cross-modal alignment. When presented with an unseen zebra image, the pretrained embedding network extracts visual features without updating its parameters. Simultaneously, auxiliary information ("a horse-like animal with black and white stripes") gets encoded through a text encoder into the same semantic space.

A projection network then performs compatibility matching between these representations. Recent generalized zero-shot learning (GZSL) approaches use sophisticated techniques to:

  • Distinguish between seen and unseen domains using Wasserstein distance and dual variational autoencoders.
  • Generate synthetic examples of unseen classes using class-conditioned generative models, like the U-SAM model which uses StableDiffusion to generate samples for zero-shot segmentation.
Figure 7: U-SAM Architecture Overview.
Figure 8: Deep Calibration Network for Zero-Shot Learning.

The final prediction emerges from finding the unseen class with maximum compatibility in this shared semantic space, effectively enabling the model to "imagine" what a zebra looks like based on its learned understanding of horse-like features combined with stripe patterns.
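A minimal inference sketch, reusing the `projector` trained in the sketch above and using placeholder attribute vectors for the unseen classes, might look like this:

```python
import torch
import torch.nn.functional as F

unseen_names = ["zebra", "giraffe"]
unseen_attributes = torch.randn(len(unseen_names), 85)    # attribute vectors or text embeddings (placeholder)

image_feat = torch.randn(1, 512)                          # frozen feature extractor output (placeholder)
with torch.no_grad():
    query = F.normalize(projector(image_feat), dim=-1)    # `projector` was trained on seen classes only
    protos = F.normalize(unseen_attributes, dim=-1)
    scores = (query @ protos.T).squeeze(0)                # compatibility with each unseen class

prediction = unseen_names[scores.argmax().item()]
print(prediction, scores.tolist())
```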

How to Choose a Zero-Shot Learning Method for your Use Case

The choice between zero-shot learning approaches depends on your specific requirements and constraints. Consider your auxiliary information type (attributes vs. textual descriptions), computational constraints, and whether you need generalized zero-shot learning (recognizing both seen and unseen classes simultaneously) to guide your method selection.

The broad classes of ZSL methods that are popular in the literature include:

1. Classifier-based

Classifier-based methods work best when you have rich semantic descriptions and need interpretable decision boundaries. These methods construct explicit classifiers for unseen classes using correspondence between visual features and semantic prototypes.
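A generic sketch of this idea (not U-SAM or any specific paper, and using random placeholder data) is to learn a mapping from class semantics to classifier weights on the seen classes, then construct classifiers for unseen classes from their semantic prototypes alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, sem_dim = 512, 85
seen_sem = torch.randn(5, sem_dim)                # semantic prototypes of 5 seen classes (placeholder)
unseen_sem = torch.randn(2, sem_dim)              # semantic prototypes of 2 unseen classes (placeholder)

weight_generator = nn.Linear(sem_dim, feat_dim)   # semantics -> classifier weight vector
optimizer = torch.optim.Adam(weight_generator.parameters(), lr=1e-3)

for step in range(100):
    feats = torch.randn(64, feat_dim)             # visual features of seen-class images (placeholder)
    labels = torch.randint(0, 5, (64,))
    W_seen = weight_generator(seen_sem)           # (5, feat_dim) classifier built from semantics
    loss = F.cross_entropy(feats @ W_seen.T, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time, classifiers for unseen classes come "for free" from their semantics.
W_unseen = weight_generator(unseen_sem)           # (2, feat_dim)
test_feat = torch.randn(1, feat_dim)
print((test_feat @ W_unseen.T).argmax(dim=-1))    # index of the most compatible unseen class
```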

Figure 9: U-SAM Method Overview.

The U-SAM framework (CVPR 2025) is a strong example of a classifier-based ZSL method. It extends the Segment Anything Model (SAM) by integrating a semantic classifier that allows the model to recognize and segment novel object categories using only class names as input. 

During training, U-SAM leverages synthetic images generated from text prompts or web-crawled images to provide semantic grounding for each class, bypassing the need for real labeled data. The model employs a multi-layer perceptron (MLP) classifier head that learns to map SAM’s generic mask embeddings to specific object categories, guided by these semantic cues.

Table 2: Traditional Zero-Shot Learning vs. U-SAM Approach.

| Aspect | Traditional ZSL | U-SAM Approach |
|---|---|---|
| Input | Attributes/text | Class names only |
| Visual features | Pre-defined embeddings | SAM's open-world masks |
| Scalability | Fixed class space | Dynamic class definitions |

What sets U-SAM apart is its ability to perform zero-shot segmentation: at inference, it requires only the class names to predict semantic masks for previously unseen categories and datasets. This approach demonstrates the power and flexibility of classifier-based ZSL, as U-SAM constructs explicit classifiers for new classes based on semantic information rather than visual examples. 

The result is a system that achieves significant improvements over baselines in open-world segmentation tasks, highlighting the effectiveness of combining foundation models with semantic classifier heads for scalable zero-shot recognition.

Figure 10: Qualitative results obtained by U-SAM compared to the typical SAM model which cannot provide semantic labels for the segmented objects.

2. Instance-based

Instance-based methods excel when you have strong visual similarities between seen and unseen classes, as they focus on matching individual instances rather than building explicit classifiers.

For example, SDGZSL introduces a novel method called Semantic Neighborhood Borrowing (SNB). Unlike projection-based or segmentation-focused methods, SNB operates purely at the semantic level by identifying and leveraging implicit relationships between seen and unseen class attributes.

Figure 11: SDG Zero-Shot Learning.

The method constructs a semantic graph where nodes represent classes (both seen and unseen) and edges encode attribute similarity. For each unseen class, SNB identifies its k-nearest semantic neighbors among seen classes using attribute cosine similarity. 

During training, the model borrows instance features from these neighboring seen classes to build a prototype for the unseen class, weighted by their semantic proximity. This allows the model to create pseudo-instances for unseen classes without synthetic feature generation or visual comparisons.
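A rough sketch of this neighbor-borrowing idea, with placeholder attribute and feature tensors rather than the paper's actual pipeline, could look like this:

```python
import torch
import torch.nn.functional as F

k = 3
seen_attr = torch.randn(10, 85)            # attribute vectors of 10 seen classes (placeholder)
seen_feat_means = torch.randn(10, 512)     # mean visual features per seen class (placeholder)
unseen_attr = torch.randn(85)              # attribute vector of one unseen class (placeholder)

sims = F.cosine_similarity(unseen_attr.unsqueeze(0), seen_attr)    # semantic similarity to each seen class
top_sims, top_idx = sims.topk(k)
weights = torch.softmax(top_sims, dim=0)                           # semantic proximity weights

# Pseudo-prototype for the unseen class: weighted blend of its neighbors' feature means.
unseen_prototype = (weights.unsqueeze(1) * seen_feat_means[top_idx]).sum(dim=0)
print(unseen_prototype.shape)   # torch.Size([512])
```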

Figure 12: Results obtained by SDGZSL.

On the AWA2 benchmark, SNB achieved 72.1% accuracy on unseen classes (+8.3% over prior instance-borrowing methods) while maintaining 64.9% accuracy on seen classes in generalized ZSL settings. This demonstrates how instance-based methods can effectively leverage semantic relationships without requiring visual similarity assumptions or synthetic data pipelines.

3. Generative Approaches

Generative approaches using GANs or VAEs are ideal when you need robust performance across diverse unseen classes, as they can synthesize training examples for better generalization.

For example, ZeroDiff (ICLR 2025) introduces a generative zero-shot learning framework designed to overcome the problem of spurious visual-semantic correlations, which become especially pronounced when training data is scarce. ZeroDiff employs a diffusion-based generative model that augments limited seen-class data by generating a wide variety of noised samples. 

This diffusion augmentation helps prevent overfitting and enables the generative model to better capture the underlying distribution of visual features, even with just a fraction of the original training data.

Figure 13: Standard GAN-based approach vs ZeroDiff.

A key innovation in ZeroDiff is its use of supervised-contrastive (SC) representations, which dynamically encode the distinctive characteristics of each training sample. These representations guide the generation of visual features, ensuring that the synthesized data remains semantically meaningful and diverse.

Figure 14: Training pipeline of DFG.

Additionally, ZeroDiff employs multiple discriminators, each evaluating generated features from a different perspective - predefined semantics, SC-based representations, and the diffusion process itself. These discriminators are trained using a Wasserstein-distance-based mutual learning approach, which further solidifies the alignment between visual and semantic information. As a result, ZeroDiff achieves robust zero-shot learning performance, outperforming previous generative methods and maintaining high accuracy even when only 10% of the seen-class training data is available.

Figure 15: Performance comparison on limited training data.


Techniques and Types of Zero-Shot Learning

Zero-shot learning is a problem setup, not a single algorithm. 

Over the years, researchers have devised various strategies to enable models to make zero-shot predictions. Here we outline the major approaches, each with its strengths and typical use cases. 

Often, modern ZSL systems combine elements of these methods.

Table 3: Types of Zero-Shot Learning.

| Type | Key Techniques | Strengths | Ideal Use Cases |
|---|---|---|---|
| Attribute-Based | Handcrafted attribute vectors | High interpretability | Fine-grained classification |
| Semantic Embeddings | Word2Vec, CLIP, visual clustering | Scalability to large class spaces | Cross-modal retrieval |
| Generative | GANs, diffusion models, evolutionary search | Robust to data scarcity | Medical imaging, rare object detection |

Attribute-Based Methods

Attribute-based zero-shot learning (AZSL) represents one of the earliest and most interpretable approaches to ZSL. It relies on human-defined semantic attributes (mid-level features like "striped," "winged," or "aquatic") to bridge seen and unseen classes.

These attributes act as a shared vocabulary that connects known and unknown categories through logical combinations.

Figure 16: Attributes examples.

How It Works:

  1. Training Phase: Models learn to detect predefined attributes in labeled examples of seen classes (e.g., recognizing "stripes" in tiger images). Each class is represented as a binary attribute vector (e.g., zebra = [stripes: 1, quadruped: 1, aquatic: 0]).
  2. Inference Phase: For unseen classes, the model predicts attributes in test instances and matches them to the target class’s predefined attribute profile. For example, detecting "yellow + striped + flying" triggers a "bee" prediction, even if bee images were absent during training (see the toy sketch below).
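A toy sketch of this attribute-matching step, with made-up attribute profiles and a hard-coded stand-in for the attribute detector's output, looks like this:

```python
import numpy as np

# Predefined binary attribute signatures: [striped, quadruped, aquatic, black-and-white]
class_signatures = {
    "zebra":   np.array([1, 1, 0, 1]),
    "tiger":   np.array([1, 1, 0, 0]),
    "dolphin": np.array([0, 0, 1, 0]),
}

# Pretend output of an attribute detector trained on *seen* classes only.
predicted_attributes = np.array([0.9, 0.8, 0.1, 0.95])

def score(signature, prediction):
    # Higher when predicted probabilities agree with the class's 0/1 signature.
    return np.sum(signature * prediction + (1 - signature) * (1 - prediction))

best = max(class_signatures, key=lambda c: score(class_signatures[c], predicted_attributes))
print(best)   # -> "zebra"
```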

Strengths:

  • Interpretability: Decisions are traceable to detected attributes (e.g., "zebra" prediction stems from stripes and four legs).
  • Fine-Grained Recognition: Effective for distinguishing subtle differences, such as bird species with overlapping features.

Limitations:

  • Fixed Attribute Sets: Fails for classes with dynamic traits (e.g., American Goldfinch’s seasonal plumage changes)
  • Labeling Costs: Requires domain experts to define and annotate attributes (e.g., 85 attributes for 50 animal classes in the AWA dataset)
  • Missing Attributes: Cannot recognize novel attributes outside the predefined vocabulary (e.g., "bioluminescent" in marine organisms)
  • Hubness Problem: Attribute clusters bias predictions (e.g., overpredicting "striped" classes in wildlife datasets)
Figure 17: Attributes identified by human annotators vs. semantic embeddings discovered via clustering (source: https://openaccess.thecvf.com/content/CVPR2022/papers/Xu_VGSE_Visually-Grounded_Semantic_Embeddings_for_Zero-Shot_Learning_CVPR_2022_paper.pdf).

To address these limitations, recent work combines attribute-based reasoning with automated techniques:

  • VGSE Network: Discovers latent visual attributes via clustering, achieving 76.3% accuracy on CUB without manual annotations.
Figure 18: Patch Clustering and Class Relation Modules.
  • Attribute Correlation Modeling: Frameworks like ZSL-KG encode relationships between attributes (e.g., "winged → flies") to improve generalization.
Figure 19: Attribute correlation module.
  • Semi-Supervised Annotation: Tools like CLIP-Attr generate candidate attributes from web-crawled text, reducing labeling effort by 60%.
Figure 20: Semi-Supervised Annotation.

Semantic Embedding & Similarity-Based Methods

Semantic embedding methods have revolutionized zero-shot learning (ZSL) by automating the alignment of visual and textual modalities. Instead of relying on handcrafted attributes, these approaches project images and class descriptions into a shared high-dimensional space where similarity metrics drive predictions.

Core Principles
  1. Shared Embedding Space:
    • Visual features (e.g., from CNNs) and semantic descriptors (e.g., Word2Vec, CLIP text embeddings) are mapped to a common space.
    • Distance metrics like cosine similarity measure compatibility between image embeddings and class prototypes.
  2. Contrastive Learning: Models like CLIP train image and text encoders jointly, maximizing similarity for matched pairs (e.g., "zebra" image + "striped equine" text) while minimizing it for mismatched pairs. This creates a unified space where visual and semantic concepts align naturally.
  3. Automated Semantic Discovery: Frameworks like VGSE eliminate manual attribute engineering by clustering image patches into visual concepts (e.g., "striped texture") and linking them to class names via word embeddings.
Key innovations:

1. CLIP (Contrastive Language-Image Pretraining):

Trained on 400M image-text pairs, CLIP achieves 98.7% zero-shot accuracy on tasks like Imagenette by reformatting class labels into prompts (e.g., "a photo of a {label}"). Its vision-text alignment enables open-world recognition without fine-tuning.
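Here is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face transformers API; the checkpoint, prompt template, label set, and image path are illustrative choices.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["zebra", "horse", "tiger"]
prompts = [f"a photo of a {label}" for label in labels]   # reformat labels into prompts
image = Image.open("test.jpg")                            # any local image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)          # image-text similarity -> probabilities
print(dict(zip(labels, probs.squeeze(0).tolist())))
```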

Figure 21: CLIP.

2. DeViSE (Deep Visual-Semantic Embedding):

Pioneered cross-modal alignment by fine-tuning CNNs to predict Word2Vec vectors of class names, enabling 4.5× better generalization to unseen classes compared to traditional classifiers.

Figure 22: DeViSE.
Table 4: Semantic Embedding and Similarity-Based Methods summary.

| Method | Key Technique | Advantages | Limitations |
|---|---|---|---|
| CLIP | Vision-text contrastive pretraining | No fine-tuning needed; scales to 1M+ classes | Sensitive to prompt engineering |
| VGSE | Automated visual cluster discovery | Eliminates attribute engineering; interpretable | Requires class-name word embeddings (e.g., Word2Vec) |
| DeViSE | CNN-to-Word2Vec projection | Early proof of cross-modal transfer | Limited to fixed vocabularies |

Generative Methods (Synthesizing Data for Unseen Classes)

Figure 23: Generative Methods Overview.

Generative methods address zero-shot learning (ZSL) by synthesizing artificial examples of unseen classes, effectively converting the problem into a supervised learning task. 

These approaches use auxiliary information - such as textual descriptions, class attributes, or semantic embeddings - to guide the generation of plausible visual features or images for unseen categories.

Core Methodology
  1. Data Synthesis: Models like GANs, VAEs, or diffusion networks generate synthetic samples for unseen classes conditioned on their semantic descriptions. For example, a zebra might be synthesized using the prompt "striped horse-like animal."
  2. Supervised Training: The generated data trains a classifier to recognize both seen and unseen classes, mitigating bias toward seen categories in generalized ZSL (GZSL); see the sketch below.
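The sketch below illustrates this generate-then-classify recipe. The "generator" here is an untrained placeholder standing in for a trained conditional GAN, VAE, or diffusion model; in practice the generator is trained on seen classes first, and real seen-class features are usually mixed into the classifier's training set.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

feat_dim, sem_dim, noise_dim = 512, 85, 64

# Semantic vector + noise -> visual feature. In a real system this is trained first.
generator = nn.Sequential(
    nn.Linear(sem_dim + noise_dim, 1024), nn.ReLU(), nn.Linear(1024, feat_dim)
)

unseen_sem = torch.randn(2, sem_dim)            # semantic vectors for 2 unseen classes (placeholder)
synthetic_feats, synthetic_labels = [], []
for cls_id, sem in enumerate(unseen_sem):
    noise = torch.randn(200, noise_dim)
    cond = sem.expand(200, -1)                  # condition every sample on the class semantics
    with torch.no_grad():
        synthetic_feats.append(generator(torch.cat([cond, noise], dim=1)))
    synthetic_labels += [cls_id] * 200

# Step 2: train an ordinary supervised classifier on the synthesized features.
X = torch.cat(synthetic_feats).numpy()
y = synthetic_labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(torch.randn(1, feat_dim).numpy()))
```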
Key Generative Approaches:
1. Generative Adversarial Networks (GANs)
  • Mechanism: A generator creates synthetic features from semantic vectors, while a discriminator distinguishes real vs. fake samples.
  • Example: The Feature Generating Networks (CVPR 2018) used Wasserstein GANs to synthesize CNN features for unseen classes, achieving 68.3% accuracy on CUB - 19% higher than non-generative methods.
  • Advantages: High-fidelity feature generation; effective for fine-grained tasks.
  • Limitations: Prone to mode collapse; unstable training dynamics.
2. Variational Autoencoders (VAEs)
  • Mechanism: Encodes seen-class data into latent distributions, then decodes unseen-class features using semantic vectors.
  • Example: GenZSL (2024) introduced an inductive VAE that generates samples for unseen classes by extrapolating from similar seen categories, improving diversity via class-specific attention. Achieved 74.2% accuracy on AWA2 with CLIP text embeddings.
  • Advantages: Probabilistic framework; better at capturing data variability.
  • Limitations: Blurry outputs compared to GANs.
3. Diffusion Models
Figure 24: Overview of the StableDiffusion model.
  • Mechanism: Iteratively denoises random vectors into samples aligned with semantic descriptions.
  • Example: ZeroDiff (ICLR 2025) combined diffusion augmentation with supervised-contrastive learning to solidify visual-semantic correlations. Achieved 76.3% accuracy on CUB with 90% less training data.
  • Advantages: High sample diversity; robust to data scarcity.
  • Limitations: Computationally intensive; requires careful noise scheduling.
Figure 25: Examples of images generated using StableDiffusion.

How to Evaluate Zero-Shot Learning Models

Evaluating ZSL models requires metrics that go beyond standard accuracy to reflect performance on unseen classes.

Key metrics include:
  • Top-K Accuracy: Checks if the true label is among the top K predictions. Useful for tasks with similar classes, like species recognition.
  • Harmonic Mean: Balances accuracy on seen and unseen classes in generalized ZSL, avoiding bias toward familiar categories.
  • Mean Class Accuracy: Averages accuracy per class to ensure rare and common classes are treated equally, helpful for imbalanced datasets.
  • AUC (Area Under the Curve): Captures the trade-off between true and false positives, especially valuable in domains like medical diagnostics.
Table 5: Metrics for evaluating zero-shot learning models.

| Metric | Use Case | Advantage |
|---|---|---|
| Top-1 Accuracy | Standard ZSL (unseen classes only) | Simple interpretation |
| Harmonic Mean | Generalized ZSL | Balances seen/unseen bias |
| Mean Class Acc. | Class-imbalanced datasets | Fair evaluation of rare classes |
| AUC | Medical/security applications | Robust to skewed class distributions |

No single metric captures all aspects of ZSL performance. Researchers often report multiple metrics to provide a comprehensive view, with Top-1 accuracy and harmonic mean being the most widely adopted benchmarks.
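For reference, the harmonic mean reported in generalized ZSL combines per-class accuracy on seen and unseen classes so that neither can dominate the score:

```python
def gzsl_harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    # H = 2 * S * U / (S + U); low accuracy on either split drags H down.
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

print(gzsl_harmonic_mean(0.70, 0.50))   # 0.583..., lower than the plain average of 0.60
```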

Applications of Zero-Shot Learning in Computer Vision

Below, we highlight some of the most impactful applications of ZSL in Computer Vision.

  • Image Classification with Unseen Categories: ZSL enables models to recognize new categories without labeled images by relying on semantic descriptions (e.g., attributes or text). This is useful in fields like wildlife monitoring or medical imaging, where new or rare classes may not have training data.

Figure 26: Visualizing Word to Image Attention.
  • Zero-Shot Object Detection: Extends ZSL to localizing unseen objects using semantic embeddings. Helpful in scenarios like autonomous driving, where models must detect unfamiliar items (e.g., “scooter”) based on descriptions.
Figure 27: Zero-Shot Object Detection.
Pro tip: Read our Introduction to ViT (Vision Transformers): Everything You Need to Know.
  • Image Retrieval and Captioning: Models like CLIP use joint image-text embeddings to retrieve or caption images based on unseen queries. For example, searching for "red sporty two-seater car" retrieves relevant images without specific training.
Figure 28: Image Retrieval and Captioning.
  • Generalized Zero-Shot Learning (GZSL) in the Wild: Real-world tasks often mix seen and unseen classes. GZSL evaluates models on both, using techniques like novelty detection and calibration to avoid bias toward known classes.
  • Visual Question Answering (VQA): ZSL enables models to answer questions about unfamiliar concepts in images by combining visual features with language understanding.
Figure 29: Visual Question Answering.
  • Generative Visual Tasks and Style Transfer: Text-to-image models like DALL·E show zero-shot abilities by generating images of novel concepts from descriptions, combining known elements in creative ways.
Figure 30: Generative Visual Tasks and Style Transfer.
Pro tip: Check out The Engineer's Guide to Large Vision Models.

Key Research Milestones and Future Directions

In this section, we briefly note some landmark papers and what they contributed, as well as where the field is heading.

  • 2008–2010: The idea of zero-shot learning was first explored under different names: Dataless Classification in NLP (Chang et al., 2008) and Zero-Data Learning in CV around the same time​. The term “zero-shot learning” itself was popularized by a NeurIPS 2009 paper by Palatucci et al.

  • 2013–2015: These years saw a growth of attribute-based models (e.g., Farhadi et al. on “describing objects by attributes” and Lampert et al. on “Attribute-based classification for unseen classes”). Also, word embeddings like Word2Vec (2013) gave a new way to obtain semantic vectors for class labels, so papers started using vector representations of class names instead of manually defined attributes. This automated much of the knowledge extraction (e.g., Socher et al. 2013 mapped images into a word-embedding space to perform zero-shot image classification).

  • 2017: Xian et al.’s Comprehensive Study – Xian, Lampert, et al. published “Zero-Shot Learning – The Good, the Bad and the Ugly”​​, which was a thorough evaluation of ZSL methods. They introduced consistent splits for seen/unseen classes to avoid test contamination, and emphasized generalized zero-shot learning as a more realistic scenario.

  • 2018–2020: With GANs becoming popular, researchers like Xian et al. started using GANs to generate unseen class features. Models like f-VAEGAN-D2 combined VAEs and GANs to set new state-of-the-art on ZSL benchmarks​. Work also expanded ZSL into new areas like zero-shot action recognition (for videos) and zero-shot semantic segmentation (labeling pixels of unseen object classes in an image).

  • 2021: OpenAI’s CLIP​ demonstrated that a single model can achieve impressive zero-shot classification on ImageNet and other datasets by learning from natural language supervision at scale. At the same time, GPT-3 (Brown et al.) showed the enormous zero-shot and few-shot capabilities of language models, sparking the prompt engineering revolution in NLP​.

  • 2022 and beyond: The line between zero-shot and general AI capabilities is blurring. Large foundation models (like GPT-4, CLIP derivatives, multi-modal transformers) are increasingly capable of what we might call “zero-shot learning” – though one could argue they implicitly learned those tasks during pretraining. Research is focusing on:
    • Prompt engineering and tuning
    • Improving reasoning
    • Continual Learning with Zero-Shot

Conclusion

Zero-shot learning has moved from theory to real-world impact, enabling models to handle new tasks without labeled data by leveraging semantic understanding. It’s already transforming fields like vision, language, and multimodal AI.

While challenges like bias, semantic gaps, and reliance on quality descriptions remain, advances in foundation models and alignment techniques are rapidly improving ZSL’s reliability.

As AI systems grow more adaptable and general, zero-shot learning will be key to building flexible, data-efficient solutions for the future.

Get Started with Lightly

Talk to Lightly’s computer vision team about your use case.
Book a Demo
