This guide explores few-shot learning—a method where models learn from just a few examples. It explains how FSL differs from traditional supervised learning, the nuances between few-shot, one-shot, and zero-shot learning, and how techniques like meta-learning and prompting enable rapid generalization.
What is few-shot learning?
Few-shot learning (FSL) is a machine learning approach where models learn to make accurate predictions given only a very small number of labeled examples per class. In essence, the model must generalize to new classes or tasks using just a handful of training samples, mimicking how humans can learn from only a few examples. Because it needs only a minimal amount of labeled data, few-shot learning lets models adapt to new tasks swiftly.
Each data sample in few-shot learning is crucial as it serves as an anchor or point of comparison to determine similarities and perform classifications.
How is few-shot learning different from traditional supervised learning?
In traditional supervised learning, models are trained on hundreds or thousands of labeled examples for each class, whereas few-shot learning uses only a few (e.g. 1, 5, or 10) examples per class. This makes FSL feasible in scenarios where data collection or labeling is expensive or impractical (e.g. medical images or rare classes), but it also means FSL models must avoid overfitting and leverage prior knowledge to succeed.
What’s the difference between few-shot, one-shot, and zero-shot learning?
Few-shot learning typically means a handful of training examples per class, often K = 2 to 5, although the exact cutoff varies by benchmark and application. One-shot learning is a special case of FSL with just one example per class. Zero-shot learning means the model gets no training examples of some target classes: it must recognize new classes using zero labeled examples, often by relying on other information like class descriptions or pre-existing knowledge. One-shot is essentially an extreme case of few-shot, while zero-shot is a distinct problem requiring different techniques (like using semantic embeddings or prompts).
How does few-shot learning work?
Few-shot learning algorithms usually rely on prior experience or knowledge to compensate for limited data. In practice, many FSL methods use meta-learning (“learning to learn”) across many tasks: a model is trained on a variety of small training tasks so that it can quickly adapt to a new task with only a few examples. During training, episodes are formed with an N-way K-shot paradigm (see below), using a small support set of examples and evaluating on a query set to mimic few-shot conditions. This episodic training teaches the model to rapidly generalize from few samples. Other approaches involve transfer learning (fine-tuning a pre-trained model on the few examples) or metric learning (learning an embedding space where samples can be compared by a distance function).
Why is few-shot learning important?
Few-shot learning is important because in many real-world cases, collecting large labeled datasets is difficult, costly, or slow. FSL enables ML models to handle scenarios with limited training data, such as identifying a rare disease from only a handful of medical images, recognizing a newly discovered species of animal with just a few photos, or personalizing a vision or language model to a new user’s data quickly. By allowing models to learn from few labeled samples, FSL can dramatically reduce data collection and annotation costs and enable rapid learning for new classes or tasks that were not present in the original training data.
How do large language models use few-shot learning?
Large Language Models (LLMs) like GPT-3 and GPT-4 have demonstrated in-context few-shot learning, often called few-shot prompting. Instead of updating the model’s parameters, we provide a prompt with a few example input-output pairs (for instance, a few sentences and their sentiment labels) and then ask the model to perform the task on a new input. The model leverages its prior knowledge (from training on massive text corpora) to generalize from those examples and produce an answer.
With the advent of foundation models and agents, the motivation for training full transfer learning models has been decreasing. Instead, lightweight fine-tuning and few-shot learning are becoming more relevant: by tuning a model on a small, high-quality dataset, one can quickly adapt a state-of-the-art off-the-shelf model.
Few-shot learning (FSL) is a machine learning technique that enables a model to generalize to new tasks or classes using only a few labeled examples. In a typical few-shot setting, we might have only K examples of each new class (where K is small – e.g. 1, 2, or 5). This is in stark contrast to standard supervised learning, which usually needs a large dataset of examples per class.
Formally, few-shot learning is often described in the context of “N-way K-shot” or simply K-shot classification: the model must discriminate between N classes, given just K training examples of each class (the support set), and then classify new examples (the query set) accordingly.
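To make the N-way K-shot setup concrete, here is a minimal sketch of how one episode could be sampled from a labeled dataset. The function name `sample_episode`, its parameters, and the toy label list are illustrative assumptions, not part of any particular library.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (support, query) lists of (index, label) pairs for one episode."""
    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough examples for both sets are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = [], []
    for cls in rng.sample(eligible, n_way):
        # Draw without replacement so support and query never overlap.
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, cls) for i in picked[:k_shot]]
        query += [(i, cls) for i in picked[k_shot:]]
    return support, query

# Hypothetical toy dataset: 5 classes with 10 examples each (labels only).
toy_labels = [c for c in "ABCDE" for _ in range(10)]
support, query = sample_episode(
    toy_labels, n_way=3, k_shot=2, n_query=4, rng=random.Random(0)
)
```

With `n_way=3`, `k_shot=2`, and `n_query=4`, each episode holds 6 support examples and 12 query examples drawn from 3 classes.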
A typical fine-tuning workflow trains part of the model (typically the last few layers) on a small dataset. Compared to pretraining datasets, fine-tuning datasets are typically more than 100x smaller. However, as foundation models grow larger, it is becoming harder to fine-tune them even on moderately sized datasets.
Few-shot learning has emerged as a viable alternative in which we attempt to tune models with even fewer samples (10-50 per class) while still achieving convincing performance on most tasks (termed few-shot tasks).
Pro Tip: Read Pretraining vs. Fine-tuning: What Are the Differences?
At the far end of this spectrum is zero-shot learning, where we evaluate a model on new tasks and classes without tuning it on any examples of those classes. For example, suppose we take a model pre-trained on ImageNet and evaluate it on classifying pictures of various types of faults in cement tubes.
This is a case of zero-shot evaluation, since we measure the model's classification performance without training it on any images of cement tubes.
Moreover, with the shift from fully supervised pre-training to self-supervised and semi-supervised contrastive learning, models have become increasingly capable of zero-shot adaptation to novel tasks.
However, these adaptations have mostly been limited to simpler tasks such as classification. For more critical use cases, few-shot fine-tuning is still needed.
The most common way almost all of us perform few-shot learning is by giving ChatGPT a few examples in the hope that it will replicate the pattern. For instance, you might ask ChatGPT:
“Given these input-output pairs, classify the given statement as positive or negative
Input: This is terrible. Output: Negative
Input: This is so good. Output: Positive
Input: This doesn’t work. Output:”
This involves placing a few examples in the instruction itself so that the model can identify the pattern from the given data points and apply it to the new input.
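A prompt like the one above can be assembled programmatically. The sketch below is one simple way to do it; `build_few_shot_prompt` is a hypothetical helper written for illustration, and it only builds the text (it does not call any model API).

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, labeled examples, and an unlabeled input."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f"Input: {text} Output: {label}")
    # The final line leaves the output blank for the model to complete.
    lines.append(f"Input: {new_input} Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Given these input-output pairs, classify the given statement as positive or negative",
    [("This is terrible.", "Negative"), ("This is so good.", "Positive")],
    "This doesn't work.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap in different demonstrations and compare how the model responds.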
A very important use case of few-shot learning is tuning a model to recognize very rare classes from a few selected samples, with little information and computation. This can be crucial for fine-grained visual classification (for example, identifying a specific disease on plant leaves) or medical imaging, where labeled data for many classes is scarce or expensive to annotate.
Some important characteristics of few-shot learning include:
Take classification as an example. Typically, we aim to approximate a function that accurately classifies the vast majority of data points in the training dataset. This is generally possible when the dataset has sufficient training samples.
In few-shot classification, however, this becomes a much harder problem: with far fewer examples, it is much harder to attain high performance on test sets within given generalization bounds.
Few-shot learning is especially significant in the field of computer vision, where obtaining large labeled data samples can be extremely labor-intensive and costly.
Some reasons why FSL is valuable in computer vision:
So, how do we actually get a model to learn from a few examples?
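Over the years, a variety of approaches have been developed for few-shot learning. We can categorize the key approaches into a few broad strategies, each exploiting a different idea: meta-learning, transfer learning with fine-tuning, metric learning, and data augmentation or generation.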
Many practical algorithms combine elements of these strategies. Below, we outline each approach.
Unlike traditional models that learn specific tasks, meta-learning algorithms are designed to acquire the learning process itself, enabling rapid adaptation to novel tasks with minimal data.
Instead of optimizing for performance on a single task, meta-learning optimizes for the ability to learn efficiently across a distribution of tasks. This is typically achieved through a two-tiered learning process: an outer loop that learns how to learn (the meta-learning phase) and an inner loop that applies this learning strategy to specific tasks. This nested optimization allows models to extract task-agnostic learning strategies that transfer effectively to unseen problems.
Model-Agnostic Meta-Learning (MAML), introduced by Finn et al., exemplifies this approach by finding parameter initializations that can be rapidly adapted to new tasks with just a few gradient updates. MAML trains the model parameters to serve as a starting point from which minimal fine-tuning can yield strong performance across diverse tasks.
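The inner and outer loops can be illustrated on a deliberately tiny problem. The sketch below assumes a one-parameter model `y = w * x` and a family of toy tasks `y = a * x` with a hidden slope `a`; all names, learning rates, and the task family are illustrative choices, not the setup from the MAML paper. Because the model has a single parameter, the gradient through the inner step can be written out by hand with the chain rule.

```python
import random

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad_and_curvature(w, xs, ys):
    """First and second derivative of the MSE with respect to w."""
    n = len(xs)
    grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
    curvature = sum(2 * x * x for x in xs) / n
    return grad, curvature

def sample_task(rng, k):
    """Toy task y = a * x with hidden slope a; returns (support, query)."""
    a = rng.uniform(1.0, 3.0)
    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return xs, [a * x for x in xs]
    return draw(), draw()

def adapt(w, xs, ys, inner_lr):
    """Inner loop: one gradient step on the support set."""
    grad, _ = mse_grad_and_curvature(w, xs, ys)
    return w - inner_lr * grad

def maml_train(steps=2000, k=5, inner_lr=0.5, outer_lr=0.1, seed=0):
    rng = random.Random(seed)
    w = 0.0  # meta-parameter: the shared initialization
    for _ in range(steps):
        (xs_s, ys_s), (xs_q, ys_q) = sample_task(rng, k)
        _, curv_s = mse_grad_and_curvature(w, xs_s, ys_s)
        w_adapted = adapt(w, xs_s, ys_s, inner_lr)
        grad_q, _ = mse_grad_and_curvature(w_adapted, xs_q, ys_q)
        # Chain rule through the inner step:
        # d(w_adapted)/dw = 1 - inner_lr * curvature of the support loss.
        meta_grad = grad_q * (1.0 - inner_lr * curv_s)
        w -= outer_lr * meta_grad  # outer loop: update the initialization
    return w

w_init = maml_train()
```

In this toy setting the learned initialization settles near the middle of the slope range, so a single inner step on a new task's support set moves it most of the way to that task's slope. Real implementations rely on automatic differentiation rather than a hand-derived second-order term.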
Transfer learning with fine-tuning typically begins with a pretrained model that has already learned robust feature representations from a large dataset. For example, models like ResNet or BERT, trained on ImageNet or massive text corpora respectively, develop comprehensive representations of their domains. These models encode useful abstractions, such as edges and textures in vision or syntactic patterns and semantic concepts in language.
Rather than training the entire network from scratch on limited examples, which would likely lead to overfitting, fine-tuning strategically updates select portions of the model while preserving the knowledge embedded in the other layers. This technique frequently involves "freezing" early layers that capture universal features while updating later layers to specialize in the target task.
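The freeze-and-update pattern can be sketched without a deep learning framework. In the toy example below, a fixed function stands in for the frozen early layers and only a small logistic-regression head is trained on four labeled examples. The backbone weights, the data, and all function names are hypothetical stand-ins chosen for illustration; a real workflow would load an actual pretrained network.

```python
import math

# Hypothetical "pretrained" backbone weights: hard-coded and never updated.
BACKBONE_W = ((1.0, -1.0), (0.5, 0.5))

def backbone(x):
    """Frozen feature extractor: a fixed linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, lr=0.5, steps=200):
    """Fit a logistic-regression head on frozen features by gradient descent."""
    # Features are computed once because the backbone never changes.
    feats = [(backbone(x), y) for x, y in examples]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in feats:
            err = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) - y
            grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
            grad_b += err
        # Only the head parameters receive updates.
        weights = [w - lr * g / len(feats) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(feats)
    return weights, bias

def predict(x, weights, bias):
    f = backbone(x)
    return int(sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) >= 0.5)

# Four labeled examples: two per class.
few_examples = [((2.0, 0.0), 1), ((1.5, -0.5), 1), ((0.0, 2.0), 0), ((-0.5, 1.5), 0)]
head_w, head_b = train_head(few_examples)
```

Training only the head keeps the number of learned parameters small (three here), which is the main reason this pattern is less prone to overfitting on a handful of examples.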
Metric-learning approaches are based on the simple task of learning a distance function over data samples, and they provide an easy-to-deploy solution with fast inference. Continuing with the example of few-shot image classification: given two image samples, we aim to minimize the distance between them if they belong to the same class and maximize it if they belong to different classes.
Even this simple training objective has proven to work well in the literature. It is, however, known to adapt less well in dynamic environments, and its cost grows linearly with the size of the test set.
In the case of images, the raw data is very high-dimensional. So rather than optimizing the distance between raw images (largely unstructured data), we optimize the distance between embeddings (a structured latent space) generated by pre-trained image encoders such as ResNet50.
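One common way to express the pairwise objective described above is a contrastive loss on embedding pairs. The sketch below is an assumption about how such a loss might look, not necessarily the formulation any specific method uses: same-class pairs are penalized by their squared distance, and different-class pairs are penalized only when they are closer than a chosen `margin`.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pairwise contrastive loss on two embeddings."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        # Pull same-class embeddings together.
        return d ** 2
    # Push different-class embeddings apart until they clear the margin.
    return max(0.0, margin - d) ** 2
```

The margin keeps the "maximize" half of the objective bounded: once two different-class embeddings are far enough apart, they stop contributing to the loss.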
Distance measures help us understand how “close” or “far apart” two points are in an embedding space. The choice of distance measure can significantly impact how our models learn and perform. Common choices include Euclidean distance, Manhattan distance, and cosine distance.
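To see how the choice of measure can change a prediction, the sketch below implements three widely used distances (Euclidean, Manhattan, and cosine, chosen here as illustrative examples) and a nearest-neighbor classifier over a support set of embeddings. The two-dimensional embeddings and the labels are made-up toy values.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def classify(query, support, distance):
    """Return the label of the support embedding closest to the query."""
    return min(support, key=lambda item: distance(query, item[0]))[1]

# Toy support set: one embedding per class, with very different norms.
support_set = [([1.0, 0.0], "cat"), ([0.0, 10.0], "dog")]
query_embedding = [0.5, 2.0]

by_euclidean = classify(query_embedding, support_set, euclidean_distance)
by_cosine = classify(query_embedding, support_set, cosine_distance)
```

Here the query points in roughly the same direction as the "dog" embedding but sits closer to the "cat" embedding in absolute terms, so the cosine and Euclidean measures disagree. Normalizing embeddings before comparison is one way to make the two behave more alike.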
Traditional data augmentation techniques apply domain-specific transformations to existing examples, creating variations of the same class that preserve class identity while introducing meaningful diversity. In computer vision, for instance, these transformations include rotation, scaling, cropping, and color jittering, which simulate natural variations in object appearance. In natural language processing, operations like random insertion, deletion, and word-order swapping introduce linguistic variation while largely maintaining semantic content.
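Two of the text operations mentioned above, random deletion and word-order swapping, can be sketched in a few lines. The function names, the deletion probability, and the example sentence are illustrative assumptions.

```python
import random

def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)  # copy so the input is not modified
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(0)
sentence = "the leaf shows early signs of rust".split()
augmented = [random_deletion(sentence, 0.2, rng), random_swap(sentence, 1, rng)]
```

Each call produces a slightly different variant of the same labeled sentence, so a handful of originals can be expanded into a larger training set. Aggressive settings can change the meaning, so the perturbation strength is usually kept small.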
For few-shot learning, conditional GANs (Generative Adversarial Networks) can generate diverse examples of rare classes by learning from the limited number of examples. Similarly, Variational Autoencoders (VAEs) learn a continuous latent space representation of data that can be sampled to create novel examples, effectively interpolating between known samples.
These models not only generate novel examples but do so in ways that specifically aid learning from fewer samples. Diffusion models, which have recently gained prominence in image generation tasks, offer another promising direction for few-shot learning.
These models, which gradually denoise random Gaussian noise into coherent data samples, can be fine-tuned on small datasets to generate class-specific examples. Their ability to capture complex data distributions makes them particularly well-suited for creating diverse, high-quality synthetic data in low-resource scenarios.
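Few-shot, one-shot, and zero-shot learning are related concepts often mentioned together. In brief: few-shot learning uses a handful of labeled examples per class, one-shot learning uses exactly one, and zero-shot learning uses none, relying instead on auxiliary information such as class descriptions or pre-existing knowledge.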
Few-shot learning unlocks a variety of applications in computer vision and related fields. Here are some notable use cases, along with how few-shot techniques are applied in each:
Cross-Modality Applications: Few-shot techniques enable systems to bridge gaps between different data modalities, such as transferring knowledge from RGB images to infrared, depth, or medical imaging with minimal new examples.
Few-shot learning is a powerful paradigm, but it also comes with several challenges and limitations that researchers and practitioners are actively working to address. Understanding these not only helps set appropriate expectations, but also points to current research trends aiming to overcome them.
The most fundamental issue is the inherent statistical uncertainty arising from limited samples, making it difficult to distinguish between meaningful patterns and random variations. This uncertainty often manifests as high variance in model performance, where success depends heavily on which specific examples are included in the support and query sets. Many current approaches also rely on strong inductive biases that work well for specific problem structures but fail when these assumptions are violated.
Additionally, the computational intensity of meta-learning approaches presents practical barriers to deployment. Few-shot systems also typically assume clean, well-curated support sets, whereas practical applications often involve noisy or ambiguous examples. Moreover, current approaches often neglect the active learning dimension of real-world few-shot scenarios, where intelligently selecting which examples to label can significantly improve performance.
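Because results depend heavily on which examples land in the support and query sets, a common practice is to evaluate over many randomly sampled episodes and report the mean accuracy with a confidence interval. The sketch below assumes per-episode accuracies have already been computed and uses a normal approximation for the 95% interval; `summarize_episodes` is a hypothetical helper name.

```python
import math
import statistics

def summarize_episodes(accuracies):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation
    # Normal approximation; reasonable when there are many episodes.
    half_width = 1.96 * std / math.sqrt(len(accuracies))
    return mean, half_width

# Hypothetical per-episode accuracies, for illustration only.
mean_acc, ci = summarize_episodes([0.62, 0.70, 0.55, 0.68, 0.74, 0.59])
```

A wide interval is a sign that the model's performance is sensitive to the particular support examples, which is exactly the variance issue described above.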