Vision Language Models (VLMs) are at the forefront of AI, combining the power of Large Language Models (LLMs) and robust Vision Encoders to understand and generate text from images. This article explores the evolution of key VLM architectures, from early contrastive learning to advanced instruction-tuned models, highlighting their design choices and capabilities.
VLMs are designed to take both images and text as input and generate text as output. Their success stems from smartly combining powerful pre-trained unimodal models (separate vision and language models). This allows them to leverage existing representation learning capabilities to tackle complex multimodal tasks, pushing boundaries in areas like image generation and human-computer interaction.
At a high level, the article traces four threads: early VLMs that relied on contrastive training; later methods that keep powerful pre-trained vision encoders and LLMs frozen and focus solely on aligning the two modalities; foundation models, trained on broad data at scale so they can adapt to many downstream tasks; and instruction tuning, a breakthrough carried over from LLMs that replaces the fixed interfaces of early VLMs with models that follow user commands.
Text foundation models pre-trained on large unlabelled web-scraped datasets have recently become popular. At the same time, strong vision models have emerged that perform well on most tasks, such as segmentation, detection, and classification, while generalising well and adapting to new datasets and tasks. However, there is also great value in jointly training a model on both vision and text data. The success of LLMs and vision models led to a flurry of research in Vision Language Models (VLMs), exemplified by DALL-E 2 (Ramesh et al., 2022) and Flamingo (Alayrac et al., 2022), and to the development and release of new tasks and datasets for evaluation.
As of August 2024, breakthroughs like GPT-4 (OpenAI, 2023), PaLM 2 (Google, 2023) and LLaVA (Liu et al., 2023) continue to push boundaries, spurring the development of novel evaluation tasks and datasets. This convergence of vision and language promises transformative applications across industries, from advanced image generation to intuitive human-computer interaction. This article covers some fundamental VLM architectures leading up to current SOTA techniques and ideas.
Vision Language Models (VLMs) take images and text as inputs and output text. The success of VLMs relies on two prior developments: large pre-trained language models and strong pre-trained vision encoders.
By smartly combining these unimodal pre-trained models, we can leverage each model's representation learning capabilities to create performant VLMs on multimodal tasks.
One of the first ideas in Vision-Language Modeling was proposed in CLIP (Contrastive Language-Image Pre-training) by Radford et al. (2021). The authors questioned whether scalable pre-training methods that learn directly from web text could result in a breakthrough for vision akin to language modelling.
CLIP aimed to study the behaviours of image classifiers trained with natural language supervision at a large scale.
The core idea behind CLIP is to learn perception from the supervision contained in natural language. Natural language supervision has an added advantage over most unsupervised or self-supervised learning approaches: the model doesn't "just" learn a representation, it also connects that representation to language, enabling flexible zero-shot transfer.
While most vision models jointly train an image feature extractor and a linear classifier to predict some label, CLIP trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. The model learns a multimodal embedding space by jointly training an image encoder and text encoder to maximise the cosine similarity of the image and text embeddings of the real pairs in the batch while minimising the cosine similarity of the embeddings of the incorrect pairings.
A key point to note here is that CLIP uses a contrastive objective, not a predictive one. That is, the model doesn't try to predict the exact words of the text accompanying each image. This choice of learning paradigm is based on the success of contrastive representation learning over equivalent predictive objectives.
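To make the objective concrete, here is a minimal PyTorch sketch of the symmetric contrastive loss, assuming the image and text encoders have already produced one embedding per example (the temperature is learnable in the paper but fixed here for simplicity):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) tensors produced by the image and
    text encoders for matching (image, text) pairs.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions pulls true pairs together and
    # pushes every other pairing in the batch apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```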
Along with constructing a dataset of 400M image-text pairs, the authors benchmarked CLIP's zero-shot transfer performance on over 30 existing datasets and found it to be competitive with prior task-specific supervised models.
CLIP has had a significant impact on the field of Vision Language Models. It has been used to curate datasets for text-to-image generation models and to rank generated images, and it is a critical building block for many Vision Language Models, since most of them use frozen CLIP encoders to generate latent representations of images.
Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (BLIP) by Li et al. (2022) proposed a new multimodal mixture of Encoder-Decoder (MED) architecture for effective multi-task pre-training and flexible transfer learning. The MED can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder. Moreover, BLIP was jointly pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modelling.
Jointly optimising these objectives (a mixture of understanding and generation tasks) enables a single model to achieve state-of-the-art performance on a wide range of vision-language tasks.
To enable efficient pre-training while leveraging multi-task learning, the text encoder and text decoder share all parameters except the self-attention layers.
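A hedged sketch of how the three objectives might be combined in a single pre-training step; the `itc_loss`, `itm_loss`, and `lm_loss` methods stand in for the MED operating in its three modes and are illustrative placeholders, not BLIP's actual API:

```python
def blip_pretraining_step(model, images, captions):
    """One pre-training step combining BLIP's three objectives (sketch).

    `model` is assumed to expose the MED in its three modes; the method
    names below are illustrative placeholders, not the released BLIP code.
    """
    # Unimodal encoder mode: align image and text embeddings contrastively.
    loss_itc = model.itc_loss(images, captions)
    # Image-grounded text encoder mode: binary matched/unmatched classification.
    loss_itm = model.itm_loss(images, captions)
    # Image-grounded text decoder mode: autoregressive caption generation.
    loss_lm = model.lm_loss(images, captions)
    return loss_itc + loss_itm + loss_lm
```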
However, since Vision Language Models combine Vision and Language Models, it’s a natural assumption that they can leverage pre-trained unimodal models from the vision and language domains without requiring a joint pre-training strategy. This would lead to more efficient pre-training since we could focus solely on vision-language alignment.
A problem with this approach is that LLMs have not seen image data during their pre-training process. Several methods have been proposed to solve this “modality gap”. In this article, we’ll discuss BLIP-2, Frozen and Flamingo.
Li et al. (2023), in their paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, propose to solve this problem by efficiently extracting relevant visual tokens with a lightweight, attention-based Querying Transformer (Q-Former).
In particular, they break VLM pre-training into two steps: first, vision-language representation learning, which bootstraps the Q-Former from a frozen image encoder; and second, vision-to-language generative learning, which bootstraps text generation from a frozen LLM by feeding it the Q-Former's output tokens.
Since the querying transformer has been pre-trained to extract only the relevant information from images, it reduces the burden on the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem.
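As an illustration rather than the released BLIP-2 code, the querying transformer can be pictured as a small set of learned query vectors that cross-attend into the frozen image encoder's features and are then projected into the LLM's input space:

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Toy stand-in for BLIP-2's Q-Former: learned queries cross-attend
    to frozen image features and are projected to the LLM input space."""

    def __init__(self, num_queries=32, dim=768, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)  # stage-2 projection into the LLM

    def forward(self, image_feats):            # image_feats: (batch, patches, dim)
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Queries attend over the frozen encoder's patch features.
        visual_tokens, _ = self.cross_attn(q, image_feats, image_feats)
        # A fixed, small number of "soft visual prompts" for the frozen LLM.
        return self.to_llm(visual_tokens)
```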
In Multimodal Few-Shot Learning with Frozen Language Models (2021), Tsimpoukelli et al. introduce Frozen, a method for giving a pre-trained language model access to visual information in a way that extends its few-shot learning capabilities to a multimodal setting without changing its weights.
Frozen consists of a neural network trained to encode images into the word embedding space of a large pre-trained language model such that the language model generates captions for those images. The weights of the language model are kept frozen, but gradients are back-propagated through it to train the image encoder from scratch.
Since it uses a pre-trained language model, Frozen exhibits strong zero-shot performance on multimodal tasks that it was not trained on, such as visual question answering (VQA). Therefore, Frozen is a multimodal few-shot learner, bringing the language-only capabilities of rapid task adaptation enabled by prompting to a multimodal setting.
The authors refer to Frozen as a system for genuinely open-ended and unconstrained linguistic interpretation of images that often produces compelling output.
The Frozen architecture consists of a pre-trained language model and a vision encoder (a variant of NF-ResNet-50). A given raw image is transformed into a continuous sequence the transformer can consume by linearly mapping the vision encoder's output into a higher-dimensional vector and then reshaping the result into a sequence of embeddings, each with the same dimensionality as the language model's token embeddings. The authors refer to this sequence as a visual prefix, since it plays the same functional role in the transformer architecture as (part of) an embedding sequence of prefix tokens. During training, only the parameters of the vision encoder are updated, using paired image-caption data. This makes the system modular, since it reuses an existing language model, and simple, since it only involves training a visual encoder that relies on the capabilities of an existing language model.
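A rough sketch of the visual-prefix idea: a single linear layer maps the vision encoder's output into n_prefix x d_model values, which are reshaped into a short sequence of embeddings for the frozen language model (the class and dimension names here are assumptions for illustration):

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Maps vision-encoder features to a short sequence of token embeddings
    in the frozen language model's input space (a sketch of Frozen's idea)."""

    def __init__(self, vision_dim, d_model, n_prefix=2):
        super().__init__()
        self.n_prefix = n_prefix
        # Single linear map into n_prefix * d_model, then reshape to a sequence.
        self.proj = nn.Linear(vision_dim, n_prefix * d_model)

    def forward(self, vision_feats):                # (batch, vision_dim)
        x = self.proj(vision_feats)
        return x.view(x.size(0), self.n_prefix, -1)  # (batch, n_prefix, d_model)

# Training sketch: only the vision encoder and this projection are updated;
# gradients flow *through* the frozen LM from the captioning loss.
```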
While Frozen provides an excellent framework for prompting VLMs, it alone doesn't surpass SOTA performance; it is better viewed as a proof of concept that the knowledge in transformer language models can transfer to non-linguistic tasks.
Prompting has emerged as a critical technique for using foundation models. Alayrac et al. (2022) introduced Flamingo for few-shot learning on a wide range of open-ended vision and language tasks, simply by prompting the model with a few input/output examples.
To do this, the model must ingest a multimodal prompt containing images and/or videos interleaved with text. Flamingo models are therefore visually conditioned autoregressive text generation models, able to ingest a sequence of text tokens interleaved with images and videos and produce text as output.
This leads to an image-causal modelling task, wherein the full text-to-image cross-attention matrix is masked according to which visual tokens the model may see at each text token: a given text token attends only to the visual tokens of the image that appeared just before it in the interleaved sequence, rather than to all previous images. Although the model directly attends to only one image at a time, a dependency on all previous images remains via self-attention in the language model.
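The masking rule can be sketched as follows: given, for each text position, the index of the image that most recently preceded it, the mask permits cross-attention only to that image's visual tokens. This mirrors the behaviour described above rather than Flamingo's actual implementation:

```python
import torch

def per_image_cross_attention_mask(text_image_idx, num_images, tokens_per_image):
    """Builds a boolean mask of shape (text_len, num_images * tokens_per_image).

    text_image_idx[i] is the index of the image immediately preceding text
    token i in the interleaved sequence (-1 if no image has appeared yet).
    True means attention is allowed.
    """
    # Which image each visual token belongs to: (num_images * tokens_per_image,)
    visual_image_idx = torch.arange(num_images).repeat_interleave(tokens_per_image)
    # Allow attention only where the visual token's image matches the
    # most recent image seen before that text token.
    return text_image_idx.unsqueeze(1) == visual_image_idx.unsqueeze(0)

# Example: 2 images with 3 visual tokens each; the first two text tokens follow
# image 0, the third follows image 1.
mask = per_image_cross_attention_mask(torch.tensor([0, 0, 1]), 2, 3)
```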
Bommasani et al. (2021) first introduced the term foundation model to refer to any model trained on broad data at scale that can be adapted to a wide range of downstream tasks.
Florence: A New Foundation Model for Computer Vision by Yuan et al. (2021) aimed to create a foundation model for vision that expands representations along three axes: Space, from coarse scene-level understanding to fine-grained tasks such as object detection; Time, from static images to dynamic video; and Modality, from plain RGB images to videos and multi-channel inputs, all while transferring to downstream tasks via zero-/few-shot learning and full fine-tuning.
Florence was trained without an assumption made by traditional methods like CLIP: that each image-text pair has its own unique caption, so that all other captions can be treated as negative examples. This assumption becomes limiting when scaling the pre-training dataset, since in web-scale data multiple images can be associated with identical captions.
To achieve this, they employ a unified image-text contrastive learning (UniCL) paradigm in which the model is trained in an image-text-label space. In particular, given an image-text pair, they generate a triplet (x, t, y) via a text hash table, where x is the image, t is the language description (the hash value), and y is the language label (the hash key) indicating the index of the unique language description in the dataset. Identical language descriptions are therefore mapped to the same hash key, i.e., the same language label, and all image-text pairs mapped to the same label y are regarded as positives in the unified image-text contrastive learning.
This allows them to unify two fundamental paradigms: supervised image-label classification and contrastive image-text pre-training.
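A small sketch of the triplet construction described above, using a plain Python dictionary as the text hash table; identical descriptions collapse to the same label, so their images become positives for one another:

```python
def build_unicl_triplets(image_text_pairs):
    """Maps (image, description) pairs to (image, description, label) triplets.

    Identical descriptions receive the same integer label, so every image
    sharing a description is treated as a positive in the contrastive loss.
    """
    text_to_label = {}          # the "text hash table": description -> label
    triplets = []
    for image, text in image_text_pairs:
        if text not in text_to_label:
            text_to_label[text] = len(text_to_label)  # new hash key / label
        triplets.append((image, text, text_to_label[text]))
    return triplets

pairs = [("img1.jpg", "a dog"), ("img2.jpg", "a dog"), ("img3.jpg", "a cat")]
# img1 and img2 share label 0 and become mutual positives; img3 gets label 1.
triplets = build_unicl_triplets(pairs)
```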
However, there are two fundamental limitations to the Florence framework: transfer to many downstream tasks still relies on task-specific adapters and fine-tuning, and comprehensive visual annotations spanning many tasks remain scarce at web scale.
To address these issues, Xiao et al. (2023) released Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, which proposes a universal backbone trained through multitask learning with extensive visual annotations.
To develop a “universal” model capable of performing a range of tasks, the authors pre-train on several tasks spanning multiple levels of granularity: image-level understanding via image classification, captioning, and visual question answering; region/pixel-level recognition via object detection, segmentation, and referring expression comprehension; and fine-grained visual-semantic alignment.
Moreover, Florence-2 unifies all the above-mentioned tasks under a single sequence-to-sequence language modelling objective: a vision encoder converts images into visual token embeddings, which are concatenated with text embeddings and processed by a transformer-based multi-modal encoder-decoder to generate the response.
During training and inference, the model is prompted with task descriptions. If the prompt is plain text, such as “What does the image describe?”, no special formatting is applied. However, if the model is asked to perform region-specific tasks, localisation tokens representing quantised coordinates are added to the prompt. This enables the model to handle region-specific tasks under a language modelling paradigm and eliminates the need for task-specific adapters.
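To make the localisation-token idea concrete, here is a hedged sketch of quantising a bounding box into discrete location tokens appended to a text prompt; the `<loc_k>` token format and the number of bins are illustrative assumptions rather than Florence-2's exact vocabulary:

```python
def box_to_location_tokens(box, image_size, num_bins=1000):
    """Quantise a pixel-space box (x1, y1, x2, y2) into discrete location tokens.

    Each coordinate is mapped to one of `num_bins` bins relative to the image
    size, so regions can be expressed as ordinary tokens in a text sequence.
    """
    width, height = image_size
    x1, y1, x2, y2 = box
    bins = [
        int(x1 / width * (num_bins - 1)),
        int(y1 / height * (num_bins - 1)),
        int(x2 / width * (num_bins - 1)),
        int(y2 / height * (num_bins - 1)),
    ]
    return "".join(f"<loc_{b}>" for b in bins)

# A region-specific prompt: task description plus the quantised region.
prompt = "What does the region describe? " + box_to_location_tokens(
    (48, 96, 320, 480), image_size=(640, 640)
)
```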
Most of the models we have discussed so far limit the use of language to describe only the image content. While this allows us to map visual signals to language semantics, it leads to models that usually have a fixed interface with limited interactivity. More importantly, they can’t adapt to the user’s instructions.
On the other hand, Large Language Models (LLMs) have shown that language can be a universal interface for a general-purpose assistant. Moreover, recent works have used machine-generated high-quality instruction-following samples to improve the LLM’s alignment ability, reporting impressive performance compared with proprietary LLMs.
Following this, the authors of Visual Instruction Tuning present LLaVA, the first attempt to extend instruction tuning to the language-image multimodal space, using language-only GPT-4 to generate multimodal instruction-following data.
Moreover, although LLaVA is trained with a small multimodal instruction-following dataset (∼80K unique images), it demonstrates reasoning results similar to those of multimodal GPT-4. The LLaVA framework is also highly efficient: empirically, pre-training on the CC-595K dataset completes within 4 hours, finetuning on Instruct-158K within 10 hours, and finetuning on the ScienceQA dataset within 4 hours.

In their follow-up work, LLaVA-1.5, Liu et al. (2023) show that the fully connected vision-language connector is remarkably powerful with simple modifications. Their newer 13B checkpoint uses merely 1.2M publicly available samples and finishes full training in ∼1 day on a single 8xA100 node. These advancements can be attributed to replacing the original linear projection with an MLP vision-language connector, adding academic-task-oriented VQA data with response-formatting prompts, and scaling up the input resolution (a minimal sketch of such a connector follows below). To accommodate bigger images, thereby allowing the LLM to clearly “see” the details of images, they swap the encoder for the more modern CLIP-ViT-L-336px encoder.
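As a rough illustration of the connector mentioned above (dimensions are assumptions, not the exact LLaVA-1.5 configuration), a two-layer MLP maps frozen CLIP patch features into the LLM's embedding space:

```python
import torch.nn as nn

# LLaVA-1.5-style connector (sketch): a small MLP replaces the single
# linear projection used in the original LLaVA.
def build_mlp_connector(vision_dim=1024, llm_dim=4096):
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

# Visual patch features of shape (batch, num_patches, vision_dim) become a
# sequence of pseudo-token embeddings (batch, num_patches, llm_dim) for the LLM.
```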
Recently, Zhang et al. (2024) also released LLaVA-NeXT, which handles even bigger images, with the input resolution increased to roughly 4x more pixels, allowing it to grasp more visual detail. LLaVA-NeXT supports three aspect ratios, at up to 672x672, 336x1344, and 1344x336 resolution. It offers better visual reasoning and OCR capability, since it was trained with an improved visual instruction-tuning data mixture, while re-using the pre-trained connector of LLaVA-1.5 and fewer than 1M visual instruction-tuning samples during training.
The authors use a dynamic strategy to accommodate images of various resolutions: the image is divided into smaller patches at the resolution the vision encoder was originally trained on, and each patch is encoded independently. The resulting features are then combined into a single large feature map at the target resolution and fed to the LLM.
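A simplified sketch of that tiling strategy, assuming the vision encoder expects a 336x336 input and ignoring the additional downscaled global view used in practice:

```python
from PIL import Image

def split_into_tiles(image, base=336):
    """Resize an image to a grid of base-resolution tiles so each tile can be
    encoded independently (sketch of the dynamic high-resolution strategy)."""
    cols = max(1, round(image.width / base))
    rows = max(1, round(image.height / base))
    resized = image.resize((cols * base, rows * base))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * base, r * base, (c + 1) * base, (r + 1) * base)
            tiles.append(resized.crop(box))
    # Each tile goes through the vision encoder; the resulting feature maps are
    # then stitched back into one large feature map and fed to the LLM.
    return tiles

tiles = split_into_tiles(Image.new("RGB", (1344, 336)))  # -> 4 tiles of 336x336
```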
In the latest release, LLaVA-OneVision (Li et al., 2024), a family of open large multimodal models (LMMs), pushes the performance boundaries of open-source LMMs across three crucial vision settings: single-image, multi-image, and video scenarios.
Simply scaling up the LLM achieves performance comparable to GPT-4V on selected benchmarks. It also employs a new Higher AnyRes strategy as a flexible visual representation framework adaptable for multi-image and video representation.
This article gave an overview of recent developments in the field of Vision Language Models. From early contrastive learning approaches like CLIP to more advanced models like Flamingo and LLaVA, these systems are increasingly capable of tasks like image captioning, visual question answering, and following complex instructions involving visual content.
Saurav,
Machine Learning Advocate Engineer
lightly.ai