Multi-modal Learning: A Look Back and the Road Ahead

Advancements in language models have spurred increasing interest in multi-modal AI — models that process and understand information across multiple forms of data, such as text, images, and audio. While the goal is to emulate the human-like ability to handle diverse information, a key question is: do human-defined modalities align with machine perception? If not, how does this misalignment affect AI performance? In this blog, we examine these questions by reflecting on the community's progress in developing multi-modal benchmarks and architectures and by highlighting their limitations. By reevaluating our definitions and assumptions, we propose ways to better handle multi-modal data by building models that analyze and combine modality contributions both independently and jointly with other modalities.

Introduction

We constantly use multiple senses to interact with the world around us. We use vision to see, olfaction to smell, audition to hear, and we communicate through speech. Similarly, with recent advancements in multi-modal artificial intelligence (AI), we now see articles announcing that “ChatGPT can see, hear and speak”. But there’s a fundamental question underlying this progress:

To unpack this question, we show examples that illustrate the basic ambiguity of defining modalities for machine learning.

A dog image with caption represented with RGB channels and texture as four image modalities, alongside tokenization strategies as modalities for text.

Example 1: Images

Images are commonly considered a single modality for both humans and AI. For humans, an image is a unified representation of visual information. However, machine learning models can perceive images in various ways:
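For instance, a single image tensor can be decomposed into per-channel and texture views that a model could treat as separate input streams. The snippet below is a minimal sketch of the four image “modalities” shown in the figure above (R, G, B, and a crude texture map); the tensor shapes are assumed.

import torch

def decompose_image(image: torch.Tensor) -> dict:
  """Split one RGB image tensor of shape [3, H, W] into candidate 'modalities'."""
  red, green, blue = image[0], image[1], image[2]   # individual color channels
  texture = image.mean(dim=0)                       # crude grayscale/texture map
  return {"red": red, "green": green, "blue": blue, "texture": texture}

# A random 224x224 RGB image yields four single-channel views
views = decompose_image(torch.rand(3, 224, 224))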

These decompositions raise an important question:

Example 2: Text

Text is commonly treated as a single modality in language models; however, the method of tokenization—how text is broken into atomic units—can vary widely:

This leads us to ask:

Although the term “multi-view” is sometimes used to describe these variations, the line between “multi-view” and “multi-modal” is often blurred, adding to the confusion. This is an important question because different tokenization strategies often perform differently on different tasks. While subword tokenizers such as BPE and UnigramLM are commonly used for LLMs, character-level tokenization works better than word-level tokenization for multilingual translation.
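As a small illustration of how the choice of atomic units changes what the model sees, the snippet below contrasts character-level and word-level splits in plain Python; subword schemes such as BPE would require a tokenizer library and are omitted here.

text = "Multi-modal learning"

# Character-level tokens: every character is an atomic unit
char_tokens = list(text)      # ['M', 'u', 'l', 't', 'i', '-', ...]

# Word-level tokens: whitespace-delimited units
word_tokens = text.split()    # ['Multi-modal', 'learning']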

Example 3: Medical data

Consider diagnosing a skin lesion using both image data and tabular features, such as the patient’s age, demographic information, and characteristics of the lesion along with its anatomical region. Each modality contributes to detecting the lesion both independently and jointly with the other. This prompts the question:

Why is modality grouping information important?

The definition and grouping of modalities matter because they affect how we design models to process and integrate different types of data. Often, the objective of prior studies is to ensure that models capture interactions between all modalities for the downstream task. This goal has led to two parallel lines of work. One approach focuses on developing new benchmarks to capture this interaction. These benchmarks often exhibit uni-modal biases, leading to subsequent iterations or new benchmarks intended to better evaluate multi-modal capabilities. The other approach emphasizes building complex architectures designed to learn interactions between modalities more effectively.

In this blog post, we examine the community’s progress in both of these directions and why they have fallen short of meaningful impact. We then propose ways to analyze and capture the relative importance of individual modalities and their interaction for the downstream task.

Reflection on the Progress so Far

Over the past decade, numerous multi-modal benchmarks and model architectures have been proposed to evaluate and enhance the multi-modal learning capabilities of AI models. Yet, we continue to struggle with making meaningful progress due to benchmarks not accurately representing real-world scenarios and models failing to capture the essential multi-modal dependencies effectively. We reflect on the progress made in these two areas and discuss why these approaches have not been sufficient in obtaining the desired results.

Are benchmarks enough?

List of benchmarks for multi-modal learning from a recent work.

Numerous benchmarks have been developed for multi-modal learning (see Figure above). Many were initially designed to evaluate whether models could effectively capture interactions between modalities for downstream tasks. Early benchmarks often exhibited uni-modal biases, allowing models to perform well even while ignoring certain modalities. Although subsequent iterations attempted to address these issues, the updates often failed to fully resolve the underlying problems. Instead, they highlight persistent challenges that reflect the complexity of real-world scenarios, where multi-modal models continue to struggle.

For instance, consider visual question answering (VQA), which has undergone multiple updates since its initial release in 2015:

  1. VQAv2 (2016): Attempted to mitigate language bias by pairing each question with two similar images that lead to different answers. Despite subsequent benchmark updates, VQAv2 continues to be used in research and in evaluations of recent models like Gemini, ChatGPT, and LLaVA.
  2. VQA-CP (2018): Adjusted answer distributions between training and test sets to reduce prior biases.
  3. VQA-CE (2021): Emphasized multi-modal spurious dependencies among image, text, and answer to better capture multi-modal interactions.
  4. VQA-VS (2022): Broadened the scope to explore both uni-modal and multi-modal shortcuts in a more comprehensive manner.

Even after a decade of refinements, VQA benchmarks continue to grapple with biases and limitations, raising concerns about whether constructing ever more benchmarks is the right direction. While many benchmarks aim to evaluate different capabilities of models, they often yield only incremental improvements. Similar challenges are evident in non-VQA benchmarks such as Natural Language Visual Reasoning (NLVR), action recognition using RGB frames combined with optical flow and audio, 3D object classification employing front and rear views as two modalities, and many others.

These examples highlight a critical limitation:

Are complex multi-modal architectures enough?

Conventional multi-modal methods use different fusion strategies to tackle the underlying task effectively.

In the task of VQA, as shown in the Figure above, many methods process the image and text modalities into representations of the same dimension and then use one of many conventional “multi-modal” algorithms (see the sketch after this list):

  1. Early fusion: These methods concatenate modality features and then process them jointly. This involves using a unified encoder with parameters shared across all modalities, an approach found in methods ranging from early multi-modal learning to recent transformer-based models.
  2. Intermediate fusion: These methods fuse representations at specific intermediate layers, rather than sharing all parameters across modalities.
  3. Late fusion: These methods learn separate encoder representations for each modality, followed by fusion. The fusion step often uses additional layers on top of the individual representations to capture the interaction between modalities.
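As referenced above, here is a minimal PyTorch sketch contrasting early and late fusion for two pre-extracted feature vectors; the layer sizes and module names are illustrative assumptions, not a reference implementation.

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
  """Concatenate modality features first, then process them jointly."""
  def __init__(self, dim1, dim2, num_classes):
    super().__init__()
    self.joint = nn.Sequential(
      nn.Linear(dim1 + dim2, 256), nn.ReLU(), nn.Linear(256, num_classes)
    )

  def forward(self, x1, x2):
    return self.joint(torch.cat([x1, x2], dim=-1))

class LateFusion(nn.Module):
  """Encode each modality separately, then fuse the learned representations."""
  def __init__(self, dim1, dim2, num_classes):
    super().__init__()
    self.enc1 = nn.Sequential(nn.Linear(dim1, 128), nn.ReLU())
    self.enc2 = nn.Sequential(nn.Linear(dim2, 128), nn.ReLU())
    self.head = nn.Linear(128 + 128, num_classes)   # fusion layer on top of the individual encoders

  def forward(self, x1, x2):
    return self.head(torch.cat([self.enc1(x1), self.enc2(x2)], dim=-1))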
Examples of scenarios where GPT-4V fails to accurately answer questions about the provided images.

Despite a decade of advancements in developing these complex architectures, many multi-modal models continue to struggle with effectively integrating vision and text, often disregarding one modality in the process (see Figure above). For example, while humans consistently achieve around 95% accuracy on VQA, recent AI models such as GPT-4V and Gemini only reach about 40%, with others like LLaVA-1.5, Bing Chat, mini-GPT4, and Bard performing even worse—sometimes falling below random performance levels.

Some studies attribute this persistent performance gap to inherent imbalances or design flaws within the benchmarks themselves, while others argue that the issue lies with models failing to effectively learn from the interaction of modalities. This debate has prompted successive iterations of benchmarks and models aimed at addressing these challenges. Several survey papers provide a comprehensive overview of recent developments in the field.

Towards Meaningful Progress in Multi-modal Learning

To drive meaningful progress in multi-modal learning, we need to move away from simply creating more benchmarks or building increasingly complex architectures. While these efforts have advanced the field incrementally, they haven’t tackled its fundamental challenges. Instead, we propose approaching the field from the two-stage perspective below:

Analysis of the strengths of individual and combination of modalities

Relative inter- and intra-modality dependency strengths across multiple benchmarks.

For any dataset or benchmark, we recommend starting by stating the assumption of what constitutes a modality in the context of the desired task. This definition should not be limited to human-perceived notions of modality; instead, these assumptions should be critically evaluated and challenged. The goal, however, is not merely to label an input as multi-modal; rather, it is to assess whether such labeling provides meaningful advantages for model performance or understanding.

To answer that question, we recommend a thorough examination of the dependencies of each defined modality on the target task, both individually and jointly with other modalities. These dependencies can be categorized as intra-modality dependencies, which capture the interaction between each individual modality and the target label, and inter-modality dependencies, which capture the joint interaction between combinations of modalities and the label.
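A simple way to probe these two kinds of dependencies is to compare uni-modal models, a joint model, and a chance baseline on held-out data. The helper below is a hedged sketch, assuming already-trained classifiers that expose a scikit-learn-style score(features, labels) method; the names and interface are assumptions for illustration.

import numpy as np

def dependency_report(model_x1, model_x2, model_joint, X1_test, X2_test, y_test, num_classes):
  """Compare held-out accuracy of uni-modal and joint models against chance."""
  chance = 1.0 / num_classes
  acc_x1 = model_x1.score(X1_test, y_test)      # intra-modality: modality 1 alone
  acc_x2 = model_x2.score(X2_test, y_test)      # intra-modality: modality 2 alone
  joint_features = np.concatenate([X1_test, X2_test], axis=-1)
  acc_joint = model_joint.score(joint_features, y_test)   # joint model: intra + inter
  # Uni-modal accuracy close to the joint accuracy suggests intra-modality
  # dependencies dominate; uni-modal accuracy near chance alongside a strong
  # joint model suggests inter-modality dependencies dominate.
  return {"chance": chance, "x1_only": acc_x1, "x2_only": acc_x2, "joint": acc_joint}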

Several studies have evaluated benchmarks involving images and text as two modalities, and we illustrate in the Figure above how these dependencies differ across benchmarks, based on prior studies. For datasets such as SQA-I, MMMU, fastMRI, MathVista, and UCF101, intra-modality dependencies are more important, as removing one modality altogether does not affect model performance. Conversely, for datasets like MMB, SEED-I, GQA, ChartQA, and others, both inter- and intra-modality dependencies matter, as both uni-modal and inter-modality models achieve better-than-random performance. For benchmarks such as MME, OCRBench, NLVR2, and others, inter-modality dependencies are more important: removing one modality results in random performance, underscoring the importance of modality interactions.

For all these benchmarks, specific architectural choices—such as the type of fusion method or the backbone architecture for vision and language models—have minimal impact on performance, as noted in prior studies. Thus:

This analysis provides a better understanding of the importance of each modality and their interaction in the corresponding dataset. Such understanding enables us to prioritize and weight these contributions when constructing multi-modal models, as elaborated in the next section.

Building better multi-modal models

Product of experts model combining the inter-modality and intra-modality models.

With a clear understanding of each modality’s contribution, we can prioritize the dependencies that are the most important, or simply build a “product of experts” approach to combine these dependencies (see Figure above). In particular, for two modalities \(\{x_1, x_2\}\) and label \(y\), the output can be expressed as the product of two sets of models: one capturing the importance of individual modalities for the label (intra-modality predictors) and the other focusing on the importance of their interactions for the label (inter-modality predictor), as follows:
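In symbols, writing \(p_{\theta_1}(y \mid x_1)\) and \(p_{\theta_2}(y \mid x_2)\) for the intra-modality predictors and \(p_{\phi}(y \mid x_1, x_2)\) for the inter-modality predictor (notation introduced here for illustration), the combined prediction is

\[
p(y \mid x_1, x_2) \;\propto\; p_{\theta_1}(y \mid x_1)\, p_{\theta_2}(y \mid x_2)\, p_{\phi}(y \mid x_1, x_2),
\]

which the code below implements additively in log-space, followed by a log-sum-exp renormalization.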

import torch

def forward(self, x_1, x_2):
  """
  Forward pass for multi-modal classification using a Product of Experts (PoE).
  Combines predictions from two intra-modality experts and one inter-modality expert.
  Assumes the owning module defines self.intra_model_1, self.intra_model_2, and
  self.inter_model, each mapping features to unnormalized class logits.

  Args:
    x_1 (Tensor): Feature input from modality 1
    x_2 (Tensor): Feature input from modality 2

  Returns:
    Tensor: Log-probabilities over output classes, shape [batch_size, num_classes]
  """
  # Intra-modality predictors (separate model for each modality)
  intra_output_1 = self.intra_model_1(x_1)
  intra_output_2 = self.intra_model_2(x_2)

  # Inter-modality predictor (fusion model over both modalities)
  inter_input = torch.cat([x_1, x_2], dim=-1)
  inter_output = self.inter_model(inter_input)

  # Convert each expert's logits to log-probabilities
  log_probs_intra_1 = torch.log_softmax(intra_output_1, dim=-1)
  log_probs_intra_2 = torch.log_softmax(intra_output_2, dim=-1)
  log_probs_inter = torch.log_softmax(inter_output, dim=-1)

  # Product of Experts: multiplying probabilities means adding log-probabilities
  combined_log_probs = log_probs_intra_1 + log_probs_intra_2 + log_probs_inter

  # Renormalize so the output is a valid log-probability distribution
  log_normalizer = torch.logsumexp(combined_log_probs, dim=-1, keepdim=True)
  log_probs = combined_log_probs - log_normalizer

  return log_probs

The code above combines the output log probabilities in an additive way and has been shown to work effectively across various healthcare, language, and vision benchmarks, even when the relative strength of these dependencies is not known. Such an approach does not fundamentally alter the multi-modal learning problem; instead, it offers a structured way to balance individual and joint modality contributions. By explicitly modeling the importance of individual modalities, this approach mitigates uni-modal biases.
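For completeness, a minimal usage sketch is shown below; the linear experts, feature dimensions (512 and 64), and class count are placeholder assumptions, and the snippet reuses the forward pass defined above.

import torch
import torch.nn as nn

class PoEClassifier(nn.Module):
  """Illustrative container for the product-of-experts forward pass above."""
  def __init__(self, dim1=512, dim2=64, num_classes=10):
    super().__init__()
    self.intra_model_1 = nn.Linear(dim1, num_classes)       # expert for modality 1 alone
    self.intra_model_2 = nn.Linear(dim2, num_classes)       # expert for modality 2 alone
    self.inter_model = nn.Linear(dim1 + dim2, num_classes)  # joint inter-modality expert

  forward = forward  # reuse the forward pass defined in the snippet above

model = PoEClassifier()
log_probs = model(torch.rand(8, 512), torch.rand(8, 64))   # shape [8, 10] log-probabilities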

This comes with the trade-off of increased parameter requirements, which could impact efficiency. We believe future research should focus on optimizing this framework to reduce its computational cost. Progress in this direction is important, as current trends often attempt to address these challenges by either expanding datasets or increasing architectural complexity as highlighted above—approaches that have not led us to efficient or scalable solutions.

Takeaway

Current approaches to multi-modal learning tend to overemphasize the interaction between modalities for downstream tasks, resulting in benchmarks and architectures narrowly focused on modeling these interactions. In real-world scenarios, however, the strength of these interactions is often unknown. To build more effective multi-modal models, we need to shift our focus toward holistically understanding the independent contributions of each modality as well as their joint impact on the target task.

Acknowledgement

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) with a grant funded by the Ministry of Science and ICT (MSIT) of the Republic of Korea in connection with the Global AI Frontier Lab International Collaborative Research, Samsung Advanced Institute of Technology (under the project Next Generation Deep Learning: From Pattern Recognition to AI), National Science Foundation (NSF) award No. 1922658, Center for Advanced Imaging Innovation and Research (CAI2R), National Center for Biomedical Imaging and Bioengineering operated by NYU Langone Health, and National Institute of Biomedical Imaging and Bioengineering through award number P41EB017183.

For attribution in academic contexts, please cite this work as
        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
  
BibTeX citation
        PLACEHOLDER FOR BIBTEX