Bridging the Modality Gap: Enhancing Document Retrieval with Multimodal Embeddings

In an era where documents seamlessly blend rich textual and visual content, traditional retrieval systems often fall short. This post explores how Multimodal Embeddings bridge the modality gap, introducing advanced models such as VISTA and ColPali, along with innovative evaluation methods, enhanced model interpretability, and an analysis of their performance.

In today’s data-driven world, the volume and complexity of information have grown exponentially. Documents are no longer confined to plain text; they now encompass a rich blend of images, charts, tables, and intricate layouts. This evolution presents a significant challenge: How do we effectively retrieve and analyze information from such complex, multimodal documents?
Traditional retrieval systems, primarily designed for text-only data, often falter when faced with this complexity. They struggle to extract and interpret the valuable information embedded within various modalities. This limitation hampers our ability to harness the full potential of the data at our disposal.
Enter multimodal embeddings—an approach that leverages both textual and visual data to transform document retrieval. By bridging the modality gap between data types, multimodal vector search (MVS) promises to make information retrieval more accurate and efficient than ever before.
In this blog post, we’ll delve into:

  • The modality gap and why it makes retrieval from complex documents hard
  • Advanced multimodal retrieval models such as VISTA and ColPali
  • Interpretability through attention-based heatmaps
  • Innovative evaluation methods and ranking metrics
  • Experimental results on the ViDoRe Benchmark

Introduction

Figure 1: Integration of text and visual modalities in document retrieval.

The rise of sophisticated document formats that integrate text, images, and complex layouts has rendered traditional text-based retrieval systems inadequate. The richness of multimodal documents requires systems that can understand and process multiple data types simultaneously. The modality gap—the disconnect between different types of data representations—poses a significant hurdle to effective retrieval.

To bridge this gap, advanced multimodal systems are essential. By aligning and embedding various data types, these systems not only interpret complex documents but also open doors to more powerful, nuanced information retrieval across diverse formats.

Challenges in Document Retrieval

Figure 2: Illustration of the increasing levels of detail in images, from basic visuals to more information-dense representations.

Modality-Specific Challenges

Retrieving information from today’s complex documents is a multifaceted problem that requires a nuanced approach. Documents now often contain:

  • Textual data in varied languages and styles
  • Visual elements such as images, charts, and diagrams
  • Structural layouts with tables, multi-column text, and embedded figures

Each modality presents unique technical obstacles:

  1. Textual Data:
    • Language Ambiguity: Synonyms, homonyms, and context-dependent meanings.
    • Multilingual Content: Documents may contain multiple languages or dialects.
  2. Visual Data:
    • Image Quality: Low-resolution images can hinder recognition.
    • Complex Visuals: Diagrams and charts may contain dense information that’s hard to parse.
  3. Structural Layout:
    • Non-linear Reading Paths: Multi-column texts and inserts can confuse linear text processors.
    • Embedded Elements: Images and tables interwoven with text complicate parsing.

Addressing these challenges requires advanced retrieval systems that are capable of seamlessly integrating and processing each modality, making it possible to extract meaningful insights from even the most complex document formats.

Modality Gap

The modality gap refers to the disconnect between text and image embeddings in multimodal models. Despite shared semantic objectives, these embeddings often occupy distinct regions of the embedding space, making it difficult for models to relate information across modalities.

Causes of the Gap

  1. Separate Embedding Spaces: Text and images are encoded differently, forming distinct clusters.
  2. Contrastive Learning Bias: Training inadvertently emphasizes modality-specific features.
  3. Initialization Bias: Pretrained encoders begin with cone-shaped distributions, reinforcing separation.

Evidence from Flickr8k Dataset

We used 1,000 image-caption groups from the Flickr8k test set; each group consists of a single image and the five captions that describe it. Embeddings were generated for both the captions and the images, and cosine similarities between them were computed and visualized.

Figure 3: Visualization of the modality gap between text and image embeddings.
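A comparison of this kind can be reproduced with an off-the-shelf CLIP model. The sketch below is not the exact setup behind Figure 3—the checkpoint, the example captions, and the image paths are assumptions—but it shows how text and image embeddings from the two encoders can be compared via cosine similarity.

```python
# Sketch: measure the modality gap on image-caption pairs with CLIP.
# Model choice, captions, and file paths are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["A dog runs across a grassy field.", "A child plays on a swing."]
images = [Image.open(p) for p in ["dog.jpg", "child.jpg"]]  # assumed paths

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=captions, return_tensors="pt", padding=True)
    )
    image_emb = model.get_image_features(
        **processor(images=images, return_tensors="pt")
    )

# L2-normalize so dot products are cosine similarities.
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)

# A pronounced modality gap shows up when matching image-text pairs are less
# similar to each other than texts are to other texts (or images to images).
cross = (text_emb * image_emb).sum(dim=-1)
text_text = text_emb @ text_emb.T
print("cross-modal (matched pairs):", cross)
print("text-text similarities:\n", text_text)
```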

Implications

Because text and image embeddings cluster apart, a text query tends to be more similar to other texts than to the image that actually answers it. This depresses cross-modal retrieval scores and makes raw cosine similarities hard to compare across modalities.

For more details on the modality gap, refer to this article.

Limitations of Traditional Models

Text-Based Models

Text-based retrieval models—leveraging techniques like TF-IDF, BM25, and transformer-based embeddings such as BERT—have been the cornerstone of information retrieval. They excel at understanding and retrieving information when text is the primary medium.

Limitations:

  • Visual content is ignored entirely: images, charts, and diagrams contribute no signal.
  • Layout and spatial structure (tables, multi-column text) are lost once a document is flattened to plain text.
  • Scanned documents require OCR, so retrieval quality inherits OCR errors.

Vision-Based Models

Vision-based retrieval models, utilizing architectures like convolutional neural networks (CNNs) or vision transformers (e.g., ViT, Swin Transformer), extract features from visual content, focusing on images, diagrams, and spatial layouts.

Limitations:

  • Fine-grained textual semantics—such as the exact wording of a clause or a figure caption—are hard to capture from pixels alone.
  • Dense, text-heavy pages are poorly represented by purely visual features.
  • They generally lack the language understanding needed to match abstract or keyword-style queries.

Advanced Evaluation Metrics for Ranking

Evaluating retrieval systems, especially those handling multimodal data, demands metrics that account for both relevance and ranking position.

Mean Reciprocal Rank (MRR)

Definition: MRR measures how highly, on average, the first relevant document is ranked across queries.

\[\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}\]

Where:

  • \(|Q|\) is the number of queries.
  • \(\text{rank}_i\) is the rank position of the first relevant document for the \(i\)-th query.
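As a quick worked example with made-up ranks: if the first relevant document for three queries appears at ranks 1, 3, and 5, then

\[\text{MRR} = \frac{1}{3}\left(1 + \frac{1}{3} + \frac{1}{5}\right) \approx 0.51.\]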

Importance:

MRR suits tasks where the user only needs one good answer, since it rewards systems that surface a relevant document at or near the top of the ranking.

Normalized Discounted Cumulative Gain (nDCG)

Definition: nDCG evaluates ranking quality by considering the position of relevant documents and assigning higher importance to top-ranked results.

\[\text{nDCG}_p = \frac{1}{\text{IDCG}_p} \sum_{i=1}^p \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}\]

Where:

  • \(\text{rel}_i\) is the graded relevance of the document at position \(i\).
  • \(p\) is the rank cutoff (e.g., 5 for nDCG@5).
  • \(\text{IDCG}_p\) is the DCG of the ideal ranking, which normalizes the score to \([0, 1]\).
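As a worked example with made-up relevance grades: if the documents at ranks 1 to 3 have relevances \((1, 0, 2)\), then

\[\text{DCG}_3 = \frac{2^1 - 1}{\log_2 2} + \frac{2^0 - 1}{\log_2 3} + \frac{2^2 - 1}{\log_2 4} = 1 + 0 + 1.5 = 2.5,\]

while the ideal ordering \((2, 1, 0)\) gives \(\text{IDCG}_3 \approx 3.63\), so \(\text{nDCG}_3 \approx 0.69\).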

Importance:

nDCG accounts for graded relevance and discounts documents that appear lower in the ranking, making it a good fit when the order of the top results matters. It is also the primary metric reported for the ViDoRe Benchmark below (nDCG@5).

Mean Average Precision (MAP)

Definition: MAP computes the mean of average precision scores across all queries.

\[\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q), \qquad \text{AP}(q) = \frac{1}{m_q} \sum_{k=1}^{n_q} \text{Precision}(k) \times \text{rel}(k)\]

Where:

  • \(|Q|\) is the number of queries.
  • \(m_q\) is the number of relevant documents for query \(q\).
  • \(n_q\) is the number of retrieved documents for query \(q\).
  • \(\text{Precision}(k)\) is the precision of the top \(k\) results.
  • \(\text{rel}(k)\) is 1 if the document at rank \(k\) is relevant and 0 otherwise.

Importance:

MAP summarizes precision across all relevant documents and all queries, rewarding systems that rank every relevant document highly rather than just the first one.
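These three metrics are straightforward to implement. The sketch below operates on ranked lists of relevance labels and is a self-contained reference, not the evaluation code used later in this post; the toy rankings at the bottom are made up for illustration.

```python
# Reference implementations of MRR, nDCG@k, and MAP for ranked relevance labels.
# Each ranking is a list of relevance grades in the order the system returned
# the documents (e.g., [0, 2, 0, 1] means the 2nd result has grade 2).
import math

def mrr(rankings):
    total = 0.0
    for rels in rankings:
        for i, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / i
                break
    return total / len(rankings)

def dcg(rels, k):
    return sum((2**rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

def average_precision(rels):
    # Assumes every relevant document appears somewhere in the ranked list.
    hits, score = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel > 0:
            hits += 1
            score += hits / i   # precision at this rank
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

rankings = [[0, 2, 0, 1], [1, 0, 0, 0]]   # toy examples
print(mrr(rankings), ndcg(rankings[0], k=3), mean_average_precision(rankings))
```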

Recent Advancements in Multimodal Retrieval Models

VISTA: Vision-Augmented Text Embeddings

Figure 4: VISTA's architecture integrates visual tokens into text embeddings.

Overview:

VISTA (Visualized Text Embedding For Universal Multi-Modal Retrieval) is a model that aims to enhance text embeddings with visual information, effectively bridging the modality gap.

How VISTA Works:

VISTA couples a vision encoder with a pretrained text embedding model: images are turned into visual tokens that are fed into the text encoder alongside the regular word tokens, so text-only, image-only, and mixed text-image inputs all land in the same embedding space (see Figure 4).

Strengths:

  • Builds on an existing strong text embedding model, so pure-text retrieval quality is largely preserved.
  • A single model handles text, images, and combined text-image queries and documents.

Limitations:

  • As the ViDoRe results below show, it trails document-focused vision-language retrievers such as ColPali and ColQwen2 on visually dense pages (charts, tables, infographics).
  • Visual detail is compressed into a relatively small number of visual tokens, limiting how much fine-grained page structure the embedding can capture.

ColPali: Efficient Document Retrieval with Vision Language Models

Figure 5: ColPali's unified embedding space for text and images.

Overview:

ColPali is a state-of-the-art multimodal retrieval model that leverages Vision-Language Models (VLMs) to create a unified embedding space for text and images.

How ColPali Works:

ColPali feeds each document page into a Vision-Language Model as an image, with no OCR or layout-parsing step, and produces a multi-vector embedding: one vector per image patch. Queries are likewise embedded token by token, and relevance is scored with a ColBERT-style late-interaction mechanism in which each query token is matched against its most similar patch embedding and the resulting maxima are summed.
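Below is a minimal sketch of this late-interaction (MaxSim) scoring step in PyTorch, operating on already-computed query-token and page-patch embeddings. Producing those embeddings with an actual ColPali checkpoint is out of scope here, so random tensors stand in, and the dimensions are illustrative.

```python
# Late-interaction (ColBERT/ColPali-style MaxSim) scoring sketch.
# query_emb: (num_query_tokens, dim)        -- one vector per query token
# page_embs: (num_pages, num_patches, dim)  -- one vector per patch per page
# Real embeddings would come from a ColPali checkpoint; these are stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, num_query_tokens, num_pages, num_patches = 128, 12, 4, 1030
query_emb = F.normalize(torch.randn(num_query_tokens, dim), dim=-1)
page_embs = F.normalize(torch.randn(num_pages, num_patches, dim), dim=-1)

# Similarity of every query token to every patch of every page:
# result shape (num_pages, num_query_tokens, num_patches)
sim = torch.einsum("qd,npd->nqp", query_emb, page_embs)

# MaxSim: each query token keeps only its best-matching patch; sums give page scores.
scores = sim.max(dim=-1).values.sum(dim=-1)   # (num_pages,)
print("page ranking:", scores.argsort(descending=True).tolist())
```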

Strengths:

  • No OCR, layout detection, or chunking pipeline is needed: the page image is the input.
  • Text, figures, tables, and layout are captured jointly, which pays off on visually rich documents.
  • Strong results on the ViDoRe Benchmark, with its successor ColQwen2 topping the comparison below.

Limitations:

  • Multi-vector page embeddings are considerably larger than single-vector embeddings, increasing storage and index size.
  • Late-interaction scoring is more expensive than a single dot product, which can matter at very large corpus scales.

Interpretability of Advanced Models in Documents

Importance of Interpretability

Understanding how a model like ColPali makes retrieval decisions is crucial, particularly in sensitive domains like finance, where accuracy and accountability are paramount.

Generating Heatmaps Using Attention Mechanisms

Models such as ColPali leverage attention mechanisms to produce interpretable heatmaps that highlight the most relevant regions of a document in response to a query. By computing attention scores between query tokens and image patches, the model identifies which parts of the document image are most influential in the retrieval process.

How Heatmaps are Generated:

  1. Image Patches and Query Tokens: The document image is divided into fixed-size patches, and the query is broken down into individual tokens.
  2. Embedding Computation: The model computes embeddings for both the image patches and the query tokens using its vision and language encoders.
  3. Attention Score Calculation: Attention scores are calculated by taking the dot product between each query token embedding and each image patch embedding.
  4. Normalization and Mapping: These scores are normalized to highlight the most significant interactions and are mapped back onto the spatial layout of the image patches.
  5. Visualization: The normalized attention scores are overlaid onto the original document image, creating heatmaps that visually represent the areas of focus for each query token.

This approach provides a transparent way to understand which parts of the document are most relevant to the query, combining both textual and visual information.
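The sketch below illustrates steps 3–5 on precomputed embeddings: it builds a per-token score map and overlays it on the page image with matplotlib. The patch grid size, the image path, and the use of random embeddings in place of real model outputs are all assumptions made for illustration.

```python
# Sketch: turn query-token x image-patch similarity scores into a heatmap overlay.
# Embeddings are random stand-ins; real ones would come from the retrieval model.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from PIL import Image

torch.manual_seed(0)
dim, n_tokens, grid = 128, 6, 32                # assumed 32x32 patch grid
query_emb = F.normalize(torch.randn(n_tokens, dim), dim=-1)
patch_emb = F.normalize(torch.randn(grid * grid, dim), dim=-1)

page = Image.open("report_page.png").convert("RGB")   # assumed path

token_idx = 0                                   # visualize one query token
scores = query_emb[token_idx] @ patch_emb.T     # dot products (step 3)
scores = (scores - scores.min()) / (scores.max() - scores.min())  # normalize (step 4)
heatmap = scores.reshape(grid, grid)            # map back to the patch layout

plt.imshow(page)
plt.imshow(heatmap.numpy(), cmap="jet", alpha=0.4,
           extent=(0, page.width, page.height, 0))    # overlay on the page (step 5)
plt.axis("off")
plt.title(f"Attention heatmap for query token {token_idx}")
plt.show()
```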

Heatmap Insights: Case Studies

Example 1: Alibaba’s 10-K Report

Figure 6: Heatmaps overlaid on Alibaba's 10-K report highlighting "artificial" and "intelligence".

Example 2: Royal Bank of Canada’s Annual Report

Figure 7: Heatmaps highlighting relevant terms "stock" and "compensation" in RBC's annual report.

These examples showcase how ColPali’s use of attention-based heatmaps provides interpretable insights into its retrieval decisions, highlighting the relevance between query tokens and document regions.

For more details on the interpretability and the underlying code, refer to this article.

Innovative Evaluation Methods (Simple yet Effective)

Feature Search Experiments

Objective:

Assess how effectively models generate embeddings that capture similarities between documents across modalities.

Methodology:

  1. Embedding Extraction: Used models like BGE-M3, VISTA, ColPali, and ColQwen.
  2. Similarity Computation: Calculated cosine similarity between document embeddings.
  3. Evaluation Metrics: Employed metrics such as Precision@K, Recall@K, MRR, MAP, and nDCG.
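A minimal sketch of steps 2 and 3, given embeddings already extracted by one of the models above, might look as follows. NumPy is used here, and the array shapes and relevance matrix are illustrative assumptions rather than the actual experimental data.

```python
# Sketch: cosine-similarity retrieval plus Precision@K / Recall@K.
# query_embs, doc_embs, and relevant are toy stand-ins for real model outputs.
import numpy as np

rng = np.random.default_rng(0)
query_embs = rng.normal(size=(5, 256))     # 5 queries
doc_embs = rng.normal(size=(100, 256))     # 100 documents
relevant = rng.integers(0, 2, size=(5, 100)).astype(bool)   # ground-truth labels

# Cosine similarity = dot product of L2-normalized vectors.
q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
sim = q @ d.T                               # (queries, documents)

k = 10
topk = np.argsort(-sim, axis=1)[:, :k]      # indices of the k most similar documents

precision_at_k = np.mean([relevant[i, topk[i]].mean() for i in range(len(q))])
recall_at_k = np.mean([
    relevant[i, topk[i]].sum() / max(relevant[i].sum(), 1) for i in range(len(q))
])
print(f"Precision@{k}: {precision_at_k:.3f}  Recall@{k}: {recall_at_k:.3f}")
```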

Results:

Feature Similarity Experiments

Objective:

Evaluate the usefulness of embeddings in differentiating between various document types.

Methodology:

  1. Prototype Vectors Generation: Created prototype vectors for each document category (e.g., invoices, contracts, reports) representing the “center” of the embedding space for that class.
  2. Similarity with Cluster Centers: Compared new document embeddings to these prototype vectors using cosine similarity, classifying each document into the category with the highest similarity score.
  3. Evaluation Metrics: Used metrics such as Accuracy and Recall to assess how well the embeddings captured similarities within categories and differences between them.
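The nearest-prototype classification described above can be sketched as follows, assuming per-category document embeddings are already available; the category names, shapes, and random embeddings are placeholders for illustration.

```python
# Sketch: nearest-prototype classification of document embeddings.
# Embeddings are random stand-ins for outputs of a multimodal embedding model.
import numpy as np

rng = np.random.default_rng(0)
categories = ["invoice", "contract", "report"]           # illustrative classes
train = {c: rng.normal(size=(50, 256)) for c in categories}

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# 1. Prototype vector = normalized mean embedding of each category.
prototypes = normalize(np.stack([normalize(train[c]).mean(axis=0) for c in categories]))

# 2. Classify new documents by their most similar prototype (cosine similarity).
new_docs = normalize(rng.normal(size=(10, 256)))
sims = new_docs @ prototypes.T                            # (docs, categories)
predicted = [categories[i] for i in sims.argmax(axis=1)]
print(predicted)
```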

Findings:

Interactive Embedding Visualization

Below is an interactive t-SNE plot showing document embeddings colored by category. This visualization provides insights into how different models (e.g., VISTA, ColPali, SigLIP) represent document categories (e.g., forms, invoices, identity documents). The embeddings are displayed in a two-dimensional space, highlighting clustering patterns and category separations.

Figure 8: Interactive t-SNE plot of document embeddings colored by category.

This additional visualization allows for a deeper understanding of how well multimodal models differentiate between document types, bridging the modality gap through effective embedding representations.
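A plot of this kind can be produced with scikit-learn's t-SNE on the document embeddings. The sketch below assumes an embeddings array and a list of category labels are already in memory; the random data and label names are placeholders.

```python
# Sketch: 2-D t-SNE projection of document embeddings, colored by category.
# `embeddings` and `labels` are stand-ins for real model outputs and metadata.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 256))                  # assumed shape
labels = rng.choice(["form", "invoice", "identity"], size=300)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for category in np.unique(labels):
    mask = labels == category
    plt.scatter(coords[mask, 0], coords[mask, 1], s=10, label=category)
plt.legend()
plt.title("t-SNE of document embeddings by category")
plt.show()
```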

Experimental Evaluations

To evaluate the performance of multimodal retrieval models, we utilized the ViDoRe Benchmark, a comprehensive collection designed for assessing document retrieval using visual features. This benchmark includes datasets formatted in a Question-Answering (QA) style to simulate realistic retrieval scenarios.

These datasets encompass a wide range of document types, including financial reports, legal documents, academic papers, manuals, and healthcare records. Each dataset presents unique challenges due to varying content complexity, layouts, and modality combinations.

Performance Comparison:

The table below summarizes the performance of various models across the ViDoRe datasets, measured by normalized Discounted Cumulative Gain at rank 5 (nDCG@5).

| Model Name | Average | TAT-DQA | Shift Project | Artificial Intelligence | Government Reports | ArxivQA | DocVQA | Healthcare Industry | InfoVQA | Energy | TabFQuad |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ColQwen2 | 89.3 | 81.4 | 90.7 | 99.4 | 96.3 | 88.1 | 60.6 | 98.1 | 92.6 | 95.9 | 89.5 |
| ColPali | 81.3 | 65.8 | 73.2 | 96.2 | 92.7 | 79.1 | 54.4 | 94.4 | 81.8 | 91.0 | 83.9 |
| VISTA* | 70.8 | 56.9 | 78.6 | 86.8 | 89.3 | 39.4 | 32.2 | 91.1 | 75.0 | 87.7 | 71.2 |
| E5-Large* | 65.0 | 51.0 | 61.1 | 87.9 | 84.8 | 34.0 | 27.8 | 85.5 | 63.5 | 81.6 | 73.1 |
| BGE-M3* | 67.0 | 43.8 | 73.1 | 88.8 | 80.4 | 35.7 | 32.9 | 91.3 | 71.9 | 83.3 | 69.1 |
| BM25 | 65.5 | 62.7 | 64.3 | 92.8 | 83.9 | 31.6 | 36.8 | 87.2 | 62.9 | 85.9 | 46.5 |
| SigLIP | 51.4 | 26.2 | 18.7 | 62.5 | 66.1 | 43.2 | 30.3 | 79.1 | 64.1 | 65.7 | 58.1 |
| Jina-CLIP | 17.7 | 3.3 | 3.8 | 15.2 | 21.4 | 25.4 | 11.9 | 20.8 | 35.5 | 19.7 | 20.2 |

Note: Models marked with * (e.g., VISTA, E5-Large, BGE-M3) have been re-evaluated on the ViDoRe Benchmark.

Observations:

  • Vision-language retrievers dominate: ColQwen2 (89.3 average nDCG@5) and ColPali (81.3) clearly outperform every text-only and CLIP-style baseline.
  • Text-centric models (BM25, E5-Large, BGE-M3, VISTA) remain competitive on text-heavy datasets but drop sharply on visually dense ones such as ArxivQA.
  • General-purpose image-text models (SigLIP, Jina-CLIP) perform worst overall, suggesting that contrastive image-text pretraining alone is not sufficient for document retrieval.

Conclusion

The shift from unimodal to multimodal approaches in document retrieval is revolutionizing how we access complex information. Models like VISTA, ColPali, and ColQwen2 not only bridge the modality gap but also set new benchmarks for performance across diverse and complex datasets.

Key Takeaways:

  1. Multimodal Models Excel: Combining textual and visual features significantly improves retrieval accuracy, especially for documents with complex layouts.
  2. Advanced Models Address Modality Challenges: Models like ColPali and ColQwen2 create unified embedding spaces, allowing seamless integration of different data types.
  3. Importance of Domain Adaptation: High-performing models adapt well to various document types and domains, effectively handling specific jargon and layouts.
  4. Interpretability Matters: Interpretability is essential for user trust and compliance, with models like ColPali providing transparent retrieval processes through attention-based heatmaps.
  5. Innovative Evaluation is Crucial: New benchmarking strategies and evaluation metrics are vital for assessing the strengths of multimodal models in complex retrieval tasks.

“The modality gap isn’t just being bridged—it’s being obliterated.”

“The future is bright, and it’s multimodal.”
