Recent advancements in unified models for multimodal understanding and generation, including works like Transfusion, Next-GPT, and EMU3, highlight the trend towards frameworks capable of diverse tasks. These models typically fall into two categories: single models, which use a unified architecture to process multiple modalities, and multi-expert models, where specialized sub-models are used for each modality. Multi-expert models include different alignment strategies such as Image-Centric Alignment (e.g., ImageBind), Text-Centric Alignment (e.g., Next-GPT), and others, each focusing on aligning specific modalities for more flexible integration. Single models explore techniques like pure autoregressive methods, diffusion-based approaches, or a combination of both. This blog provides a comprehensive overview of unified multimodal models, reviewing current developments and discussing key design principles, including the use of autoregressive and diffusion mechanisms.
“Unity is strength... when there is teamwork and collaboration, wonderful things can be achieved.” — Mattie Stepanek
In recent years, the field of multimodal understanding and generation has seen significant advancements, particularly with the rise of unified models capable of addressing a wide range of tasks. Notable examples include Meta's Transfusion, as well as Next-GPT and EMU3.
This concept of unity resonates deeply in the context of unified models for multimodal understanding and generation. This blog aims to explore the recent developments in unified multimodal models, reviewing their current state and outlining future directions for research and application. We will delve into the principles behind autoregressive and diffusion models, shedding light on their unique characteristics and how they can be effectively combined. Ultimately, we will discuss whether the future of unified models lies in one of these paradigms, or if a hybrid approach holds the key to even greater advancements.
Autoregressive (AR) sequence generation is a method where a sequence is generated token by token, with each token predicted based on the preceding ones. This approach models dependencies within the sequence by conditioning each output on prior outputs, effectively capturing the structure and patterns of sequential data.
Definition. For a data sequence \((x_1, x_2, ..., x_{T})\), an autoregressive model represents the joint probability as:
\[P(x_1, x_2, ..., x_T) = \prod_{t=1}^{T} P(x_t | x_1, x_2, ..., x_{t-1})\]
This formula captures the essence of autoregressive generation: predicting each token based on the prior sequence of tokens.
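To make the factorization concrete, here is a minimal sketch that evaluates the joint log-probability as a sum of conditional log-probabilities and samples a sequence token by token. The `next_token_logits` stand-in is purely hypothetical; in practice it would be a trained network such as a Transformer decoder.

```python
# Minimal sketch of autoregressive factorization and sampling.
# `next_token_logits` is a dummy stand-in for a trained model.
import math
import random

VOCAB_SIZE = 8

def next_token_logits(prefix):
    # Hypothetical model: returns uniform scores regardless of the prefix.
    return [0.0] * VOCAB_SIZE

def log_prob(sequence):
    """log P(x_1..x_T) = sum_t log P(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(sequence):
        logits = next_token_logits(sequence[:t])
        log_z = math.log(sum(math.exp(l) for l in logits))  # normalizer
        total += logits[token] - log_z
    return total

def sample(length):
    """Generate token by token, each conditioned on the prefix so far."""
    seq = []
    for _ in range(length):
        logits = next_token_logits(seq)
        probs = [math.exp(l) for l in logits]
        z = sum(probs)
        seq.append(random.choices(range(VOCAB_SIZE), weights=[p / z for p in probs])[0])
    return seq

print(log_prob([1, 2, 3]), sample(5))
```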
Why Choose Autoregressive Models for Unified Multimodal Models?
Unified Data Representation: Autoregressive models can integrate different modalities (e.g., text, images, audio) by converting them into a unified sequence of tokens, simplifying model design.
Easier to Scale Up: Autoregressive models align well with scaling laws, making them easier to scale up in terms of model size, data, and computational resources.
Extensive Experience and Strong Infrastructure: Standing on the shoulders of giants (LLMs), both academia and industry have accumulated substantial experience and tooling for training and scaling AR models.
Autoregressive for Vision. The process of generating an image can be broken down pixel-by-pixel or patch-by-patch in a raster-scan order, where each pixel/patch is conditioned on previously generated content. Autoregressive models have been shown to be simple, interpretable, and effective in processing language. Images, however, are not inherently sequential. Moreover, treating an image as a flat sequence means that the attention computation grows quadratically with the sequence length, which itself scales with image resolution.
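As a small illustration of the raster-scan ordering described above (the grid size is arbitrary), a 2D grid of patch tokens is flattened row by row so a standard AR model can condition patch $t$ on patches $< t$:

```python
# Sketch of raster-scan ordering for image patches (hypothetical 4x4 grid).
import numpy as np

H, W = 4, 4                              # grid of patch tokens
grid = np.arange(H * W).reshape(H, W)    # token ids laid out spatially
raster_sequence = grid.flatten()         # row-major (raster-scan) order

print(raster_sequence)                   # [ 0  1  2 ... 15]
# Self-attention over this flat sequence costs O((H*W)^2),
# the quadratic growth mentioned above.
```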
Diffusion models are a class of probabilistic generative models used to synthesize data by modeling its distribution through a gradual process of noise addition and removal. These models have shown impressive results in image, audio, and video generation tasks.
Forward Process. The forward process is a Markov chain where Gaussian noise is iteratively added to the data $\mathbf{x}_0$, resulting in a series of noisy representations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$:
\[q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}),\]
where $\beta_t$ is the noise schedule and $\mathcal{N}$ denotes the Gaussian distribution.
Reverse Process. The reverse process learns to denoise $\mathbf{x}_T$ back to $\mathbf{x}_0$ through a neural network $p_\theta$, parameterized as:
\[p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t)).\]
During sampling, we start from pure Gaussian noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively sample $\mathbf{x}_{t-1}$ from $p_\theta$ until $\mathbf{x}_0$ is reconstructed.
Loss Function. The neural network $\epsilon_\theta$ is trained to predict the noise added at each step $t$, using the following loss function:
\[L(\theta) = \mathbb{E}_{\mathbf{x}_0, \epsilon, t}\left[ \| \epsilon - \epsilon_\theta(\mathbf{x}_t, t) \|^2 \right],\]
where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the actual noise, $\mathbf{x}_t$ is the noisy data at step $t$, and $\epsilon_\theta(\mathbf{x}_t, t)$ is the model's noise prediction.
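The training objective can be sketched in a few lines. The tiny MLP denoiser, data shapes, and linear noise schedule below are illustrative placeholders (a real system would use, e.g., a U-Net or Transformer denoiser):

```python
# Minimal DDPM-style training loss: noise clean data x0 via the closed-form
# forward process, then regress the injected noise with a small network.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

denoiser = nn.Sequential(nn.Linear(16 + 1, 64), nn.ReLU(), nn.Linear(64, 16))

def diffusion_loss(x0):
    t = torch.randint(0, T, (x0.shape[0],))              # random timestep per sample
    eps = torch.randn_like(x0)                           # actual noise
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # noisy sample x_t
    t_embed = (t.float() / T).unsqueeze(-1)              # crude timestep conditioning
    eps_pred = denoiser(torch.cat([x_t, t_embed], dim=-1))
    return ((eps - eps_pred) ** 2).mean()                # ||eps - eps_theta(x_t, t)||^2

x0 = torch.randn(8, 16)   # stand-in "clean" data (batch of 8, dimension 16)
print(diffusion_loss(x0).item())
```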
Why Involve Diffusion Models for Unified Multimodal Models?
Modeling Spatial Distributions: Diffusion models are well-suited for modeling complex spatial data distributions, such as images and videos, by gradually refining noise into structured outputs.
Faster Inference Compared to Autoregressive Models: Diffusion models often achieve faster inference because they generate images or videos in parallel over only a few denoising steps, unlike autoregressive models that generate token by token.
Unified multimodal models represent a pivotal advancement in artificial intelligence, aiming to integrate and process multiple data modalities (e.g., text, images, audio, and video) within a single model framework. These models are designed for understanding and generating data across modalities, offering flexibility and efficiency that surpass traditional task-specific or modality-specific approaches.
Key Concepts in Unified Multimodal Models:
A key design choice in unified models is the method of aligning modalities:
1) Cross-Modal Alignment. Aligning the feature spaces of diverse modalities into a unified latent space. Formally, given samples $x_i$ and $x_j$ from different modalities, the goal is to minimize their alignment loss $\mathcal{L}_{\text{align}}$:
\[\mathcal{L}_{\text{align}} = \| f_{\text{Unified Model}}(x_i) - f_{\text{Unified Model}}(x_j) \|_2^2.\]
2) Generative Alignment. Models such as Show-O and Transfusion bypass explicit alignment by directly using generative losses to learn the relationships between modalities. This approach captures inter-modality relationships through tasks like sequence prediction or output reconstruction, focusing on generation rather than explicit feature alignment. For source-target modality pairs such as $(x_{\text{src}}, x_{\text{tgt}})$, the overall loss is a weighted sum of generative losses over all tasks:
\[\mathcal{L}_{\text{unified}} = \sum_{k=1}^K \lambda_k \mathcal{L}_{\text{gen}}^k,\]
where $K$ is the number of modality pairs, $\mathcal{L}_{\text{gen}}^k$ is the generative loss for the $k$-th modality pair, and $\lambda_k$ is a weight balancing the importance of each task.
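A minimal sketch of this weighted objective; the modality pairs, loss values, and weights are purely illustrative:

```python
# Weighted sum of per-pair generative losses, lambda_k * L_gen^k.
gen_losses = {                # L_gen^k for each modality pair (hypothetical values)
    "text->image": 0.82,
    "image->text": 0.47,
    "text->audio": 1.10,
}
weights = {"text->image": 1.0, "image->text": 0.5, "text->audio": 0.25}  # lambda_k

unified_loss = sum(weights[k] * gen_losses[k] for k in gen_losses)
print(unified_loss)
```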
From the perspective of model architectures, unified multimodal systems can generally be categorized into two main types: Autoregressive Models and Mixed Architectures (AR+Diffusion):
1. Autoregressive Models. Predict the next token or step $x_{t+1}$ based on past inputs $x_{\leq t}$; examples include EMU3.
2. Mixed Architectures. Combine autoregressive and diffusion methods, leveraging their respective strengths, as seen in models like Show-o.
Feature | Single Models | Multi-Expert Models |
---|---|---|
Architecture | Unified model with shared parameters | Specialized sub-models for each modality |
Scalability | Less flexible for adding new modalities | Easily extendable by adding new experts |
Performance | Balanced across modalities | Optimized for each modality individually |
Complexity | Simpler pipeline | Higher due to fusion and coordination |
Use Cases | General-purpose multimodal tasks | Tasks requiring high modality-specific performance |
The key difference between Single Models and Multi-Expert Models lies in their approach to handling modalities. Single models use a unified architecture with shared parameters to process all modalities in a common space, offering simplicity and generalization but limited flexibility for adding new modalities. A great example of this is Chameleon, which processes all modalities using a single unified autoregressive architecture in the same feature space.
In contrast, Multi-Expert Models rely on specialized sub-models or experts tailored for each modality, providing better modality-specific performance and scalability. For instance, ImageBind employs separate pre-trained encoders for different modalities (e.g., images, text, audio) to extract features independently and aligns them into a shared embedding space for optimization. This approach allows it to leverage the strengths of pre-trained models while maintaining flexibility for each modality. However, the reliance on multiple encoders adds complexity due to the need for fusion and alignment mechanisms.
For the single-model paradigm, existing methods can primarily be categorized into two aspects: model architecture and feature representation, as shown in the table below:
Architecture\Feature | Discrete | Continuous |
---|---|---|
Autoregressive | LWM | MMAR |
AR+Diffusion | Show-o | Transfusion |
The fundamental difference between Autoregressive Models (e.g., EMU3, Chameleon) and AR+Diffusion Models (e.g., Show-o, Transfusion) lies in their generative approach. Autoregressive models sequentially predict the next token across all modalities, while AR+Diffusion models combine autoregressive generation for discrete tokens (e.g., text) with diffusion processes for continuous data (e.g., images and videos).
Autoregressive models and mixed architectures (such as AR+Diffusion) differ primarily in the type of attention mechanisms they employ, which significantly impacts their performance and the way they handle modality alignment:
AR - Causal Attention: Uses causal masks in which each token attends only to preceding tokens. This creates a strict unidirectional flow of information.
Diffusion - Full Attention: Uses full (bi-directional) attention masks, allowing tokens to attend to both past and future tokens. This enables a richer, more flexible way of modeling the relationships between different parts of the input.
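A small sketch contrasting the two mask types (sequence length and scores are arbitrary):

```python
# Causal (AR) vs. full (bidirectional) attention masks.
import torch

seq_len = 6
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # lower-triangular
full_mask = torch.ones(seq_len, seq_len).bool()                # every position visible

scores = torch.randn(seq_len, seq_len)  # stand-in for QK^T / sqrt(d)
causal_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)
full_attn = torch.softmax(scores.masked_fill(~full_mask, float("-inf")), dim=-1)

print(causal_attn[0])   # the first token can only attend to itself
print(full_attn[0])     # the first token attends to all positions
```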
The modeling differences and the respective strengths and weaknesses of these two approaches are outlined below:
Aspect | Autoregressive Models | AR+Diffusion Models |
---|---|---|
Generative Strategy | Iteratively predicts the next token. | Predicts all tokens at once via iterative denoising. |
Generative Fidelity | Effective for token-based sequences (e.g., text). | Superior for high-fidelity image and video generation. |
Unified Framework | Simplifies multimodal unification via tokenization. | Balances autoregressive and diffusion processes. |
Training Complexity | Relatively efficient. | More computationally demanding due to diffusion. |
Scalability | Scales well to large multimodal datasets. | Requires careful balancing of AR and diffusion. |
In a unified model, discrete values refer to categorical data (e.g., tokens or words) predicted sequentially, while continuous values involve real-valued data (e.g., pixels or audio signals) that are refined through a denoising process to generate high-quality outputs.
Aspect | Discrete (e.g., Text) | Continuous (e.g., Images, Audio) |
---|---|---|
Data Type | Categorical (text tokens). | Real-valued (pixels, audio signals, etc.). |
Primary Focus | Token-level generation (e.g., text prediction). | Continuous signal refinement (e.g., image generation). |
Training Complexity | High due to long sequences. | Moderate but requires sophisticated alignment. |
Encoding Approach | Requires codebook for quantization, potential errors. | No codebook, direct continuous embeddings, avoids quantization errors. |
Autoregressive-based Models with Discrete Valued Tokenizer. Autoregressive-based models with discrete-valued tokenizers, such as Chameleon and EMU3, leverage a unified tokenization framework to process diverse modalities such as text, images, and video. These models transform multimodal inputs into discrete token sequences, enabling a shared representation for both generation and understanding tasks.
Advantage - Unified Tokenization. All input data, regardless of modality, is converted into discrete tokens using techniques like VQ-VAE.
Limitations - Inference Speed. Token-by-token generation leads to slower inference times, particularly for long sequences like video.
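For intuition, here is a minimal sketch of the discrete tokenization step such models rely on, i.e., a VQ-style nearest-codebook lookup; the codebook size and feature dimension are illustrative, not those of any particular tokenizer:

```python
# VQ-style quantization: map continuous patch features to nearest codebook entries.
import torch

codebook = torch.randn(1024, 256)        # 1024 code vectors of dimension 256
patch_features = torch.randn(16, 256)    # encoder outputs for 16 image patches

distances = torch.cdist(patch_features, codebook)  # (16, 1024) pairwise L2 distances
token_ids = distances.argmin(dim=-1)               # one discrete token per patch
quantized = codebook[token_ids]                    # embeddings fed back to the decoder

print(token_ids)   # the image expressed as "words" in the shared AR vocabulary
```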
Autoregressive-based Models with Continuous Valued Tokenizer. Autoregressive models with continuous-valued tokenizers, such as MMAR, introduce an alternative approach to unified multimodal modeling by representing data as continuous latent representations instead of discrete tokens. This method offers unique advantages in modeling high-dimensional data like images and videos while maintaining the flexibility of autoregressive frameworks.
Advantage - Continuous Representations. Unlike discrete tokenization, these models use continuous embeddings to represent inputs, providing a richer and more compact encoding of complex modalities like video and high-resolution images.
Limitations - Task Flexibility: While excellent for understanding tasks, these models may require additional tuning to handle diverse generative tasks effectively.
Mixed Architectures with Discrete Valued Tokenizer. Mixed architectures that combine autoregressive (AR) and diffusion models, such as Show-o, represent an innovative approach to unified multimodal modeling. These architectures leverage the strengths of both AR and diffusion frameworks while operating on discrete-valued tokenized inputs, aiming to address the limitations of each individual method.
Advantage - Unified Tokenization for Text and Image/Video. Both AR and diffusion processes operate on tokenized representations, enabling seamless integration of diverse modalities such as text, images, and video within the same framework. Additionally, diffusion models excel at modeling spatial distributions, making them particularly effective for image and video generation tasks. Furthermore, inference with diffusion models tends to be faster because they process data in parallel, unlike autoregressive models that predict tokens sequentially.
Limitations - Computational Overhead and Model Complexity. Combining AR and diffusion significantly increases training and inference costs due to the sequential nature of AR and the iterative steps of diffusion. The hybrid approach introduces architectural complexity, making optimization and implementation more challenging.
Mixed Architectures with Continuous Tokenizer. Mixed architectures utilizing continuous tokenizers, such as Transfusion and MonoFormer, represent a cutting-edge approach to unified multimodal modeling. By combining autoregressive (AR) and diffusion frameworks with continuous latent representations, these models aim to balance efficiency, scalability, and high-quality performance across generation and understanding tasks.
Advantage - Continuous Tokenization and Diffusion Model. Inputs are represented as compact continuous-valued embeddings rather than discrete tokens, offering a more flexible and efficient approach to handling high-dimensional multimodal inputs. Additionally, diffusion models, with their iterative denoising process, are particularly well-suited for generation tasks in image and video modalities.
Limitation - Training Complexity. The hybrid nature of these architectures requires careful optimization, as interactions between AR and diffusion components in continuous space can be challenging to balance.
Unlike single, monolithic models trained to handle all modalities and tasks simultaneously, multi-expert architectures offer an alternative approach: leveraging specialized expert modules that align, process, and fuse information across diverse modalities. These architectures not only enable task-specific optimization but also facilitate the integration of pre-trained expert models, such as incorporating external capabilities into frameworks like ImageBind. Multi-experts are typically categorized based on their alignment focus: Image-Centric Alignment, Text-Centric Alignment, and Generalized Alignment methods.
Aspect | Image-Centric (e.g., ImageBind) | Text-Centric (e.g., Next-GPT) | Generalized (e.g., UniBind) |
---|---|---|---|
Alignment Focus | Visual-first | Text-first | Balanced across all modalities |
Integration Capability | Fuses pre-trained visual-centric models | Leverages pre-trained language models | Incorporates multi-expert pre-trained modules |
Strengths | Robust spatial and visual correlations | Strong text-based reasoning and generation | Versatile, supports diverse tasks |
Limitations | Limited in text/audio-heavy tasks | Struggles with purely visual or auditory tasks | Increased computational complexity |
Image-centric alignment refers to the use of images as the core pivot to connect various data modalities. Images are highly structured and versatile representations that can be naturally associated with other modalities like text, audio, or sensor data.
Here’s how ImageBind achieves this:
Cross-Modal Alignment: Each modality (text, audio, sensor data, etc.) is independently encoded into the same shared embedding space. The model does not require direct pairwise data between all modalities. Instead, it binds each modality to images, which indirectly links the modalities.
InfoNCE Loss: InfoNCE loss is used to align embeddings of different modalities. For instance, it maximizes the similarity between embeddings of an image and its corresponding text while minimizing the similarity with unrelated texts.
The InfoNCE loss is defined as:
\[\mathcal{L}_{\text{InfoNCE}} = - \log \frac{\exp(\text{sim}(q, k^+)/\tau)}{\sum_{i=1}^N \exp(\text{sim}(q, k_i)/\tau)},\]
where $q$ is the query embedding (e.g., an image), $k^+$ is the embedding of its matching sample from another modality, $k_i$ ranges over candidate embeddings including negatives, $\text{sim}(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity), $\tau$ is the temperature hyperparameter, and $N$ is the number of candidates.
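A minimal sketch of this loss for image-text alignment, where the other texts in a batch serve as negatives (embedding shapes are illustrative):

```python
# InfoNCE over a batch: the positive key sits on the diagonal of the
# similarity matrix; all other batch entries act as negatives.
import torch
import torch.nn.functional as F

def info_nce(queries, keys, tau=0.07):
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / tau                 # cosine similarity / temperature
    targets = torch.arange(q.shape[0])       # positives on the diagonal
    return F.cross_entropy(logits, targets)  # -log softmax of the positive pair

image_emb = torch.randn(32, 512)   # batch of image embeddings (queries q)
text_emb = torch.randn(32, 512)    # paired text embeddings (keys k)
print(info_nce(image_emb, text_emb).item())
```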
Text-Centric Alignment is an approach where text embeddings act as the anchor or central hub for aligning different modalities like images, audio, and video. It relies on pre-trained large language models (LLMs) to generate text representations that allow other modalities to be mapped into a common embedding space. The alignment between modalities is achieved by projecting them into the text space and optimizing for similarity.
Text-Centric Alignment Formulation:
The text $T$ is first encoded into an embedding $\mathbf{t}$ using the text encoder of a pre-trained LLM:
\[\mathbf{t} = \text{TextEncoder}(T)\]
Similarly, audio $A$ is encoded as $\mathbf{a}$ using an audio encoder:
\[\mathbf{a} = \text{AudioEncoder}(A)\]
Each non-text embedding is then aligned to the text anchor with an InfoNCE-style objective:
\[\mathcal{L}_{\text{align}} = - \log \frac{\exp(\text{sim}(\mathbf{t}, m)/\tau)}{\sum_{i=1}^N \exp(\text{sim}(\mathbf{t}, m_i)/\tau)},\]
where $m$ represents the embedding of a modality (image, audio, etc.), $\text{sim}(\cdot, \cdot)$ is the similarity function (e.g., cosine similarity), $\tau$ is the temperature hyperparameter that controls the sharpness of the distribution, and $N$ is the number of possible modality samples.
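A small sketch of this text-centric setup, assuming a learned projection that maps audio embeddings into the text embedding space; the encoders are replaced by random tensors for illustration:

```python
# Align a non-text modality to frozen text anchors with an InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_emb = torch.randn(32, 512)        # anchors t from a pre-trained text encoder
audio_emb = torch.randn(32, 768)       # raw outputs a = AudioEncoder(A)
project_to_text = nn.Linear(768, 512)  # learned modality-to-text projection

m = F.normalize(project_to_text(audio_emb), dim=-1)
t = F.normalize(text_emb, dim=-1)
logits = m @ t.t() / 0.07              # sim(t, m) / tau over the batch
loss = F.cross_entropy(logits, torch.arange(32))
print(loss.item())
```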
Generalized Alignment is an approach to multimodal learning that does not center any single modality, like text or image, but instead creates a unified space for all modalities based on a knowledge base or feature-centered approach. This method aims to align different modalities by utilizing shared characteristics derived from a broader knowledge base, rather than anchoring on one modality’s features.
In generalized alignment, the learning model is trained to map various modalities (e.g., text, images, audio) into a shared feature space without giving preference to one modality over others. The alignment is based on common semantic features or concepts that exist across modalities, allowing the model to learn more flexible, cross-modal relationships.
Formulation for Generalized Alignment:
\[\mathcal{L}_{\text{general}} = - \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(M_i, K_i)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(M_i, K_j)/\tau)},\]
where $M_i$ is the embedding of modality $i$ (image, text, audio), $K_i$ is the corresponding knowledge-base embedding for that modality, $\text{sim}(\cdot, \cdot)$ is the similarity function (e.g., cosine similarity), $\tau$ is the temperature hyperparameter that controls the sharpness of the distribution, and $N$ is the number of modality samples.
Architecture Design Uncertainty: It remains unclear whether purely autoregressive models, autoregressive + diffusion hybrids, or alternative architectures will ultimately prevail.
Cross-modal Representation Alignment: Aligning representations across diverse modalities (e.g., text, image, video, and audio) is non-trivial due to their fundamentally different structures.
Multimodal Dataset Diversity: Collecting and curating datasets that cover all relevant modalities, tasks, and domains is resource-intensive and often biased toward specific modalities (e.g., text-heavy datasets).
Data Quality and Noise: Large multimodal datasets often contain noisy or misaligned data, leading to suboptimal performance or spurious correlations.
Modality Imbalance: Some modalities, such as text, have abundant training data, while others, like video or audio, are relatively underrepresented, making balanced training challenging.
Balancing Task Generality and Specificity: In certain specialized downstream tasks, the performance of current unified models falls significantly short compared to large models specifically designed and optimized for those tasks. For example, models like EMU3 and Chameleon currently appear to be far less capable than SORA and other specialized video generation models.
Output Modality Compatibility: Generating compatible outputs across multiple modalities (e.g., synchronizing video frames with audio and text) can be difficult.
Unified multimodal models hold immense potential to revolutionize AI by integrating diverse modalities and tasks into a single framework, enabling unprecedented versatility and generalization. However, realizing this vision comes with significant challenges, including unresolved trade-offs in architecture design, the complexity of aligning diverse data, and balancing efficiency with scalability. While current models demonstrate promising progress, their performance on specialized tasks often lags behind domain-specific models, highlighting the need for further innovation. Future advancements will likely depend on the development of novel architectures that harmonize flexibility, task-specific optimization, and computational efficiency, paving the way for more robust and capable unified models.