Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Vision language models (VLMs), such as GPT-4o, have rapidly evolved, demonstrating impressive capabilities across diverse tasks. However, much of the progress in this field has been driven by engineering efforts, with a limited understanding of how these models work. The lack of scientific insight poses challenges to further enhancing their robustness, generalization, and interpretability, especially in high-stakes settings. In this work, we systematically review the use of mechanistic interpretability methods to foster a more scientific and transparent understanding of VLMs. Specifically, we examine five prominent techniques: probing, activation patching, logit lens, sparse autoencoders, and automated explanation. We summarize the key insights these methods provide into how VLMs process information and make decisions. We also discuss critical challenges and limitations that must be addressed to further advance the field.
Vision language models (VLMs), such as GPT-4V or LLaVA, have achieved remarkable success across a wide range of tasks, including image captioning, visual question answering (VQA), and multimodal reasoning. These advancements have driven innovation in diverse fields, such as virtual assistants, autonomous robotics, and medical diagnostics. However, despite their rapid adoption, the internal mechanisms of these models remain largely opaque, raising significant concerns about their reliability, robustness, and interpretability—particularly in high-stakes applications.
Interpretability research offers a promising path to address these challenges. Mechanistic interpretability, in particular, seeks to uncover the inner processes of neural networks and explain how specific outputs are generated. By applying these techniques to VLMs, researchers can gain valuable insights into how these models represent, process, store, and integrate visual and linguistic information, advancing both theoretical understanding and practical utility.
In this work, we examine how mechanistic interpretability methods can illuminate the inner workings of VLMs. We review five key techniques—probing, activation patching, logit lens, sparse autoencoders, and automated explanations—detailing their mechanisms, applications, and the insights they provide through concrete examples. These methods help answer critical questions, such as what information is encoded in VLM representations, how and when visual and linguistic modalities are integrated, and how individual neurons contribute to the model’s decision-making process.
Additionally, we discuss the limitations of current interpretability methods and highlight five key directions for future research: developing approaches that are more generalizable, scalable, vision-centric, dynamic, and capable of macro-level analysis. For instance, the heterogeneity of VLMs calls for interpretability methods that can adapt across diverse models, and the micro-level focus of mechanistic interpretability needs to be complemented by a macro-level perspective for a broader understanding. By addressing these challenges, we aim to pave the way for more transparent, reliable, and capable vision language models.
Current Methods
In this section, we review mechanistic interpretability methods applied to vision language models (VLMs). These methods aim to uncover how VLMs process visual and language information and to explain how they produce specific outputs. Key techniques discussed include probing, activation patching, logit lens analysis, sparse autoencoders, and automated explanations.
Probing
Illustration of probing in neural networks: a simple classifier is trained on intermediate representations of a vision language model to predict target properties, revealing the model’s encoding strengths and limitations.
What is Probing
Probing is a diagnostic technique used to analyze the internal representations of neural networks. It helps researchers identify whether specific types of information are encoded within the model by training auxiliary classifiers—referred to as probes—on the model’s intermediate outputs. This method is particularly useful for understanding what a model has learned and how it organizes information across its layers.
How Probing Works
Probing involves training supervised classifiers, typically simple ones like linear probes, to predict specific properties from the internal representations of a model. These properties can include linguistic, visual, or multimodal features. The probe’s performance indicates whether the target information is accessible in the model’s representations:
High accuracy: Suggests the property is well-encoded.
Low accuracy: Indicates the property may be absent or deeply entangled.
Linear probes are often preferred because their simplicity ensures that high accuracy reflects the quality of the model’s representations, rather than the complexity of the probe itself.
Example
To illustrate, consider a vision language model analyzing the caption: “A cat sitting on a mat.”
A probe could be trained to predict spatial relationships (e.g., object positioning) from intermediate representations of the model.
Another probe might predict visual attributes, such as “furry” or “striped,” encoded in the text embeddings.
If the probe achieves high accuracy, it suggests that these properties (spatial relationships or visual attributes) are captured in the model’s representations. If accuracy is low, the information may be missing or insufficiently disentangled to be easily accessed.
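To make this concrete, the sketch below trains a linear probe on cached VLM representations. It assumes the intermediate activations and property labels have already been extracted; the file names and label set are hypothetical placeholders, not artifacts from any specific work.

```python
# Minimal linear-probe sketch (illustrative; assumes representations are pre-extracted).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical cached data: one vector per example from a chosen VLM layer,
# plus labels for the property being probed (e.g., spatial relation classes).
hidden_states = np.load("layer12_representations.npy")   # shape: (n_examples, hidden_dim)
labels = np.load("spatial_relation_labels.npy")          # shape: (n_examples,)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

# A linear probe: simple by design, so high accuracy reflects the representation,
# not the capacity of the probe itself.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```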
Key Findings from Existing Works
Most research on probing tasks in VLMs focuses on two primary objectives: identifying the concepts these models struggle to capture and assessing the relative importance of visual and linguistic modalities.
Cao et al. introduced the VALUE (Vision-And-Language Understanding Evaluation) framework, which provides a set of probing tasks for explaining individual layers, heads, and fusion techniques. This study reveals several important insights: pre-trained models often prioritize language over vision in multimodal tasks (modality prioritization); specific attention heads are effective at capturing interactions between visual and linguistic modalities (cross-modal interactions); and visualization of attention mechanisms reveals interpretable relationships, such as object-object interactions in images (attention visualization).
Studies have also explored diverse model capabilities, such as visual semantics, verb processing, numerical reasoning, and spatial reasoning. A notable line of research compared representations at different training stages, such as pre-training versus fine-tuning, using carefully designed datasets to minimize biases.
Method Variants and Limitations
Limitations:
Correlation vs. Causation: High probe accuracy indicates correlation, not causation; the model may encode the information but not actively use it for predictions.
Task Design: Probing tasks must be carefully designed to avoid confounding factors or misleading results.
Model-Specific Dependencies: Probing results are often architecture-dependent, limiting their generalizability across models.
Activation Patching
Activation patching compares model behavior under clean, corrupted, noising, and denoising conditions. The results highlight how noising disrupts and denoising restores key logits (A and B values), demonstrating the method’s utility in identifying critical layers and components.
What is Activation Patching
Activation patching (also known as causal tracing or causal mediation analysis) is an interpretability technique for neural networks. It selectively modifies internal activations while keeping others constant, allowing researchers to investigate how specific components contribute to model behavior. This method provides causal insights, helping identify critical components and potential interventions to improve performance and robustness.
How Activation Patching Works
The activation patching process typically involves five steps:
Save Activations: Record the internal activations of a model when processing clean and corrupted inputs.
Select Target Activations: Identify the specific activations to modify.
Patch Activations: Replace activations from one input (e.g., corrupted) with those from another (e.g., clean).
Rerun the Model: Run the model with patched activations and observe behavioral changes.
Analyze Results: Infer the role of specific components based on how the output changes.
There are two primary ways to apply activation patching:
Denoising Analysis involves taking a corrupted prompt, such as one where Gaussian noise has been added to key embeddings, and replacing its activations with those from a clean prompt. By observing which patched activations restore the clean behavior, researchers can identify the components that are sufficient to correct the corrupted behavior. For example, this technique can reveal layers where key information is integrated or restored during processing.
Noising Analysis, on the other hand, starts with a clean prompt and replaces its activations with those from a corrupted prompt. By determining which patches disrupt the clean behavior, this method pinpoints the components necessary to maintain the correct output. This analysis is particularly useful for identifying which layers or activations play a critical role in preserving the model’s functionality.
Example
Imagine analyzing a Vision language Model (VLM) tasked with identifying objects in an image:
Clean Input: An image of “a cat sitting on a mat.”
Corrupted Input: The same image with Gaussian noise added to the “cat” region.
Steps:
Run both inputs through the model and save activations for all layers.
Replace specific layer activations in the corrupted input with those from the clean input.
Rerun the model with these patched activations.
Observe if replacing activations in certain layers restores the ability to correctly identify “cat.”
If restoring activations in a specific layer consistently fixes errors, this suggests that layer plays a critical role in object recognition.
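A simplified sketch of this procedure using PyTorch forward hooks is shown below. It assumes a generic transformer-style model whose blocks are exposed as model.layers and that is callable as model(inputs); this is an illustration under those assumptions, not the interface of any particular VLM.

```python
# Simplified activation-patching sketch with PyTorch forward hooks.
# Assumes a transformer-like model exposing its blocks as `model.layers`
# and callable as model(inputs) -> logits; details vary by architecture.
import torch

@torch.no_grad()
def run_and_cache(model, inputs, layer_idx):
    """Run the model and cache the output of one layer."""
    cache = {}
    def save_hook(module, inp, out):
        cache["act"] = out
    handle = model.layers[layer_idx].register_forward_hook(save_hook)
    logits = model(inputs)
    handle.remove()
    return logits, cache["act"]

@torch.no_grad()
def run_with_patch(model, inputs, layer_idx, patch_act):
    """Run the model, replacing one layer's output with a cached activation."""
    def patch_hook(module, inp, out):
        return patch_act  # returning a value from a forward hook replaces the output
    handle = model.layers[layer_idx].register_forward_hook(patch_hook)
    logits = model(inputs)
    handle.remove()
    return logits

# Denoising analysis: patch clean activations into the corrupted run and check
# whether the correct answer (e.g., the "cat" token) is restored.
# _, clean_act = run_and_cache(model, clean_inputs, layer_idx=9)
# patched_logits = run_with_patch(model, corrupted_inputs, 9, clean_act)
```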
Key Findings from Existing Works
Visual-Linguistic Integration
Layer-Specific Processing in BLIP: Palit et al. used Gaussian noise patching to analyze BLIP’s processing patterns. They found that image information primarily influences the model’s outputs in specific layers: layer 11 of the question encoder and layers 9-11 of the answer decoder. This observation suggests two possibilities:
The model might primarily combine visual and text information in its later layers
Later layers might play a more decisive role in the final output, while earlier layers provide supporting information
Visual-to-Language Transformation in LLaVA: Neo et al. examined how LLaVA processes visual information. They found that representations at visual token positions change systematically across layers, gradually aligning with interpretable textual concepts. This indicates that VLMs can naturally transform visual information into language-like representations, even without specific visual pretraining.
Architectural Differences: Golovanevsky et al. developed Semantic Image Pairs (SIP), a method that applies concept-level changes to images (e.g., changing a “cat” to a “dog”) to understand how VLMs process meaning. Their analysis revealed:
Cross-attention serves three functions: object detection, object suppression, and outlier suppression
Different architectures have distinct characteristics: (1) LLaVA lacks “text-only” attention heads; (2) BLIP has no “vision-only” heads; (3) Both models use universal heads for cross-modal integration
Layer-wise Information Processing
Early vs. Late Layer Functions: Basu et al. used causal tracing to show that LLaVA primarily retrieves visual information in early layers (1-4), followed by consistent summarization in the final visual tokens. Neo et al. further investigated this through attention knockout experiments, finding that:
Layers 1-10 process broader contextual information
Layers 15-24 focus on extracting specific object details
Notably, they found that blocking visual token attention to the last row had minimal impact, challenging previous theories about intermediate summarization steps
Layer Importance for Model Performance: Initial studies by Gandelsman et al. on CLIP showed that final layers have significant direct effects on model accuracy, while early layer modifications (like removing multihead attention) have minimal impact. Balasubramanian et al. later extended these findings across a broader range of Vision Transformers, confirming the critical role of the final four layers in model performance.
Analytical Tools
Recent analytical tools have significantly enhanced our understanding of VLMs. Ben et al. developed LVLM-Interpret, an interactive tool that combines attention knockout with relevancy mapping and causal graph construction to visualize information flow patterns and identify critical image regions.
Method Variants and Limitations
Variants:
Direct Ablations: A simpler variant where activations are replaced with zeros or dataset means. Zero ablation reveals components critical for network behavior, while mean ablation offers a gentler alternative that keeps activations closer to their typical values.
Path Patching: An extension that traces specific causal pathways through the network, helping understand how information flows between different model components.
Attention Knockout: A specialized form focused on analyzing attention mechanisms by selectively blocking attention patterns between tokens.
Creating Corrupted Inputs:
Text Inputs: Introduce Gaussian noise (GN) or use Symmetric Token Replacement (STR), which replaces tokens with semantically similar alternatives. STR is often preferred because Gaussian noise can push activations off-distribution and disrupt model internals.
Image Inputs: Apply Gaussian noise or use Semantic Image Pairs (SIP) to modify concepts (e.g., change “cat” to “dog”).
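As a minimal illustration of the image-side corruption, the snippet below adds Gaussian noise to a chosen region of an image tensor; the region coordinates and noise scale are arbitrary placeholders.

```python
# Sketch: corrupt one image region with Gaussian noise (for noising/denoising analysis).
import torch

def corrupt_region(image, box, sigma=0.5):
    """image: (C, H, W) tensor; box: (x0, y0, x1, y1) pixel region to corrupt."""
    corrupted = image.clone()
    x0, y0, x1, y1 = box
    corrupted[:, y0:y1, x0:x1] += sigma * torch.randn_like(corrupted[:, y0:y1, x0:x1])
    return corrupted

# Example: add noise over a hypothetical "cat" region before re-running the model.
# corrupted_image = corrupt_region(clean_image, box=(40, 60, 120, 160))
```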
Logit Lens
Logit lens uses the model’s unembedding matrix to extract and interpret predictions from intermediate layers, providing insights into how the model refines its understanding at each stage.
What is Logit Lens
Logit lens is an analytical method used to understand how neural networks refine their predictions layer by layer. By applying the model’s final classification layer (unembedding matrix) to intermediate activations, it projects these activations into vocabulary space. This allows researchers to analyze intermediate predictions, offering insights into the model’s evolving understanding of multimodal inputs.
How Logit Lens Works
The logit lens maps intermediate activations to a sequence of “snapshots” of predictions as they develop across the network’s layers. The process involves:
Extracting activations from each layer of the model.
Applying the unembedding matrix to transform these activations into vocabulary distributions.
Observing how prediction probabilities change from one layer to the next, revealing the model’s internal decision-making process.
Example
Consider a vision language model analyzing an image of “a dog chasing a ball in a park.” Using the logit lens, the prediction evolution might look like this:
Early layers: Predictions are highly uncertain, with terms like “dog,” “animal,” and “pet” receiving similar probabilities.
Middle layers: The model begins refining predictions, focusing on “dog” while maintaining context-related terms such as “park” and “ball.”
Final layers: The model confidently predicts specific relationships like “dog chasing ball” and integrates objects into a coherent scene.
This example illustrates how the logit lens tracks the progression from basic feature recognition to high-level understanding.
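In code, the logit lens amounts to reusing the model’s unembedding matrix on every layer’s hidden state. The sketch below assumes a Hugging Face-style decoder that returns hidden states and exposes a final norm and lm_head; the exact attribute names differ across VLM implementations, so treat this as an illustration under those assumptions.

```python
# Logit-lens sketch: project each layer's hidden states into vocabulary space.
# Assumes a Hugging Face-style decoder supporting output_hidden_states and
# exposing a final norm and `lm_head`; attribute names vary across VLMs.
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, input_ids, position=-1, top_k=5):
    outputs = model(input_ids=input_ids, output_hidden_states=True)
    per_layer_predictions = []
    for layer_idx, hidden in enumerate(outputs.hidden_states):
        # Apply the final normalization and unembedding to an intermediate state.
        state = model.model.norm(hidden[:, position, :])
        logits = model.lm_head(state)
        top = torch.topk(logits, top_k, dim=-1).indices[0].tolist()
        per_layer_predictions.append((layer_idx, tokenizer.convert_ids_to_tokens(top)))
    return per_layer_predictions  # one "snapshot" of top tokens per layer
```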
Key Findings from Existing Works
Concept Distribution Patterns
MMNeuron applies the logit lens to analyze the hidden states of multimodal models such as LLaVA-NeXT and InstructBLIP. Through their analysis of decoded vocabulary distributions, the authors reveal that image tokens generate notably sparser distributions than text tokens. This observation suggests that image representations are encoded as mixtures of concepts rather than direct word mappings.
Representation Evolution
By examining the entropy of these distributions across layers, Huo et al. uncover a distinctive three-stage pattern: initial feature alignment with high entropy, followed by information processing with sharply declining entropy in middle layers, and finally token selection with a slight entropy increase. More recent work further explores how representations at visual token positions evolve through the layers of LLaVA 1.5, finding that late-layer activations at each visual token position correspond to token embeddings describing the object in that token’s original image patch.
Reducing Hallucinations
Building on these insights, Jiang et al. demonstrate practical applications of the logit lens by using it to spatially localize objects and perform targeted edits to VLM’s latent representations. Their approach effectively reduces hallucinations without compromising the model’s overall performance, showcasing how understanding internal representations can lead to concrete improvements in model reliability.
Method Variants and Limitations
Limitations:
The logit lens can be brittle, as it assumes hidden states remain linearly interpretable across all layers, which may not hold for heavily tuned or non-linear models.
It is less effective for analyzing tasks requiring complex reasoning or long-term dependencies.
Sparse Autoencoders
The sparse autoencoder works by mapping input activations into a high-dimensional sparse representation using an encoder and reconstructing the original input through a decoder.
What Are Sparse Autoencoders
Sparse Autoencoders (SAEs) are a neural network-based method designed to disentangle complex internal representations in neural networks by addressing the superposition problem. In superposition, neurons encode multiple overlapping features, which makes interpretation challenging. SAEs mitigate this by mapping representations into a higher-dimensional, sparsely activated space, enabling the extraction of distinct, interpretable features.
How Sparse Autoencoders Work
SAEs use an encoder-decoder structure to transform and reconstruct input representations while enforcing sparsity. Given activations \(z \in \mathbf{R}^d\) from a neural network, the encoder transforms these into a sparse, high-dimensional representation \(h \in \mathbf{R}^{d'}\) (with \(d' \gg d\)), which the decoder then uses to reconstruct the input:
\[ h = \mathrm{ReLU}(W_{\text{enc}} z + b_{\text{enc}}), \qquad \hat{z} = W_{\text{dec}} h + b_{\text{dec}}, \]
trained with the objective
\[ \mathcal{L} = \lVert z - \hat{z} \rVert_2^2 + \lambda \lVert h \rVert_1. \]
The first term ensures accurate reconstruction, while the \(L_1\) norm encourages sparsity, making each dimension of \(h\) correspond to a distinct feature.
Example
Consider a vision language model where internal activations encode multiple concepts (e.g., visual concepts and language semantics). For instance, an activation might simultaneously encode features like “cat” (visual concept) and “playful” (language concept), making interpretation difficult. By applying a Sparse Autoencoder (SAE), these entangled representations are transformed into a high-dimensional sparse space, where each dimension uniquely captures a specific feature, such as “furry animal” or “expressive tone.”
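A minimal PyTorch sketch of such an SAE, matching the objective above, is given below; the expansion factor and sparsity weight are illustrative choices rather than values from any specific work.

```python
# Minimal sparse-autoencoder sketch (illustrative hyperparameters).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # expand to a wider space
        self.decoder = nn.Linear(d_hidden, d_model)   # reconstruct the activation
        self.l1_coeff = l1_coeff

    def forward(self, z):
        h = torch.relu(self.encoder(z))   # sparse, high-dimensional code
        z_hat = self.decoder(h)
        return z_hat, h

    def loss(self, z):
        z_hat, h = self(z)
        recon = (z - z_hat).pow(2).sum(dim=-1).mean()          # reconstruction term
        sparsity = self.l1_coeff * h.abs().sum(dim=-1).mean()  # L1 sparsity penalty
        return recon + sparsity

# Usage: train on cached VLM activations, then inspect which inputs most
# strongly activate each learned feature dimension.
# sae = SparseAutoencoder(d_model=4096, d_hidden=16 * 4096)
# loss = sae.loss(batch_of_activations)
```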
Key Findings from Existing Works
Language Models: SAEs have been successfully applied to large language models such as Claude 3, GPT-4, and LLaMA-3.1, enabling the discovery of distinct patterns in how these models encode syntax, semantics, and other linguistic features.
Vision Transformers (ViTs): Researchers have begun using SAEs to analyze ViTs. Early results suggest that SAEs can extract interpretable image features, such as object boundaries and textures, using less data compared to their application in language models.
However, Sparse Autoencoders have not yet been applied to vision language models.
Method Variants and Limitations
Variants:
TransCoders and CrossCoders extend SAEs by incorporating cross-layer and cross-model feature analysis, enabling comparisons both within layers and across different models.
Limitations:
Applying SAEs to large-scale models is computationally expensive due to the increased dimensionality of the sparse space.
Automated Explanation
What is Automated Explanation
Automated explanation methods aim to make neural networks more interpretable by translating their abstract representations into human-understandable concepts. Unlike traditional methods that emphasize identifying important features in the input space, automated explanations focus on uncovering the meaning behind these features. These methods minimize reliance on manual analysis, bridging the gap between mathematical representations and human intuition.
How Automated Explanation Works
There are two primary approaches to automated explanation:
Text-image space alignment: This type of method establishes connections between visual features and natural language descriptions by mapping activations into a shared semantic space. This enables the discovery of interpretable concepts that explain model behavior.
Text-image space alignment aims to find concepts to match with the model’s internal representations.
Example: TextSpan
Goal: To reveal interpretable text-labeled bases for the outputs of the vision encoder.
Process:
Cache vision encoder attention head outputs.
Use a predefined text bank to greedily select text descriptions that maximize explained variance.
Analyze discovered attention heads to identify interpretable properties like “color” or “counting.”
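The snippet below sketches this greedy selection step in simplified form, assuming the attention-head outputs and the candidate text descriptions have already been embedded in a shared space; it is an illustration in the spirit of TextSpan, not the original implementation.

```python
# Greedy text-basis selection in the spirit of TextSpan (simplified sketch).
# head_outputs: (N, d) outputs of one attention head over a probe image set.
# text_bank:    (M, d) embeddings of candidate text descriptions (same space).
import torch

def greedy_text_basis(head_outputs, text_bank, k=5):
    data = head_outputs - head_outputs.mean(dim=0)   # work with centered outputs
    texts = text_bank.clone()
    selected = []
    for _ in range(k):
        # Score each candidate by how much variance its direction explains.
        dirs = torch.nn.functional.normalize(texts, dim=-1)
        scores = (data @ dirs.T).pow(2).sum(dim=0)   # (M,)
        best = int(scores.argmax())
        selected.append(best)
        d = dirs[best]
        # Remove the explained component from the data and the remaining texts.
        data = data - (data @ d)[:, None] * d
        texts = texts - (texts @ d)[:, None] * d
    return selected  # indices of the chosen text descriptions
```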
Data distribution-based analysis: This type of method explores patterns in neuron activations across diverse input types to reveal specialized neurons or components. This approach uses either supervised or unsupervised methods to explain the underlying distribution of neural activations.
Data distribution-based analysis uses supervised or unsupervised methods to translate the distribution of most-activating examples into natural language concepts.
Example
Supervised Approaches
Supervised methods use concept-labeled data to guide the interpretation of neural network components. These methods identify components that consistently activate strongly when presented with specific input types. For example:
A neuron that activates strongly for images of cats but remains inactive for other inputs can be classified as specialized for detecting “cat” features.
While supervised approaches provide clear and verifiable concept mappings, their reliance on labeled data limits scalability and may miss concepts not included in the predefined set.
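As a toy illustration of the supervised setting described above (not a method from the cited works), one can compare a neuron’s average activation on inputs labeled with a concept against its activation on all other inputs:

```python
# Toy selectivity check: does a neuron fire mainly on inputs labeled with a
# given concept (e.g., "cat")? A large gap suggests concept specialization.
import numpy as np

def concept_selectivity(activations, labels, neuron_idx, concept):
    """activations: (n_examples, n_neurons); labels: (n_examples,) concept labels."""
    acts = activations[:, neuron_idx]
    on_concept = acts[labels == concept].mean()
    off_concept = acts[labels != concept].mean()
    return on_concept - off_concept
```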
Unsupervised Approaches
Unsupervised methods take a data-driven approach, discovering meaningful patterns in neural activations without requiring labeled data. These techniques use clustering or dimensionality reduction to group similar activation patterns and identify components’ functions.
Recent advances integrate language models or vision language models to automatically generate natural language descriptions of discovered patterns, offering greater flexibility in concept discovery.
However, ensuring the meaningfulness and reliability of these concepts remains challenging for practical applications.
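A rough sketch of such an unsupervised pipeline is shown below: cluster cached activation patterns and collect each cluster’s top examples, which could then be passed to a captioning model for description. The file name and cluster count are hypothetical placeholders.

```python
# Sketch of an unsupervised pipeline: cluster activation patterns, then describe
# each cluster's representative examples (e.g., with a captioning model).
import numpy as np
from sklearn.cluster import KMeans

activations = np.load("neuron_activations.npy")   # (n_examples, n_neurons), hypothetical cache
n_clusters = 20
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(activations)

for c in range(n_clusters):
    members = np.where(kmeans.labels_ == c)[0][:10]
    # These example indices would be handed to a VLM/LLM to generate a
    # natural-language description of what the cluster has in common.
    print(f"cluster {c}: example indices {members.tolist()}")
```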
Key Findings from Existing Works
Automated explanation methods have led to several notable discoveries:
TextSpan: Identified specialized attention heads in vision encoders responsible for properties like “color” and “counting.” This enabled targeted interventions, such as reducing spurious correlations and improving property-based image retrieval. Building upon this foundation, Balasubramanian et al. extended TextSpan’s applicability beyond CLIP to other ViTs. They proposed an automated representation decomposition method that analyzes the computational graph generated during the forward pass, breaks down models’ internal contributions to the final representation, and maps these components to CLIP space, where TextSpan can then be used for text-based interpretation.
Neuron Specialization:
Language-Specific Neurons: Tang et al. discovered neurons in LLMs that activate exclusively for particular languages.
Domain-Specific Neurons: MMNeuron revealed neurons specialized for particular domains in vision language models. Interestingly, deactivating these neurons often had minimal effect on task performance, suggesting that VLMs rely on generalized representations.
Modality-Specific Neurons: Miner further refined the methodology to find modality-specific neurons. They also reveal that modality-specific neurons are primarily concentrated in shallow layers, with most modality information remaining within its original token set.
Sparse Autoencoders (SAEs): Rao et al. used cosine similarities between decoder weights and word embeddings to map neural features to human-understandable concepts, providing interpretable sparse representations.
SpLiCE: Bhalla et al. introduced sparse mappings that align neural features with a comprehensive semantic vocabulary in the CLIP embedding space, ensuring concise yet informative concept representations.
MAIA: Shaham et al. developed an automated framework for hypothesis-driven interpretability. MAIA iteratively tested hypotheses, answering queries such as identifying neurons selective for specific features like “forested backgrounds.”
Method Variants and Limitations
Limitations:
Ensuring that discovered concepts are both meaningful and reliable for practical applications is challenging, as concepts that seem plausible to humans may not faithfully reflect internal processing mechanisms.
Many methods are developed and tested on specific architectures (e.g., CLIP, ViTs). Their applicability to other architectures remains underexplored.
Future Directions
While the above mechanistic interpretability studies have provided significant insights into how vision language models (VLMs) function, several challenges remain. This section discusses and summarizes these challenges and proposes potential directions for future research.
From Single Model to Multiple Models
Current Situation: Unlike large language models (LLMs), vision language models (VLMs) exhibit much greater heterogeneity in terms of architectures, data, and training paradigms. For instance, VLMs can differ significantly in their vision encoders, language models, and the connectors between them—ranging from simple linear layers to visual resamplers or cross-modal attention mechanisms. They also vary in their training data, which may include image-captioning datasets, visual instruction tuning data, or interleaved image-text datasets. Additionally, their training paradigms differ, such as whether they perform alignment, or whether the vision encoder is frozen or fine-tuned during training. This substantial heterogeneity may limit the transferability of findings if interpretability studies are only conducted on a single model.
Path Forward: Conducting cross-model analyses is essential to verify conclusions across different VLMs and ensure their generalizability. This approach can help identify universal principles applicable across various VLMs, as well as model-specific insights that could lead to tailored improvements.
From Small Models to Large Models
Current Situation: Current interpretability research in VLMs primarily focuses on smaller-scale models, such as those with 2B or 7B parameters. However, larger VLMs often exhibit emergent capabilities that are absent in smaller models. These new capabilities may pose unique challenges for applying interpretability tools to larger models.
Path Forward: Scaling up interpretability studies to include larger models is critical for understanding how these tools perform at scale and what new insights they might uncover. This effort can deepen our understanding of emergent behaviors and inform the development of interpretability methods suitable for larger models.
From Language-Centric to Vision-Centric
Current Situation: VLMs differ from LLMs in their handling of visual information. While many LLM interpretability tools have been successful in explaining text-based mechanisms, applying these tools directly to VLMs may not suffice due to the richer, more ambiguous nature of visual information. Furthermore, VLMs incorporate vision encoders, language models, and connectors between them, adding layers of complexity to interpretability studies.
Path Forward: Developing tools specifically designed for visual contexts is necessary to address the unique challenges posed by vision-based features. Meanwhile, these tools should consider the intricate architectures of VLMs and prioritize analyzing the vision components and vision language connectors, ensuring that interpretations are accurately attributed to the visual inputs. Additionally, input data used for interpretability should emphasize vision-centric tasks that cannot be easily solved by text-only models, ensuring meaningful insights into how VLMs process visual inputs.
From Static Processes to Dynamic Processes
Current Situation: Interpretability studies often focus on a single checkpoint of a model, ignoring the dynamic nature of information flow during training. For example, VLM training typically involves multiple stages, such as initial alignment using image-captioning data (where only the vision language connector is tuned) followed by end-to-end fine-tuning with diverse instruction-tuning data. These stages may include phase changes where models gain new capabilities or behaviors, such as transitioning from unimodal pre-trained models to multimodal systems.
Path Forward: Studying the dynamics of VLM training is crucial for uncovering novel insights. Applying interpretability tools at different checkpoints during training can shed light on phase changes and the evolution of information flow. Insights from these dynamic studies could also resonate with cognitive science research, such as experiments on restoring vision in previously blind individuals.
From Micro-Level to Macro-Level
Current Situation: Interpretability research often focuses on micro-level phenomena, such as individual neurons or layers, to understand how VLMs process information. However, these findings are rarely connected to macro-level behaviors, such as performance variations across tasks or model designs. For example, recent studies show that CLIP/SigLIP vision encoders pre-trained on image-text data outperform those trained purely on images, such as DINO, when building VLMs. However, the underlying reasons for these differences remain unclear. Similarly, VLMs can struggle with seemingly simple vision-centric tasks like image classification, despite their vision encoders excelling in such tasks.
Path Forward: Bridging the gap between micro-level findings and macro-level behaviors is essential for driving advancements in VLM development. Applying interpretability tools to investigate unresolved macro-level questions—such as why certain vision encoders perform better or why VLMs struggle with specific tasks—can yield actionable insights. For example, probing tools have been employed to link VLM failures on vision-centric tasks to limitations in the vision encoder. Such findings can inform the design of improved vision encoders, potentially combining the strengths of models like CLIP and DINO to overcome these shortcomings.
Conclusion
This work provides a comprehensive review of studies leveraging mechanistic interpretability tools to analyze vision language models (VLMs), including probing techniques, activation patching, logit lenses, sparse autoencoders, and automated explanation methods. These tools have greatly enhanced our understanding of how VLMs represent, integrate, and process multimodal information. Despite these advancements, several key challenges remain. These include the need for validation across a wider range of VLM architectures and training paradigms, a deeper exploration of information flow dynamics throughout training stages, and a stronger alignment between micro-level insights and macro-level behaviors. Addressing these challenges will pave the way for developing more robust and effective VLMs, advancing both their design and practical applications.