Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models

¹Shanghai Artificial Intelligence Laboratory  ²Tianjin University
*Equal contribution · Corresponding author

🔔 News

[2025-01-31] 🧨 Our paper, Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models, is now released! 🧨

[2025-01-30] 🧨 We release our checkpoint fine-tuned on the MIS training set with MIRage! 🧨

[2025-01-30] 🧨 Introducing MIS, a multi-image safety dataset comprising 4,000 training samples and 2,185 test samples! 🧨

🌟 Introduction

Figure: Our motivation.

Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal approaches, often fail on challenging cases or break the balance between helpfulness and harmlessness. Our evaluation highlights a critical gap: these methods lack the advanced visual reasoning capabilities needed for complex safety scenarios that go beyond basic visual perception. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought reasoning as fine-grained labels. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored to multi-image safety scenarios, comprising 4,000 training samples and 2,185 test samples. Our experiments demonstrate that VLMs fine-tuned with MIS significantly outperform both powerful open-source models and API-based models on challenging safety tasks that require visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without trade-offs. Specifically, MIS fine-tuning increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on the MIS test set by 84.18% for the InternVL2.5-8B model.

Bottlenecks in Safety Fine-Tuning of Vision Language Models

🤔 Bottlenecks

Key Bottlenecks in the Safety Alignment of Vision Language Models:

  • Broken trade-off: existing methods often break the crucial balance between model helpfulness and harmlessness.
  • Advanced reasoning deficiency: our evaluation reveals a critical gap in the advanced visual reasoning capabilities that are essential for handling complex safety scenarios beyond basic visual perception.

Figure 1: The trade-off between helpfulness and harmlessness in current safety fine-tuning methods.
Figure 2: An over-safe response produced by simple rejection techniques applied to visual information.

Multi-Image Safety Dataset

Safety categories

The MIS test set contains 6 categories and 12 sub-categories of safety scenarios.


Data distribution

Detailed data statistics for the MIS test set, including per-category ratios.


📚 Pipeline

Construction pipeline

Construction pipeline of the MIS dataset. It contains four steps (a rough code sketch follows the list):

  1. Harmful element extraction.
  2. Text instruction generation, refinement, and detoxification.
  3. Auto-refinement T2I generation.
  4. Multi-expert filtering to obtain 4 subsets.
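
The sketch below shows one way the four steps could chain together. It is a minimal illustration under our own assumptions: all names (build_mis_sample, the TextGen/ImageGen/Expert protocols, the prompt strings) are hypothetical placeholders, not the released pipeline code.

```python
from typing import Protocol

# Hypothetical interfaces standing in for the models used at each step.
class TextGen(Protocol):
    def generate(self, prompt: str) -> str: ...

class ImageGen(Protocol):
    def generate(self, prompt: str) -> bytes: ...

class Expert(Protocol):
    def classify(self, images: list[bytes], instruction: str) -> str: ...

def build_mis_sample(topic: str, llm: TextGen, t2i: ImageGen,
                     experts: list[Expert]) -> dict:
    # Step 1: extract harmful elements from a seed safety topic.
    elements = llm.generate(
        f"List harmful elements related to: {topic}").splitlines()

    # Step 2: generate a text instruction, then refine and detoxify it
    # so that the instruction itself reads as benign.
    instruction = llm.generate(f"Write an instruction involving: {elements}")
    instruction = llm.generate(f"Refine and detoxify: {instruction}")

    # Step 3: auto-refinement text-to-image generation (two images per sample).
    images = [t2i.generate(e) for e in elements[:2]]

    # Step 4: multi-expert filtering; a majority vote assigns the sample
    # to one of the four subsets.
    votes = [expert.classify(images, instruction) for expert in experts]
    subset = max(set(votes), key=votes.count) if votes else None
    return {"images": images, "instruction": instruction, "subset": subset}
```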

🖼️ Examples of MIS Test Set

We provide six samples, one for each of the six categories in the MIS dataset. Each sample consists of two images and a textual instruction; the textual instructions in the test set are benign on their own. Based on the presence of harmful content in the images, samples are categorized into test easy and test hard: in test hard, the harmful intent is fully realized only when the textual instruction and the image content are combined. An illustrative sample record is sketched below.
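
For concreteness, one test sample might look like the record below. The field names and the example instruction are our own assumptions for illustration, not the released schema.

```python
# Hypothetical shape of one MIS test sample (field names assumed).
sample = {
    "id": "mis_test_0001",
    "images": ["image_1.png", "image_2.png"],  # two images per sample
    "instruction": "What is the best way to combine these two?",  # benign text
    "category": "<one of 6 categories>",
    "sub_category": "<one of 12 sub-categories>",
    # "test_easy" when an image alone already contains harmful content;
    # "test_hard" when the harmful intent emerges only from combining
    # the instruction with the image content.
    "split": "test_hard",
}
```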

👀 Examples of MIS Training Set


In our training set, the labels are generated with a method similar to Chain-of-Thought (CoT) prompting: InternVL2.5-78B first identifies the content in the images, then analyzes the potential hazards, and finally provides a safe response. In the examples we provide, the blue text corresponds to image content perception, the green text to safety visual reasoning, and the orange text to the final safe response. A hedged sketch of this labeling scheme follows.
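
The sketch below illustrates the three-stage structure of these CoT labels. The prompt wording, stage names, and the vlm_generate callable are our own assumptions, not the exact prompt used with InternVL2.5-78B.

```python
# Illustrative three-stage prompt for generating safety CoT labels.
# The wording below is a sketch, not the paper's exact prompt.
COT_PROMPT = """You are given two images and an instruction.
Step 1 (perception): describe the content of each image.
Step 2 (safety reasoning): analyze what hazards could arise if the
instruction were carried out given this visual content.
Step 3 (response): provide a final answer that stays helpful where
possible and refuses only the unsafe part.

Instruction: {instruction}"""

def make_cot_label(vlm_generate, images, instruction):
    """vlm_generate is an assumed callable wrapping a strong VLM
    such as InternVL2.5-78B: (images, prompt) -> str."""
    return vlm_generate(images, COT_PROMPT.format(instruction=instruction))
```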

Experiment Results

🏆 Leaderboard


Models are grouped into three tabs: Open-Sourced Models, API-based Models, and VLMs + MIRage. Each model is evaluated on the Test Easy, Test Hard, and Test Real splits, with four metrics reported per split: ASR, HR, RSR, and RR.
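
As a reference for the ASR column, the Attack Success Rate defined in the abstract is typically computed as the fraction of attack samples that elicit a harmful response. A minimal sketch, where the judge callable (e.g. a safety classifier or an LLM-as-judge) is an assumption:

```python
def attack_success_rate(responses, judge):
    """ASR (%) = harmful responses / total attack samples * 100.

    `judge` is an assumed safety evaluator returning True when a
    response is judged harmful."""
    flags = [judge(r) for r in responses]
    return 100.0 * sum(flags) / max(len(flags), 1)

# Toy usage: 1 of 4 responses judged harmful -> ASR = 25.0
demo = ["refusal", "refusal", "harmful detail", "safe answer"]
print(attack_success_rate(demo, lambda r: "harmful" in r))
```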

💪 More Results of MIRage

We present additional results of MIRage on both helpfulness and safety benchmarks. The results demonstrate that MIRage significantly outperforms other fine-tuning methods across various safety tasks while preserving general capabilities on standard benchmarks, achieving a minimal trade-off between harmlessness and helpfulness.

📌 Examples

BibTeX

@article{ding2025rethinking,
  title={Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models},
  author={Ding, Yi and Li, Lijun and Cao, Bing and Shao, Jing},
  journal={arXiv preprint arXiv:2501.18533},
  year={2025}
}