ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Department of Computer Science, Purdue University

👓 Introduction

Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to ensure both safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): i) evaluating input visual contents and output responses to establish robust safety awareness in multimodal settings, and ii) aligning unsafe behaviors at both shallow and deep levels by conditioning the VLM's generative distribution on an interference prefix and performing a sentence-level best-of-$N$ search for the most harmless and helpful generation paths.
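The control flow below is a minimal sketch of this two-phase loop, not the authors' implementation. All callables (`generate`, `next_sentence`, `eval_image`, `eval_text`, `score`) are hypothetical stand-ins for the paper's pre-generation image evaluator, textual safety evaluator, reward scorer, and VLM decoding utilities.

```python
def eta_generate(generate, next_sentence, eval_image, eval_text, score,
                 image, prompt, prefix="As an AI assistant, ", n=5,
                 max_sentences=20):
    """Sketch of two-phase ETA: evaluate first, align only if flagged unsafe."""
    # Phase 1 (Evaluating): pre-check the visual input, then post-check
    # the vanilla response; if both pass, return it unchanged.
    response = generate(image, prompt)
    if eval_image(image) and eval_text(response):
        return response

    # Phase 2 (Aligning):
    # Shallow alignment -- seed decoding with an interference prefix.
    aligned = prefix
    for _ in range(max_sentences):
        # Deep alignment -- sentence-level best-of-N: sample N candidate
        # continuations and keep the most harmless and helpful one.
        candidates = [next_sentence(image, prompt, aligned) for _ in range(n)]
        best = max(candidates, key=score)
        if not best:  # empty continuation signals decoding is finished
            break
        aligned += best
    return aligned
```

Because both phases run purely at decoding time, the sketch requires no gradient updates to the VLM itself.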


Figure: Left: USR changes from the LLM backbone to the VLM, and finally with our ETA. Right: the pre-generation evaluator effectively distinguishes safe and unsafe images.

ETA establishes a strong multimodal safety awareness and defense mechanism for VLMs without any additional training.

🌟 Motivation

LLM backbones are typically aligned on discrete textual embeddings \( \mathcal{E}_{\text{textual}} \subset \mathbb{R}^d \). In contrast, the continuous visual embeddings \( \mathcal{E}_{\text{visual}} \subset \mathbb{R}^d \) often lie far from all textual embeddings.

Continuous visual token embeddings can bypass existing safety mechanisms that are primarily aligned with discrete textual token embeddings. To verify this hypothesis, we implemented a mapping that transforms continuous visual embeddings to their nearest discrete textual embeddings based on cosine similarity. This mapping results in a significant 7% reduction in the unsafe rate (USR) when evaluated on the SPA-VL Harm test set.
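A minimal PyTorch sketch of this discretization follows; the function name and tensor shapes are our assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def map_to_nearest_textual(visual_emb: torch.Tensor,
                           textual_emb: torch.Tensor) -> torch.Tensor:
    """Replace each continuous visual embedding with its nearest
    discrete textual embedding under cosine similarity.

    visual_emb:  (num_visual_tokens, d) continuous visual embeddings
    textual_emb: (vocab_size, d) the LLM's textual embedding table
    """
    # Normalize rows so a dot product equals cosine similarity.
    v = F.normalize(visual_emb, dim=-1)   # (n, d)
    t = F.normalize(textual_emb, dim=-1)  # (V, d)
    sim = v @ t.T                         # (n, V) cosine similarities
    nearest = sim.argmax(dim=-1)          # index of the closest text token
    return textual_emb[nearest]           # discretized visual embeddings
```

Feeding the returned embeddings to the LLM backbone in place of the continuous ones restricts visual inputs to the region of embedding space the backbone's safety alignment was trained on.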

📌 Results

Compared to other inference-time baselines, ETA significantly decreases the unsafe rate (USR) of responses on various safety benchmarks when applied to different VLM backbones [1].

Evaluated on comprehensive benchmarks and general VQA tasks, ETA preserves the model's general abilities [2].

Applying ETA significantly increases the helpfulness of the generated responses, aligning closely with human preferences, even when compared to fine-tuned methods [3].

Compared to other methods, ETA does not significantly increase inference time [3].

Ablation study on the Aligning phase of ETA [4].

📋 Examples

BibTeX

@article{ding2024eta,
  title={ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time},
  author={Ding, Yi and Li, Bolian and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2410.06625},
  year={2024}
}