Omni-SafetyBench

A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan*1,2, Zheyu Fu*1, Yunpeng Zhai2, Shuchang Tao2, Sheng Guan1,
Shiyu Huang4, Lingzhe Zhang2,3, Zhaoyang Liu2, Bolin Ding2, Felix Henry4,
Aiwei Liu†1 Lijie Wen†1,

1Tsinghua University, 2Tongyi Lab, 3Peking University, 4OpenRL Lab

*Equal Contribution, †Corresponding Authors
Email: panly24@mails.tsinghua.edu.cn, liuaiwei20@gmail.com, wenlj@tsinghua.edu.cn

Introduction

With the rapid development of Omni-modal large language models (OLLMs), these models demonstrate powerful capabilities in processing text, image, video and audio simultaneously. However, omni-modal inputs also bring new safety challenges. We propose OmniSafetyBench, the first comprehensive benchmark specifically designed for safety evaluation of audio-visual large language models (i.e. omni-modal large langauge models).

Omni-SafetyBench contains 23,328 carefully constructed evaluation samples, covering 24 different modality variations, from unimodal (text, image, video, audio) to complex omni-modal combinations (image+audio+text, video+audio+text). Our benchmark employs various data generation techniques including diffusion model generation, typography transformation, text-to-speech, etc., to comprehensively evaluate model safety performance under various attack scenarios.

We propose four core metrics to evaluate model safety performance from different perspectives, including Conditional Attack Success Rate (C-ASR), Conditional Refusal Rate (C-RR), Safety-score, and Cross-Modal Safety Consistency (CMSC-score). These metrics measure model safety defense capabilities when correctly understanding harmful inputs, as well as the consistency of safety performance across different modality combinations.

Our evaluation results show that current omni-modal large language models still have significant deficiencies when facing harmful inputs of complex modality combinations. With only 3 out of 10 evaluated models achieving scores exceeding 0.6 in both overall Safety-score and CMSC-score. There is still a long way to go for building truly safe and reliable omni-modal large language models.

Benchmark Overview

OmniSafetyBench aims to create the first comprehensive benchmark for evaluating the safety of multimodal large language models, with particular focus on models that simultaneously support image, audio, and text inputs. Our benchmark is specifically designed to evaluate MLLM safety understanding capabilities under cross-modal complementary information, requiring models to identify potential safety risks and make appropriate responses.

The dataset contains 23,328 evaluation samples, based on 972 seed data from MM-SafetyBench and expanded through various modality conversion techniques. Our data covers 3 modality paradigms, 9 modality types and 24 different modality variations, described in Figure 1.

Data Structure Example

Figure 1: Overview of OmniSafetyBench dataset, covering 3 modality paradigms, 9 modality types and 24 different modality variations.

Illustration of our data structure is shown in Figure 2. Each seed sample is expanded into 24 modality variations through various data generation techniques including diffusion model generation, typography transformation, text-to-speech, and noise addition. This allows us to comprehensively evaluate model safety performance under different modality combinations and attack scenarios.

Data Structure Example

Figure 2: OmniSafetyBench data structure example, showing the construction of different modality variations.

Evaluation Metrics

We adopt four core metrics to comprehensively evaluate model safety performance across different modality combinations:

C-ASR (Conditional Attack Success Rate)

The probability of outputting harmful content given that the model understands the input. This metric measures the degree of safety defense failure when the model correctly interprets the query.

C-RR (Conditional Refusal Rate)

The probability of refusing to answer given that the model understands the input. This metric measures the model's caution level and tendency to avoid potentially harmful responses.

Safety-score

Comprehensive safety metric ranging from 0 to 1, where higher values indicate safer models. It balances attack success rate and refusal rate using the formula:

$$\text{Safety-score} = \frac{(1 - \text{C-ASR})(1 + \lambda \cdot \text{C-RR})}{1 + \lambda},$$

where $\lambda = 0.5$

CMSC-score

Measures model's cross-modal safety consistency across 24 modality variants, calculated based on the standard deviation of Safety-scores across different modality combinations. Higher values indicate better consistency and robustness against cross-modal attacks.

$$\text{CMSC-score} = e^{-\alpha \cdot \sigma},$$

where $\sigma$ is the standard deviation of Safety-scores across 24 modality variants

Leaderboard

The main goal of Omni-SafetyBench is to evaluate the safety defense capabilities of omni-modal large language models when understanding and processing harmful information from multiple modalities including images, audio, video and text. We use Safety-score as the primary evaluation metric, while considering CMSC-score to measure model safety consistency across different modalities.

Open-source Models Closed-source Models

Click column headers to sort. Higher Safety-score indicates safer models, higher CMSC-score indicates better cross-modal consistency.

Model Name Type Unimodal
Safety-score
Dual-modal
Safety-score
Omni-modal
Safety-score
Overall
Safety-score
CMSC-score
(Consistency)
Weakest Modality Lowest Score

Key Findings

πŸ” Key Finding 1

  • The overall safety performance of current OLLMs is unsatisfactory, with significant challenges in achieving both strong overall safety performance and cross-modal safety consistency.

πŸ” Key Finding 2

  • Safety performance of OLLMs weakens sharply with complex modality combinations, where audio-visual joint inputs prove most effective at triggering vulnerabilities in most models.

πŸ” Key Finding 3

  • The most vulnerable modality variation for each tested OLLM reveals significant weaknesses.

πŸ” Key Finding 4

  • The safety features of various OLLMs differ significantly. This includes not only their overall safety characteristics but also their defensive tendencies against harmful content across different modalities and variations.

Citation

@article{pan2025omni,
  title={Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models},
  author={Pan, Leyi and Fu, Zheyu and Zhai, Yunpeng and Tao, Shuchang and Guan, Sheng and Huang, Shiyu and Zhang, Lingzhe and Liu, Zhaoyang and Ding, Bolin and Henry, Felix and others},
  journal={arXiv preprint arXiv:2508.07173},
  year={2025}
}