With the rapid development of Omni-modal large language models (OLLMs), these models demonstrate powerful capabilities in processing text, image, video and audio simultaneously. However, omni-modal inputs also bring new safety challenges. We propose OmniSafetyBench, the first comprehensive benchmark specifically designed for safety evaluation of audio-visual large language models (i.e. omni-modal large langauge models).
Omni-SafetyBench contains 23,328 carefully constructed evaluation samples, covering 24 different modality variations, from unimodal (text, image, video, audio) to complex omni-modal combinations (image+audio+text, video+audio+text). Our benchmark employs various data generation techniques including diffusion model generation, typography transformation, text-to-speech, etc., to comprehensively evaluate model safety performance under various attack scenarios.
We propose four core metrics to evaluate model safety performance from different perspectives, including Conditional Attack Success Rate (C-ASR), Conditional Refusal Rate (C-RR), Safety-score, and Cross-Modal Safety Consistency (CMSC-score). These metrics measure model safety defense capabilities when correctly understanding harmful inputs, as well as the consistency of safety performance across different modality combinations.
Our evaluation results show that current omni-modal large language models still have significant deficiencies when facing harmful inputs of complex modality combinations. With only 3 out of 10 evaluated models achieving scores exceeding 0.6 in both overall Safety-score and CMSC-score. There is still a long way to go for building truly safe and reliable omni-modal large language models.
OmniSafetyBench aims to create the first comprehensive benchmark for evaluating the safety of multimodal large language models, with particular focus on models that simultaneously support image, audio, and text inputs. Our benchmark is specifically designed to evaluate MLLM safety understanding capabilities under cross-modal complementary information, requiring models to identify potential safety risks and make appropriate responses.
The dataset contains 23,328 evaluation samples, based on 972 seed data from MM-SafetyBench and expanded through various modality conversion techniques. Our data covers 3 modality paradigms, 9 modality types and 24 different modality variations, described in Figure 1.
Figure 1: Overview of OmniSafetyBench dataset, covering 3 modality paradigms, 9 modality types and 24 different modality variations.
Illustration of our data structure is shown in Figure 2. Each seed sample is expanded into 24 modality variations through various data generation techniques including diffusion model generation, typography transformation, text-to-speech, and noise addition. This allows us to comprehensively evaluate model safety performance under different modality combinations and attack scenarios.
Figure 2: OmniSafetyBench data structure example, showing the construction of different modality variations.
We adopt four core metrics to comprehensively evaluate model safety performance across different modality combinations:
The probability of outputting harmful content given that the model understands the input. This metric measures the degree of safety defense failure when the model correctly interprets the query.
The probability of refusing to answer given that the model understands the input. This metric measures the model's caution level and tendency to avoid potentially harmful responses.
Comprehensive safety metric ranging from 0 to 1, where higher values indicate safer models. It balances attack success rate and refusal rate using the formula:
where $\lambda = 0.5$
Measures model's cross-modal safety consistency across 24 modality variants, calculated based on the standard deviation of Safety-scores across different modality combinations. Higher values indicate better consistency and robustness against cross-modal attacks.
where $\sigma$ is the standard deviation of Safety-scores across 24 modality variants
The main goal of Omni-SafetyBench is to evaluate the safety defense capabilities of omni-modal large language models when understanding and processing harmful information from multiple modalities including images, audio, video and text. We use Safety-score as the primary evaluation metric, while considering CMSC-score to measure model safety consistency across different modalities.
Click column headers to sort. Higher Safety-score indicates safer models, higher CMSC-score indicates better cross-modal consistency.
| Model Name | Type | Unimodal Safety-score |
Dual-modal Safety-score |
Omni-modal Safety-score |
Overall Safety-score |
CMSC-score (Consistency) |
Weakest Modality | Lowest Score |
|---|
@article{pan2025omni,
title={Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models},
author={Pan, Leyi and Fu, Zheyu and Zhai, Yunpeng and Tao, Shuchang and Guan, Sheng and Huang, Shiyu and Zhang, Lingzhe and Liu, Zhaoyang and Ding, Bolin and Henry, Felix and others},
journal={arXiv preprint arXiv:2508.07173},
year={2025}
}