RESEARCH

Apple Research Exposes Critical Gaps in Multimodal AI Safety

megaone_admin · Mar 28, 2026 · 2 min read
Engine Score 7/10 — Important

This is a significant research paper from Apple on AI safety and multimodal understanding, offering novel insights from a highly reliable source. While crucial for advancing the field, it is immediately actionable mainly for researchers.


Apple researchers have identified significant weaknesses in how current AI models handle safety when processing combined image and text inputs, revealing that state-of-the-art systems fail dramatically when required to reason jointly across visual and textual content. The study, which introduces the Vision Language Safety Understanding (VLSU) framework, was accepted at the Learning from Evaluating the Evolving LLM Lifecycle workshop at NeurIPS 2025.

Led by Shruti Palaskar, a team of twelve Apple researchers constructed a comprehensive benchmark of 8,187 samples spanning 15 harm categories and 17 distinct safety patterns. Using a multi-stage pipeline that combined real-world images with human annotation, they systematically evaluated how multimodal foundation models handle safety across different content combinations.
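
To picture what such a benchmark contains: the paper does not publish a data schema, but a sample in a setup like this would plausibly pair an image and a text prompt with separate safety labels for each modality, a joint label for the combination, and tags for harm category and safety pattern. A minimal sketch with purely hypothetical field names:

from dataclasses import dataclass
from enum import Enum

class SafetyLabel(Enum):
    SAFE = "safe"
    BORDERLINE = "borderline"
    UNSAFE = "unsafe"

@dataclass
class VLSUSample:
    # Hypothetical record layout; names are illustrative, not from the paper.
    image_path: str            # real-world image in the pairing
    text: str                  # accompanying text prompt
    image_label: SafetyLabel   # safety of the image alone
    text_label: SafetyLabel    # safety of the text alone
    joint_label: SafetyLabel   # safety of the image-text combination
    harm_category: str         # one of the 15 harm categories
    safety_pattern: str        # one of the 17 combination patterns

The key design point is the three separate labels: they allow an evaluator to distinguish a model that misjudges an individual modality from one that misjudges only the combination.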

The evaluation of eleven state-of-the-art models revealed stark performance gaps between unimodal and multimodal safety assessment. While models achieved over 90% accuracy on clear unimodal safety signals, “performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label,” according to the research paper. Most critically, the study found that 34% of errors in joint image-text safety classification occurred despite correct classification of the individual modalities.
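
To make the 34% figure concrete, one can partition joint-classification errors by whether both unimodal predictions were correct. A minimal sketch, assuming each evaluation record carries gold and predicted labels for the image, the text, and the joint combination (the dictionary keys here are assumptions, not from the paper):

def joint_error_breakdown(records):
    # Fraction of joint-label errors where BOTH unimodal labels were right.
    joint_errors = [r for r in records if r["joint_pred"] != r["joint_gold"]]
    if not joint_errors:
        return 0.0
    both_unimodal_correct = [
        r for r in joint_errors
        if r["image_pred"] == r["image_gold"] and r["text_pred"] == r["text_gold"]
    ]
    # Apple reports this fraction at 34%: the model judges each modality
    # correctly in isolation but still mislabels the combination.
    return len(both_unimodal_correct) / len(joint_errors)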

The research also exposed a fundamental trade-off in current safety systems between over-blocking borderline content and under-refusing genuinely harmful material. The team demonstrated this tension using Gemini-1.5, where instruction framing reduced over-blocking of borderline content from 62.4% to 10.4%, but simultaneously caused the refusal rate for unsafe content to drop from 90.8% to 53.9%.
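
The tension is easiest to see as two rates computed over the same set of model responses: the share of borderline content refused (over-blocking) and the share of unsafe content refused. A minimal sketch, assuming gold labels and a per-response refusal flag are already available (reliably detecting refusals is itself nontrivial and not shown here):

def safety_tradeoff(responses):
    # Each response is assumed to carry a gold label and a refusal flag.
    borderline = [r for r in responses if r["gold"] == "borderline"]
    unsafe = [r for r in responses if r["gold"] == "unsafe"]
    over_block_rate = sum(r["refused"] for r in borderline) / max(len(borderline), 1)
    unsafe_refusal_rate = sum(r["refused"] for r in unsafe) / max(len(unsafe), 1)
    # In the Gemini-1.5 experiment, instruction framing moved these rates
    # from (62.4%, 90.8%) to (10.4%, 53.9%): fewer false blocks on
    # borderline content, but far more missed refusals on unsafe content.
    return over_block_rate, unsafe_refusal_rate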

The VLSU framework addresses a critical gap in current safety evaluation, which “often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination.” The research provides what the authors describe as “a critical test bed to enable the next milestones in research on robust vision-language safety,” highlighting the urgent need for improved compositional reasoning in multimodal AI systems.


MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
