RESEARCH

University of Florida Researchers Achieve 99 Percent AI Jailbreak Rate Using Internal Model Architecture Attacks

megaone_admin · Mar 29, 2026 · 2 min read
Engine Score 8/10 — Important

This story covers significant AI-safety research from the University of Florida with broad implications for AI development and deployment. While the specific findings may not be immediately actionable for all readers, the reliable source and the scale of the implications make the story important.


Researchers at the University of Florida have developed a method called Head-Masked Nullspace Steering that bypasses large language model safety guardrails by targeting the internal architecture of the models rather than manipulating prompts from the outside. The technique achieved attack success rates as high as 99 percent on some models with an average of only two attempts needed to bypass safety filters.

The research paper, “Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion,” has been accepted into the 2026 International Conference on Learning Representations, the premier global venue for deep learning research, scheduled for April 23 through 27 in Rio de Janeiro.

The method works by identifying the attention heads inside a language model that are responsible for safety refusal behaviors, then silencing those heads by zeroing out their contribution to the model's output. A steering signal is then injected into the model's internal state within the mathematical nullspace, defined as the set of directions that the muted safety heads cannot influence. Because the silenced heads cannot counteract the steering signal, the model generates unsafe outputs regardless of its training.
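The paper's exact construction is not reproduced here, but the core linear-algebra idea can be illustrated with a toy numpy sketch. Everything below is a hypothetical stand-in: the matrix `H` plays the role of the stacked projection matrices of the identified safety heads, and a steering vector is projected onto the nullspace of `H`, i.e., the directions those heads cannot observe or counteract.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64          # hypothetical residual-stream width
n_safety_heads = 4    # heads we pretend were identified as safety heads
d_head = 8

# Stand-in for the stacked projection matrices of the muted safety heads.
# Its row space spans the directions those heads can respond to.
H = rng.standard_normal((n_safety_heads * d_head, d_model))

# Orthonormal basis for the nullspace of H via SVD:
# directions that are invisible to the safety heads.
_, s, Vt = np.linalg.svd(H)
rank = int(np.sum(s > 1e-10))
null_basis = Vt[rank:]            # shape: (d_model - rank, d_model)

# Project an arbitrary candidate steering direction into that nullspace.
v = rng.standard_normal(d_model)
v_null = null_basis.T @ (null_basis @ v)

# Sanity check: the projected signal produces no response in the heads.
print(np.allclose(H @ v_null, 0))
```

With random `H`, the nullspace here has dimension `d_model - rank`; any signal built from `null_basis` is, by construction, orthogonal to every row of `H`, which is the sense in which the muted heads "cannot counteract" it.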

The technique adapts in real time as the model generates text, continuously steering output away from safety-aligned responses. This closed-loop approach makes it fundamentally different from external jailbreaking methods like prompt injection or social engineering attacks, which operate at the input level and can often be patched by updating system prompts or fine-tuning.
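The closed-loop aspect can also be sketched in toy form. This is not the authors' implementation: the "model" below is just a random nonlinear map, `steer` is assumed to be an already nullspace-projected direction, and the loop simply shows the shape of re-injecting a steering signal into the hidden state after every decoding step.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Stand-in internals: a random transition for hidden states, plus a
# hypothetical steering direction (assumed already nullspace-projected).
W = rng.standard_normal((d, d)) / np.sqrt(d)
steer = rng.standard_normal(d)
steer /= np.linalg.norm(steer)

def generate_step(h):
    """Stand-in for one decoding step of a language model."""
    return np.tanh(W @ h)

h = rng.standard_normal(d)
alpha = 0.5   # steering strength
trace = []
for _ in range(5):
    h = generate_step(h)
    # Closed loop: re-inject the steering signal after every step, so the
    # hidden state is continuously nudged along the chosen direction as
    # generation proceeds, rather than perturbed once at the input.
    h = h + alpha * steer
    trace.append(float(h @ steer))

print(trace)
```

The contrast with prompt-level attacks is visible in the structure: the perturbation is applied inside the loop at every step, which is why patching the input surface (system prompts, input filters) would not remove it.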

The research team, led by Professor Sumit Kumar Jha in UF’s Department of Computer and Information Science and Engineering, used UF’s HiPerGator supercomputer to power the computational work. Collaborators include CISE doctoral student Vishal Pramanik, Maisha Maliha from the University of Oklahoma, and Susmit Jha from SRI International.

The method outperformed existing state-of-the-art jailbreak techniques across four major industry benchmarks: AdvBench, HarmBench, and two others. Professor Jha described the work as defensive in intent, arguing that understanding how models can be broken at the architectural level is necessary to build more robust safety mechanisms. The findings suggest that alignment techniques relying on training-time guardrails may be insufficient against attacks that operate on the model's weights and internal activations.

MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
