Researchers at the University of Florida have developed a method called Head-Masked Nullspace Steering that bypasses large language model safety guardrails by targeting the internal architecture of the models rather than manipulating prompts from the outside. The technique achieved attack success rates as high as 99 percent on some models, with an average of only two attempts needed to bypass safety filters.
The research paper, “Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion,” has been accepted to the 2026 International Conference on Learning Representations (ICLR), the premier global venue for deep learning research, scheduled for April 23 through 27 in Rio de Janeiro.
The method works by identifying attention heads inside language models that are responsible for safety refusal behaviors, then silencing those heads by zeroing out their contribution to the decision matrix. A steering signal is then injected into the model’s internal state within the mathematical nullspace, defined as the set of directions that the muted safety heads cannot influence. Because the silenced heads cannot, by construction, counteract the steering signal, the model generates unsafe outputs regardless of its training.
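The geometry behind this idea can be illustrated with a minimal numpy sketch. Everything here is hypothetical and simplified (toy dimensions, a random output projection standing in for the masked safety heads, a random steering vector); the point is only that a vector projected into the nullspace of the heads' write subspace is mathematically invisible to those heads, which is not the paper's actual procedure but the linear-algebra principle it describes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 16, 4, 2   # toy sizes, not the paper's

# Hypothetical output projection for two masked "safety" heads: each row
# maps one head-output dimension into the residual stream.
W_o_safety = rng.standard_normal((n_heads * d_head, d_model))

# The heads can only write directions spanned by the rows of W_o_safety.
# The orthogonal complement of that row space (the right nullspace) is the
# set of residual-stream directions the heads cannot produce or cancel.
_, s, vt = np.linalg.svd(W_o_safety)
rank = int(np.sum(s > 1e-10))
nullspace_basis = vt[rank:]           # shape: (d_model - rank, d_model)

# Project an arbitrary steering vector into that nullspace.
steer = rng.standard_normal(d_model)
steer_null = nullspace_basis.T @ (nullspace_basis @ steer)

# Mask the heads (zero their contribution) and inject the nullspace
# steering signal into the hidden state.
hidden = rng.standard_normal(d_model)
masked_head_out = np.zeros(n_heads * d_head)   # head masking
hidden_steered = hidden + masked_head_out @ W_o_safety + steer_null
```

Because `steer_null` is orthogonal to every row of `W_o_safety`, no output those heads could ever emit can cancel it, which mirrors the article's claim that the silenced heads cannot counteract the steering signal.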
The technique adapts in real time as the model generates text, continuously steering output away from safety-aligned responses. This closed-loop approach makes it fundamentally different from external jailbreaking methods like prompt injection or social engineering attacks, which operate at the input level and can often be patched by updating system prompts or fine-tuning.
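A closed-loop correction of this kind can be sketched in a few lines. The sketch below is purely illustrative: the "refusal direction," the toy forward step, and the per-step projection are stand-ins invented for this example, not the paper's method. It shows only the control-loop shape, re-measuring and re-steering the hidden state at every generation step rather than once at the input.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy hidden-state size

# Hypothetical "refusal" direction in activation space; illustrative only.
refusal_dir = rng.standard_normal(d)
refusal_dir /= np.linalg.norm(refusal_dir)

def toy_forward(hidden):
    """Stand-in for one decoding step of the model."""
    W = rng.standard_normal((d, d)) * 0.1
    return hidden + W @ hidden

hidden = rng.standard_normal(d)
for _ in range(5):
    hidden = toy_forward(hidden)
    # Closed loop: at every step, measure how far the state has drifted
    # toward the refusal direction, then steer that component back out.
    alignment = float(hidden @ refusal_dir)
    hidden = hidden - alignment * refusal_dir
```

A one-shot input-level jailbreak has no analogue of this loop, which is why, as the article notes, prompt-level defenses such as system-prompt patches do not address it.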
The research team, led by Professor Sumit Kumar Jha in UF’s Department of Computer and Information Science and Engineering, used UF’s HiPerGator supercomputer to power the computational work. Collaborators include CISE doctoral student Vishal Pramanik, Maisha Maliha from the University of Oklahoma, and Susmit Jha from SRI International.
The method outperformed existing state-of-the-art jailbreak techniques across four major industry benchmarks, including AdvBench and HarmBench. Professor Jha described the work as defensive in intent, arguing that understanding how models can be broken at the architectural level is necessary to build more robust safety mechanisms. The findings suggest that current alignment techniques that rely on training-time guardrails may be insufficient against attacks that operate at the weight level of the model itself.
