SOM directions are better than one: multi-directional refusal suppression in language models

Piras, Giorgio (Co-first); Mura, Raffaele (Co-first); Brau, Fabio; Oneto, Luca; Roli, Fabio; Biggio, Battista (Last)
2026-01-01

Abstract

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model’s latent space, e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as low-dimensional manifolds embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. We conclude by analyzing the mechanistic implications of our approach.
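The pipeline outlined in the abstract (difference-in-means baseline, SOM neurons trained on harmful representations, one direction per neuron, ablation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the online SOM update rule, the grid size, the decay schedules, the unit-normalization of directions, and the QR-based joint projection used for ablation are all assumptions chosen for brevity, and the function names (`difference_in_means`, `train_som`, `refusal_directions`, `ablate`) are hypothetical. The input arrays stand in for hidden representations of harmful and harmless prompts and are filled with synthetic data here.

```python
import numpy as np


def difference_in_means(harmful, harmless):
    """Single-direction baseline: harmful centroid minus harmless centroid."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def train_som(data, grid=(2, 2), epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Train a small online SOM (assumed variant); returns (n_neurons, dim) weights."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of each neuron, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise neurons from distinct data points (needs len(data) >= rows * cols).
    init_idx = rng.choice(len(data), size=rows * cols, replace=False)
    weights = data[init_idx].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = 1.0 - step / n_steps  # linear decay of learning rate and radius
            # Best-matching unit: the neuron closest to the sample.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * (sigma * frac + 1e-3) ** 2))
            # Pull the BMU and its grid neighbors toward the sample.
            weights += (lr * frac) * h[:, None] * (x - weights)
            step += 1
    return weights


def refusal_directions(harmful, harmless, **som_kwargs):
    """One direction per SOM neuron: neuron minus harmless centroid, unit-normalized."""
    neurons = train_som(harmful, **som_kwargs)
    directions = neurons - harmless.mean(axis=0)
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)


def ablate(activations, directions):
    """Remove the component of each activation lying in span(directions).

    Assumes the directions are linearly independent and fewer than the
    hidden dimension; QR yields an orthonormal basis of their span.
    """
    q, _ = np.linalg.qr(directions.T)  # q has shape (dim, k)
    return activations - (activations @ q) @ q.T


# Synthetic stand-ins for model representations (not real data).
rng = np.random.default_rng(0)
harmful = rng.normal(loc=2.0, size=(200, 16))
harmless = rng.normal(loc=0.0, size=(200, 16))

single = difference_in_means(harmful, harmless)               # shape (16,)
multi = refusal_directions(harmful, harmless, grid=(2, 2))    # shape (4, 16)
ablated = ablate(harmful, multi)                              # shape (200, 16)
```

Projecting onto the joint span, rather than subtracting each direction in turn, avoids removing overlapping components twice when the directions are not orthogonal. Whether the paper ablates directions jointly, sequentially, or after a selection step is not stated in the abstract.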
2026
English
AAAI-26 Technical Tracks 39
40
39
32728
32736
9
40th Annual AAAI Conference on Artificial Intelligence
Anonymous experts
January 22nd - 27th
Singapore
international
scientific
no
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
Piras, Giorgio; Mura, Raffaele; Brau, Fabio; Oneto, Luca; Roli, Fabio; Biggio, Battista
273
6
4.1 Contribution in conference proceedings
none
info:eu-repo/semantics/conferencePaper
   Cybersecurity for AI-Augmented Systems
   Sec4AI4Sec
   European Commission
   Horizon Europe Framework Programme
   101120393

   European Lighthouse on Secure and Safe AI
   ELSA
   European Commission
   Horizon Europe Framework Programme
   101070617

   A COMPREHENSIVE TRUSTWORTHY FRAMEWORK FOR CONNECTED MACHINE LEARNING AND SECURE INTERCONNECTED AI SOLUTIONS
   CoEvolution
   European Commission
   Horizon Europe Framework Programme
   101168560
