Franciscu Sedda


SOM directions are better than one: multi-directional refusal suppression in language models

Piras, Giorgio (co-first author); Mura, Raffaele (co-first author); Brau, Fabio; Oneto, Luca; Roli, Fabio; Biggio, Battista (last author)

2026-01-01

Abstract

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model’s latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
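
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the batch variant of the SOM update, the 2x2 grid, the neighborhood width `sigma`, the QR-based projection used for ablation, and all function names are assumptions chosen for illustration. With a single neuron, the batch update returns the centroid of the harmful representations, so the resulting direction coincides with the difference-in-means direction, which mirrors the generalization claim in the abstract.

```python
import numpy as np


def diff_in_means(harmful, harmless):
    """Single refusal direction: harmful centroid minus harmless centroid."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def train_batch_som(data, grid=(2, 2), n_iters=20, sigma=1.0, seed=0):
    """Train a small SOM with the batch update rule (an assumed variant).

    Returns an array of shape (n_neurons, dim) holding the neuron weights.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_neurons = rows * cols
    # Position of every neuron on the 2-D map.
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    # Squared map distance between every pair of neurons.
    grid_d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    # Initialize the neurons from randomly chosen data points.
    idx = rng.choice(len(data), size=n_neurons, replace=len(data) < n_neurons)
    weights = data[idx].copy()
    for _ in range(n_iters):
        # Best-matching unit (closest neuron) for every sample.
        d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=-1)
        bmu = d2.argmin(axis=1)
        # Gaussian neighborhood weight of each neuron for each sample: (k, n).
        h = np.exp(-grid_d2[:, bmu] / (2.0 * sigma**2))
        # Each neuron becomes a neighborhood-weighted mean of the data.
        weights = (h @ data) / h.sum(axis=1, keepdims=True)
    return weights


def refusal_directions(harmful, harmless, grid=(2, 2)):
    """One unit-norm direction per neuron: neuron minus harmless centroid."""
    neurons = train_batch_som(harmful, grid=grid)
    dirs = neurons - harmless.mean(axis=0)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)


def ablate(acts, directions):
    """Remove from each activation its component in the span of the directions."""
    # Orthonormal basis whose span contains all directions: shape (dim, k).
    q, _ = np.linalg.qr(directions.T)
    return acts - (acts @ q) @ q.T
```

In this sketch, `harmful` and `harmless` stand for arrays of shape `(n_prompts, hidden_dim)` holding latent representations collected at one layer. Calling `ablate(acts, refusal_directions(harmful, harmless))` returns activations with no remaining component along any extracted direction.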


Year: 2026
Language(s): English
Volume title: AAAI-26 Technical Tracks 39
Series title: PROCEEDINGS OF THE ... AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE
Volume: 40
Issue: 39
First page: 32728
Last page: 32736
Number of pages: 9
DOI: https://dx.doi.org/10.1609/aaai.v40i39.40551
Conference title: 40th Annual AAAI Conference on Artificial Intelligence
Referee: Anonymous experts
Conference dates: January 22nd - 27th
Conference venue: Singapore
Conference scope: international
Main characterization: scientific
International co-authors: no
Type: 4 Contribution in conference proceedings (Proceeding)::4.1 Contribution in conference proceedings
All authors: Piras, Giorgio; Mura, Raffaele; Brau, Fabio; Oneto, Luca; Roli, Fabio; Biggio, Battista
Faculty site type: 273
Number of authors: 6
Type: 4.1 Contribution in conference proceedings
Full text: none
Type: info:eu-repo/semantics/conferencePaper

Projects:

Project title: Cybersecurity for AI-Augmented Systems
Acronym: Sec4AI4Sec
Funder name: European Commission
Funding stream: Horizon Europe Framework Programme
Award number: 101120393

Project title: European Lighthouse on Secure and Safe AI
Acronym: ELSA
Funder name: European Commission
Funding stream: Horizon Europe Framework Programme
Award number: 101070617

Project title: A COMPREHENSIVE TRUSTWORTHY FRAMEWORK FOR CONNECTED MACHINE LEARNING AND SECURE INTERCONNECTED AI SOLUTIONS
Acronym: CoEvolution
Funder name: European Commission
Funding stream: Horizon Europe Framework Programme
Award number: 101168560

Files in This Item:

There are no files associated with this item.

University of Cagliari
