linear probes

7 hours ago · ai

I Trained Probes to Catch AI Models Sandbagging

TL;DR: I extracted “sandbagging directions” from three open-weight models and trained linear probes that detect sandbagging intent with 90–96% accuracy. The mo...

#sandbagging #model probing #linear probes #AI safety #Mistral #Gemma #evaluation gaming #model steering