I Trained Probes to Catch AI Models Sandbagging
TL;DR: I extracted “sandbagging directions” from three open-weight models and trained linear probes that detect sandbagging intent with 90-96% accuracy. The mo...