I Trained Probes to Catch AI Models Sandbagging
TL;DR: I extracted “sandbagging directions” from three open-weight models and trained linear probes that detect sandbagging intent with 90-96% accuracy. The mo...
Mistral unveils its Mistral 3 lineup, including a frontier model and efficient small models designed for offline, customizable enterprise use, aiming to prove sm...
Article URL: https://mistral.ai/news/mistral-3
Comments URL: https://news.ycombinator.com/item?id=46121889
Points: 138
Comments: 38...