linear probes | EUNO.NEWS

2 hours ago · ai

I trained probes to catch sandbagging in AI models

TL;DR: I extracted "sandbagging directions" from three open-weight models and trained linear probes that detect sandbagging intent with 90-96% accuracy. The mo...

#sandbagging #model probing #linear probes #AI safety #Mistral #Gemma #evaluation gaming #model steering