EUNO.NEWS
  • All (12099) +26
  • AI (1935) +5
  • DevOps (577)
  • Software (6295) +21
  • IT (3262)
  • Education (30)
  • Notice
  • 7 hours ago · ai

    I Trained Probes to Catch AI Models Sandbagging

    TL;DR: I extracted “sandbagging directions” from three open‑weight models and trained linear probes that detect sandbagging intent with 90–96% accuracy. The mo...

    #sandbagging #model probing #linear probes #AI safety #Mistral #Gemma #evaluation gaming #model steering
RSS GitHub © 2025