reward hacking

0 months ago · ai

Why AI safety should be enforced structurally, not trained in

Most current AI safety work assumes an unsafe system and tries to train better behavior into it.
- We add more data.
- We add more constraints.
- We add more fi...

#AI safety #alignment #reinforcement learning #structural enforcement #machine learning #AI governance #reward hacking

1 month ago · ai

[Paper] EvilGenie: A Reward Hacking Benchmark

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents ...

#reward hacking #code generation #benchmark #LLM evaluation #AI safety