[Paper] It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Published: June 9, 2026 at 10:44 AM EDT

Source: arXiv - 2606.10931v1

Overview

Warning: This paper contains several toxic and offensive statements.

Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.
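For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at the core of the method: each sampled completion's reward is normalized against the other completions drawn for the same prompt. It is not the authors' training code. The function name, the 0/1 reward values, the use of population standard deviation, and the `eps` constant are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per completion sampled for a single prompt.
    Population standard deviation and `eps` are assumptions of this sketch;
    implementations differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one training prompt:
# 1.0 if a completion matches the rewarded target, 0.0 otherwise.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# If no sampled completion earns the reward, the group has zero variance
# and every advantage is zero.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the mixed group, the single rewarded completion receives a positive advantage and the rest receive negative ones. In the uniform group, all advantages are zero, so that batch contributes no policy-gradient signal. One possible reading, not a claim taken from the paper, is that this is consistent with susceptibility depending on how likely a model initially is to sample the rewarded output.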

Key Contributions

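This paper (arXiv category cs.CL, Computation and Language) reports the following findings:

One-shot GRPO training on a single biased example is sufficient to induce systematic bias.
The resulting stereotype-driven reasoning generalizes across attributes, categories, and benchmarks.
Models differ in their susceptibility depending on their initial likelihood of producing biased outputs.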

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

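The results point to a critical vulnerability in post-training: alignment achieved through large-scale post-training can be overridden by GRPO training on a single biased example.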

Authors

Naihao Deng
Yilun Zhu
Naichen Shi
Clayton Scott
Rada Mihalcea

Paper Information

arXiv ID: 2606.10931v1
Categories: cs.CL
Published: June 9, 2026
