RLHF from Scratch
Source: Hacker News
What the code implements (short)
- src/ppo/ppo_trainer.py: a simple PPO training loop to update a language model policy.
- src/ppo/core_utils.py: helper routines (rollout/processing, advantage/return computation, reward wrappers).
- src/ppo/parse_args.py: CLI/experiment argument parsing for training runs.
- tutorial.ipynb: the notebook that ties the pieces together (theory, small experiments, and examples that call the code above).
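For readers unfamiliar with the "advantage/return computation" step, here is a minimal sketch of generalized advantage estimation (GAE), a common choice in PPO implementations. It is not taken from src/ppo/core_utils.py: the function name `compute_gae`, its parameters, and the default values of `gamma` and `lam` are assumptions made for illustration, and the repository may compute advantages differently.

```python
from typing import List, Tuple


def compute_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,      # discount factor (assumed default)
    lam: float = 0.95,        # GAE smoothing parameter (assumed default)
    last_value: float = 0.0,  # value estimate after the final step
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: per-step advantages and returns for one rollout."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk the trajectory backwards so each step can reuse the running sum.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Returns serve as regression targets for the value function.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma=1.0` and `lam=1.0` this reduces to the plain Monte Carlo return minus the value baseline, which is a convenient sanity check on toy rollouts.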
What’s covered in the notebook (brief)
- RLHF pipeline overview: preference data → reward model → policy optimization.
- Short demonstrations of reward modeling, PPO‑based fine‑tuning, and comparisons.
- Practical notes and small runnable code snippets to reproduce toy experiments.
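The two learning signals in the pipeline above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the notebook's code: it assumes the reward model is trained with the standard pairwise (Bradley-Terry) loss and that the policy uses the standard PPO clipped surrogate. The function names and the `clip_eps` default are hypothetical, and the scalar inputs stand in for what would be tensors in a real training run.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Hypothetical helper: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the preferred
    response above the rejected one.
    """
    diff = r_chosen - r_rejected
    # Numerically stable softplus(-diff), which equals -log(sigmoid(diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def ppo_clipped_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,  # clipping range (assumed default)
) -> float:
    """Hypothetical helper: PPO clipped surrogate for one sampled token."""
    # Probability ratio between the updated policy and the rollout policy.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes the incentive to push the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the direction of the advantage.
    return min(ratio * advantage, clipped_ratio * advantage)
```

In training, the reward loss is minimized over preference pairs, while the clipped objective is maximized (or its negative minimized) over sampled tokens.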
How to try
- Open tutorial.ipynb in Jupyter and run cells interactively.
- Inspect src/ppo/ to see how the notebook maps to the trainer and utilities.
If you want a shorter or more hands‑on example (e.g., a single script to run a tiny DPO or PPO demo), let me know and I can add it.