RLHF from Scratch

Published: February 10, 2026 at 06:39 AM EST

Source: Hacker News

What the code implements (short)

src/ppo/ppo_trainer.py — a simple PPO training loop to update a language model policy.
src/ppo/core_utils.py — helper routines (rollout/processing, advantage/return computation, reward wrappers).
src/ppo/parse_args.py — CLI/experiment argument parsing for training runs.
tutorial.ipynb — the notebook that ties the pieces together (theory, small experiments, and examples that call the code above).
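
To make the "advantage/return computation" and PPO update mentioned above more concrete, here is a minimal pure-Python sketch. It is not taken from the repository: the function names `compute_gae` and `ppo_clip_loss`, their signatures, and the default hyperparameters are assumptions chosen for illustration, and the actual code in src/ppo/ may be organized differently (for example, operating on tensors rather than lists).

```python
import math


def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode (hypothetical helper).

    rewards[t] and values[t] are equal-length per-step lists. The value after
    the final step is assumed to be 0 because the episode has ended.
    Returns (advantages, returns) with returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # bootstrap value after the last step
    running = 0.0     # discounted sum of TD errors accumulated so far
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate, negated so that lower is better (hypothetical helper)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the updated policy and the rollout policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With `gamma=1.0` and `lam=1.0` the returns from `compute_gae` reduce to plain Monte Carlo returns, which is a handy sanity check when comparing against any reference implementation.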

What’s covered in the notebook (brief)

RLHF pipeline overview: preference data → reward model → policy optimization.
Short demonstrations of reward modeling, PPO‑based fine‑tuning, and comparisons.
Practical notes and small runnable code snippets to reproduce toy experiments.
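
The first two stages of the pipeline above can be sketched in a few lines. The block below is an illustration under stated assumptions, not the notebook's code: it uses the standard Bradley-Terry pairwise loss for reward modeling, and a per-token KL penalty against a reference model, which is a common RLHF convention that the source does not confirm for this repository. The names `pairwise_preference_loss`, `kl_shaped_rewards`, and `kl_coef` are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Written in a softplus form that avoids overflow for large margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_rewards(policy_logprobs, ref_logprobs, sequence_reward, kl_coef=0.1):
    """Per-token rewards for policy optimization (assumed convention).

    Each token is penalized by kl_coef * (log pi - log pi_ref); the scalar
    reward model score is added on the final token only.
    """
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += sequence_reward
    return rewards


# Usage: a reward model that ranks the chosen response higher incurs less loss.
good = pairwise_preference_loss(2.0, 0.0)
bad = pairwise_preference_loss(0.0, 2.0)
assert good < bad
```

The per-token rewards produced this way would then feed into an advantage estimator and a PPO update during the policy optimization stage.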

How to try

Open tutorial.ipynb in Jupyter and run cells interactively.
Inspect src/ppo/ to see how the notebook maps to the trainer and utilities.

If you want a shorter or more hands‑on example (e.g., a single script to run a tiny DPO or PPO demo), let me know and I can add it.
