Understanding Reinforcement Learning with Human Feedback Part 1: Pre-Training Large Language Models

Published: 3 weeks ago (May 18, 2026 at 03:48 PM EDT)


Source: Dev.to

Pre‑training a decoder‑only transformer

Reinforcement Learning with Human Feedback (RLHF), more often written as Reinforcement Learning from Human Feedback, is one of the techniques used to train the large language models behind products like ChatGPT.
To build such a model from scratch, we first need to understand how to train a decoder‑only transformer. At the start, the model’s weights and biases are initialized randomly, so its predictions are essentially noise: it has not yet learned anything about language or meaning.

The first step in training a large language model is to teach it to predict the next token using a very large body of text (e.g., Wikipedia articles). We take segments of text, split them into tokens, feed the earlier tokens as input, and train the model to predict the token that comes next in the sequence.
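To make the data preparation concrete, here is a minimal sketch of how a piece of text can be turned into (context, next token) training pairs. It assumes whitespace splitting as a stand-in for a real tokenizer, and the function name next_token_pairs and the context_size parameter are made up for this illustration; they do not come from any library.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs for next-token prediction.

    Each target is the token at position i; its context is the
    (at most context_size) tokens that come immediately before it.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        target = tokens[i]
        pairs.append((context, target))
    return pairs


# Assumption: splitting on whitespace stands in for a real tokenizer.
text = "the cat sat on the mat"
tokens = text.split()

pairs = next_token_pairs(tokens, context_size=4)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why a single large corpus can supply a very large number of examples without any manual labeling.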

Example

Input: “The cat sat on the …”
Given this input, the model learns to assign the highest probability to a plausible next word, such as “mat”.
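The idea of "most likely next word" can be illustrated with a toy predictor that simply counts which word followed each prefix in a tiny corpus. This is not how a transformer works internally (a transformer learns the probabilities with a neural network rather than a lookup table); it is only a sketch of the prediction target. The corpus and the predict_next helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical four-sentence corpus, invented for this illustration.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the rug",
]

# For every prefix seen in the corpus, count the words that followed it.
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        followers[tuple(words[:i])][words[i]] += 1


def predict_next(prefix):
    """Return a probability for each word that followed this prefix."""
    options = followers.get(tuple(prefix.split()), Counter())
    total = sum(options.values())
    return {word: count / total for word, count in options.items()}


probs = predict_next("the cat sat on the")
best = max(probs, key=probs.get)
print(probs)
print("most likely next word:", best)
```

A lookup table like this cannot handle prefixes it has never seen, which is one reason a learned model that generalizes across similar contexts is used instead.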

By repeating this process across massive amounts of text, the model gradually learns:

Grammar
Sentence structure
Facts and patterns in language

This training stage is called pre‑training. Over time, it produces a pretrained model that is good at predicting the next token in text.
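One way to make "good at predicting the next token" measurable is a loss function. The article does not name one, but a common choice for this objective is cross-entropy: the negative log of the probability the model assigns to the correct next token. The sketch below uses a made-up four-word vocabulary and hand-picked scores (logits); the all-zero scores are an idealized stand-in for a model that has learned nothing, not the output of a real randomly initialized network.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_index):
    """Cross-entropy loss for a single next-token prediction."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# Hypothetical vocabulary; the correct next token is "mat".
vocab = ["mat", "sofa", "moon", "car"]
target = vocab.index("mat")

# Assumption: equal scores model a network with no preference yet.
untrained_logits = [0.0, 0.0, 0.0, 0.0]
# Assumption: hand-picked scores that favor the correct token.
trained_logits = [3.0, 1.0, -1.0, -1.0]

loss_before = next_token_loss(untrained_logits, target)
loss_after = next_token_loss(trained_logits, target)
print("loss with no preference:", loss_before)  # equals ln(4), about 1.386
print("loss after favoring 'mat':", loss_after)
```

Pre‑training repeatedly adjusts the weights to push this loss down, averaged over every position in the training text.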

Why next‑token prediction isn’t enough for chat

Although the pretrained model excels at next‑token prediction, this ability alone does not make it suitable for answering questions or holding a conversation. Being good at continuing Wikipedia text does not automatically mean the model will give helpful, safe, or conversational responses. Given a question, for instance, such a model may simply continue the text rather than answer it.

To make the model useful for chat, we need to align it with human expectations. This alignment is the focus of RLHF and will be explored in the next article.

