Understanding Teacher Forcing in Seq2Seq Models
Source: Dev.to
What is Teacher Forcing?
When training a seq2seq neural network, the decoder generates one token at a time, building the output sequence step by step. At each step it needs a previous token as input to predict the next one. The choice of this previous token directly affects how well the model learns.
Without Teacher Forcing
Without teacher forcing, the decoder feeds its own previous prediction back in as the input for the next step.
Example (target sequence: “I am learning”):
| Step | Input token | Predicted token |
|---|---|---|
| 1 | (start token) | I ✅ |
| 2 | I | Is ❌ |
| 3 | Is | … (error propagates) |
A single early mistake can push every later prediction away from the correct sequence, because each wrong output becomes the next step's input. Errors compound step by step, which makes training slow and unstable and makes correct sequences harder to learn.
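The feedback loop can be sketched in a few lines of plain Python. This is only an illustration, not a real network: `TOY_PREDICTIONS`, `predict_next`, the `<sos>` start token, and the `<unk>` fallback are all hypothetical stand-ins, and the wrong entry for "I" is planted on purpose to mimic an early-training mistake.

```python
# Toy stand-in for a decoder: a lookup table from previous token to next token.
# The entry for "I" is deliberately wrong (the target continuation is "am").
TOY_PREDICTIONS = {
    "<sos>": "I",
    "I": "Is",
    "am": "learning",
}


def predict_next(prev_token):
    # Tokens the toy table has never seen map to a placeholder "<unk>".
    return TOY_PREDICTIONS.get(prev_token, "<unk>")


def decode_free_running(num_steps, start_token="<sos>"):
    """Decode without teacher forcing: each prediction becomes the next input."""
    inputs, outputs = [], []
    prev = start_token
    for _ in range(num_steps):
        pred = predict_next(prev)
        inputs.append(prev)
        outputs.append(pred)
        prev = pred  # feed the model's own prediction back in
    return inputs, outputs


target = ["I", "am", "learning"]
fr_inputs, fr_outputs = decode_free_running(len(target))
fr_errors = sum(out != tgt for out, tgt in zip(fr_outputs, target))

for step, (inp, out, tgt) in enumerate(zip(fr_inputs, fr_outputs, target), start=1):
    print(step, inp, out, "ok" if out == tgt else "wrong")
```

In this toy run the wrong token from step 2 becomes the input at step 3, so the decoder never gets the chance to see "am" and step 3 fails as well.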
With Teacher Forcing
With teacher forcing, we feed the decoder the correct token from the dataset at every step instead of its own prediction. Even if the model predicts the wrong token at one step, the next step still receives the ground-truth token; the mistake is penalized by the loss but is not passed forward as input. This ensures that the model always sees the correct context during training.
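As a minimal sketch of the same idea, assume a toy lookup-table decoder that makes one deliberate mistake (predicting "Is" after "I"). The names `TOY_PREDICTIONS`, `predict_next`, and the `<sos>` start token are hypothetical and only serve the illustration. With teacher forcing, the decoder inputs are simply the target sequence shifted right by one position.

```python
# Toy stand-in for a decoder: a lookup table from previous token to next token.
# The entry for "I" is deliberately wrong (the target continuation is "am").
TOY_PREDICTIONS = {
    "<sos>": "I",
    "I": "Is",
    "am": "learning",
}


def predict_next(prev_token):
    # Tokens the toy table has never seen map to a placeholder "<unk>".
    return TOY_PREDICTIONS.get(prev_token, "<unk>")


def decode_teacher_forced(target, start_token="<sos>"):
    """Decode with teacher forcing: every input is the ground-truth previous token."""
    # Inputs are the target shifted right by one, starting with the start token.
    inputs = [start_token] + target[:-1]
    outputs = [predict_next(tok) for tok in inputs]
    return inputs, outputs


target = ["I", "am", "learning"]
tf_inputs, tf_outputs = decode_teacher_forced(target)
tf_errors = sum(out != tgt for out, tgt in zip(tf_outputs, target))

for step, (inp, out, tgt) in enumerate(zip(tf_inputs, tf_outputs, target), start=1):
    print(step, inp, out, "ok" if out == tgt else "wrong")
```

The toy decoder still gets step 2 wrong, but step 3 receives "am" from the dataset rather than the wrong prediction, so the error stays confined to a single step.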
Benefits
- Faster convergence – the model receives accurate context throughout training.
- More stable training – errors do not compound across time steps.
- Easier learning of correct sequences – the decoder focuses on mapping the current hidden state to the next true token rather than trying to recover from its own mistakes.