New method could increase LLM training efficiency
Source: MIT News - AI
Reasoning Large Language Models (LLMs)
Reasoning LLMs are designed to solve complex problems by breaking them down into a series of smaller steps. These powerful models excel at challenging tasks such as advanced programming and multistep planning.
However, developing reasoning models demands an enormous amount of computation and energy because of inefficiencies in the training process. While a few high‑power processors continuously work through complicated queries, many others sit idle.
Researchers from MIT and elsewhere have found a way to use this computational downtime to efficiently accelerate reasoning‑model training.
How It Works
- Smaller, faster model (drafter) – Trained automatically to predict the outputs of the larger reasoning LLM.
- Verification – The larger model checks the drafter’s predictions.
- Reduced workload – The reasoning model does less work, speeding up training.
The smaller model is trained and deployed adaptively, kicking in only when some processors are idle. By leveraging resources that would otherwise be wasted, training speeds up without additional overhead.
When tested on multiple reasoning LLMs, the method doubled the training speed while preserving accuracy. This could reduce cost and increase energy efficiency for advanced LLM applications such as forecasting financial trends or detecting risks in power grids.
“People want models that can handle more complex tasks. But if that is the goal of model development, then we need to prioritize efficiency. We found a lossless solution to this problem and then developed a full‑stack system that can deliver quite dramatic speedups in practice,” says Qinghao Hu, an MIT postdoc and co‑lead author of a paper on this technique.
Hu is joined on the paper by co‑lead author Shang Yang (EECS graduate student), Junxian Guo (EECS graduate student), senior author Song Han (associate professor in EECS, member of the Research Laboratory of Electronics, and distinguished scientist of NVIDIA), as well as collaborators at NVIDIA, ETH Zurich, the MIT‑IBM Watson AI Lab, and the University of Massachusetts Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
Training Bottleneck
Developers want reasoning LLMs to identify and correct mistakes in their critical‑thinking process, enabling them to handle queries that would trip up a standard LLM.
To teach this skill, developers use reinforcement learning (RL):
- The model generates multiple potential answers to a query.
- It receives a reward for the best candidate.
- The model is updated based on that top answer.
These steps repeat thousands of times as the model learns.
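The loop above can be sketched in a few lines of Python. All of the functions below (`generate`, `reward`, `update`) are hypothetical placeholders standing in for the real model, reward signal, and gradient step, not the actual training code:

```python
import random

def generate(model, query, n):
    """Hypothetical rollout: sample n candidate answers from the model."""
    return [f"{query}-answer-{i}" for i in range(n)]

def reward(answer):
    """Hypothetical reward: score an answer (random here, for illustration)."""
    return random.random()

def update(model, best_answer):
    """Hypothetical update step: nudge the model toward the rewarded answer."""
    return model  # a real implementation would apply a gradient step

model = "reasoning-llm"
for step in range(1000):                                 # repeats thousands of times
    candidates = generate(model, f"query-{step}", n=8)   # rollout
    best = max(candidates, key=reward)                   # reward the best candidate
    model = update(model, best)                          # update on that top answer
```

The rollout line is the expensive part in practice; the update line is comparatively cheap, which is the imbalance the rest of the article addresses.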
The Problem
- Rollout (generating multiple answers) can consume up to 85 percent of the execution time needed for RL training.
- Updating the model—the actual “training” part—takes comparatively little time.
In standard RL, all processors must finish generating their responses before the update step can proceed. If some processors are working through long responses, those that finished early sit idle.
“Our goal was to turn this idle time into speedup without any wasted costs,” Hu adds.
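A toy timing model illustrates the straggler problem: when every processor must wait at a barrier until the longest response finishes, the wasted time is the gap between each worker's finish time and the slowest one. The per-processor times below are invented for illustration:

```python
# Hypothetical per-processor rollout times (seconds) for one RL batch.
rollout_times = [12, 14, 95, 13, 11, 88, 15, 12]

barrier = max(rollout_times)                 # everyone waits for the slowest worker
idle = [barrier - t for t in rollout_times]  # wasted time per processor

print(f"batch takes {barrier}s; total idle time = {sum(idle)}s")
# TLT's idea: spend this idle time training the drafter instead of wasting it.
```

In this toy batch, two long responses dominate, and the short-query processors spend most of the batch idle.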
Speculative Decoding
The researchers turned to speculative decoding, which involves:
- Training a smaller “drafter” model to rapidly guess the larger model’s future outputs.
- Having the larger model verify those guesses.
- Using the accepted guesses for training.
Because the larger model can verify many guesses at once, the process accelerates.
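The draft-and-verify idea can be sketched as a simplified token-level loop. The two models here are deterministic stand-ins, not the paper's implementation, and `speculative_step` is an invented name:

```python
def drafter_guess(prefix, k):
    """Hypothetical small drafter: quickly propose the next k tokens."""
    return [f"tok{len(prefix) + i}" for i in range(k)]

def target_next_token(prefix):
    """Hypothetical large model: the token it would produce after this prefix."""
    return f"tok{len(prefix)}"

def speculative_step(prefix, k=4):
    """Verify k drafted tokens in one pass; keep the longest accepted run."""
    draft = drafter_guess(prefix, k)
    accepted = []
    for tok in draft:
        # The target model checks each guess against what it would have produced.
        if tok == target_next_token(prefix + accepted):
            accepted.append(tok)
        else:
            break
    # If nothing was accepted, fall back to one token from the target model.
    if not accepted:
        accepted = [target_next_token(prefix)]
    return prefix + accepted
```

In this toy the drafter's guesses always match, so all `k` tokens are accepted at once; in real speculative decoding a variable-length prefix of the guesses is accepted, and the speedup comes from the target model verifying all `k` guesses in a single pass instead of generating them one by one.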
An Adaptive Solution: “Taming the Long Tail” (TLT)
Speculative decoding traditionally uses a static drafter, trained once and left unchanged. This doesn’t work for RL, where the reasoning model is updated thousands of times; a static drafter quickly becomes stale.
TLT Components
- Adaptive Drafter Trainer – Uses idle processor time to train the drafter on the fly, keeping it aligned with the target model without extra computational resources.
- Adaptive Rollout Engine – Manages speculative decoding, automatically selecting the optimal strategy for each new batch of inputs and adjusting its configuration based on workload features (e.g., the number of inputs processed by the drafter vs. those accepted by the target model).
The drafter is deliberately lightweight, enabling rapid training. TLT also reuses components of the reasoning‑model training pipeline, gaining extra acceleration.
“As soon as some processors finish their short queries and become idle, we immediately switch them to do draft model training using the same data they are using for the rollout process. The key mechanism is our adaptive speculative decoding — these gains wouldn’t be possible without it,” Hu says.
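The scheduling policy Hu describes can be sketched as a simple dispatch loop. The worker IDs, task names, and `dispatch` function are invented for illustration and are not part of the TLT system's API:

```python
def dispatch(workers, rollout_done):
    """Assign each worker a task: keep rolling out, or switch to drafter training.

    `rollout_done` maps a worker ID to True once its rollout batch has finished.
    Idle workers reuse the rollout data they just produced to train the drafter.
    """
    assignments = {}
    for w in workers:
        if rollout_done[w]:
            assignments[w] = "train_drafter"      # turn idle time into training
        else:
            assignments[w] = "continue_rollout"   # still generating a long response
    return assignments
```

Because drafter training consumes only processors that would otherwise wait at the barrier, the rollout itself is never slowed down.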
Results
- Tested on multiple reasoning LLMs with real‑world datasets.
- Training speed increased by 70 to 210 percent while preserving model accuracy.
- The small drafter model can also be repurposed for efficient deployment as a free byproduct.
Looking Ahead
TLT demonstrates that leveraging idle computational resources can dramatically speed up reasoning‑model training without sacrificing performance. This approach promises lower costs and higher energy efficiency for future LLM development.
Integration and Future Directions
The researchers aim to integrate TLT into a broader range of training and inference frameworks and to discover new reinforcement‑learning applications that could benefit from acceleration using this approach.
“As reasoning continues to become the major workload driving the demand for inference, Qinghao’s TLT is great work to cope with the computation bottleneck of training these reasoning models. I think this method will be very helpful in the context of efficient AI computing,” says Han.
Funding Sources
- MIT‑IBM Watson AI Lab
- MIT AI Hardware Program
- MIT Amazon Science Hub
- Hyundai Motor Company
- National Science Foundation