MIT's new fine-tuning method lets LLMs learn new skills without losing old ones
Source: VentureBeat
Introduction
When enterprises fine‑tune large language models (LLMs) for new tasks, they risk breaking everything the models already know. This forces companies to maintain separate models for every skill.
Researchers at MIT, the Improbable AI Lab, and ETH Zurich have developed a new technique that enables LLMs to learn new skills and knowledge without forgetting their past capabilities.
Their technique, called self‑distillation fine‑tuning (SDFT), allows models to learn directly from demonstrations and their own experiments by leveraging the inherent in‑context learning abilities of modern LLMs. Experiments show that SDFT consistently outperforms traditional supervised fine‑tuning (SFT) while addressing the limitations of reinforcement‑learning (RL) algorithms.
For enterprise applications, the method enables a single model to accumulate multiple skills over time without suffering from performance regression on earlier tasks. This offers a potential pathway for building AI agents that can adapt to dynamic business environments, gathering new proprietary knowledge and skills as needed without requiring expensive retraining cycles or losing their general reasoning abilities.
The Challenge of Continual Learning
- Static deployment – Once an LLM is trained and deployed, its parameters remain fixed; it cannot acquire new skills, internalize fresh knowledge, or improve from experience.
- Continual learning – To build truly adaptive AI, the industry must solve “continual learning,” allowing systems to accumulate knowledge much like humans do throughout their careers.
On‑Policy vs. Off‑Policy Learning
| Aspect | On‑Policy Learning | Off‑Policy Learning (SFT) |
|---|---|---|
| Source of data | Model generates its own data (self‑generated attempts) | Fixed dataset of expert demonstrations |
| Error correction | Learns from its own mistakes | Mimics static examples, limited self‑correction |
| Catastrophic forgetting | Mitigated (learning loop) | Prone to severe forgetting |
| Reward signal | Typically requires RL with explicit reward function | No reward needed, but lacks adaptability |
On‑policy learning is the most effective way for models to improve because it lets them correct their own errors and reasoning processes. However, it usually relies on RL, which depends on an explicit reward function. Defining such a function is straightforward for tasks with clear outcomes (e.g., math, coding) but difficult or impossible for many enterprise scenarios (e.g., drafting a legal brief, summarizing a meeting).
RL also struggles when teaching a model entirely new information—such as a specific company protocol or a new product line. As Idan Shenfeld, a Ph.D. student at MIT and co‑author of the paper, told VentureBeat:
“No matter how many times the base model tries, it cannot generate correct answers for a topic it has zero knowledge about,” meaning it never receives a positive signal to learn from.
The standard alternative, supervised fine‑tuning (SFT), provides clear ground truth but is inherently off‑policy. Because the model merely mimics data rather than learning from its own attempts, it often fails to generalize to out‑of‑distribution examples and suffers heavily from catastrophic forgetting.
SDFT seeks to bridge this gap: it enables the benefits of on‑policy learning using only prerecorded demonstrations, without needing a reward function.
How SDFT Works
SDFT solves the problem by employing distillation, a process where a student model learns to mimic a teacher. The researchers’ key insight was to use the model’s own in‑context learning (ICL) capabilities to create a feedback loop within a single model.
In‑Context Learning (ICL)
- Provide the LLM with a difficult task and one or more demonstrations of how similar problems are solved.
- The model solves new problems using these examples without any parameter updates.
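The teacher/student asymmetry described below comes down to what each copy of the model sees in its prompt. As a minimal sketch (the Q/A format and example queries here are illustrative assumptions, not from the paper), the teacher's prompt prepends worked demonstrations for in-context learning, while the student's prompt contains only the bare query:

```python
# Hypothetical demonstration pairs and query for illustration.
demonstrations = [
    ("What is 12 * 4?", "12 * 4 = 48"),
    ("What is 9 * 7?", "9 * 7 = 63"),
]
query = "What is 8 * 6?"

def build_teacher_prompt(demos, query):
    # Teacher view: worked examples precede the query (in-context learning).
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

def build_student_prompt(query):
    # Student view: only the query, mirroring real deployment conditions.
    return f"Q: {query}\nA:"

print(build_teacher_prompt(demonstrations, query))
print(build_student_prompt(query))
```

Because the two prompts differ only in the prepended demonstrations, the same frozen weights can act as a stronger "teacher" simply by being given more context.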
Training Cycle
- Teacher – A frozen copy of the model receives the query plus expert demonstrations. Using ICL, the teacher deduces the correct answer and the reasoning steps required.
- Student – A trainable copy sees only the query, mimicking a real‑world deployment where no answer key is available.
- Feedback Loop –
  - The student generates an answer.
  - The teacher (with access to demonstrations) evaluates the answer and provides a distributional target.
  - The student updates its parameters to align more closely with the teacher’s output.
This process creates an on‑policy learning loop that blends elements of SFT and RL: supervision comes from the model’s own interaction rather than a static dataset, and no external reward signal is required. It also works for new knowledge that RL would miss.
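The distillation step at the heart of this loop can be sketched numerically. In the toy example below (the 4-token vocabulary and logit values are invented for illustration; a real implementation would use full model logits), the loss is the KL divergence pulling the student's next-token distribution toward the demonstration-informed teacher's:

```python
import math

def softmax(logits):
    # Convert raw logits to a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how far the student distribution q is from the teacher p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a tiny 4-token vocabulary.
teacher_logits = [2.0, 0.5, -1.0, 0.0]  # sharpened by in-context demonstrations
student_logits = [1.0, 1.0, 0.0, 0.0]   # less certain without them

teacher_p = softmax(teacher_logits)
student_p = softmax(student_logits)

# The distillation loss: minimizing it moves the student toward the teacher.
loss = kl_divergence(teacher_p, student_p)
print(loss > 0)
```

Since the target is a full distribution rather than a single "correct" token, no external reward function is needed, and the update stays close to the model's existing behavior, which is what mitigates forgetting.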
SDFT in Action
To validate the approach, the researchers tested SDFT using the open‑weight Qwen 2.5 model on three complex, enterprise‑grade skills:
| Skill | Description |
|---|---|
| Science Q&A | Answering scientific questions accurately. |
| Software tool use | Interacting with and reasoning about software utilities. |
| Medical reasoning | Providing clinically relevant answers. |
Quantitative Results
- Science Q&A benchmark: SDFT reached 70.2 % accuracy versus 66.2 % for standard SFT.
- Catastrophic forgetting (measured on “Previous Tasks” – general logic/humanities questions): standard SFT collapsed on these tasks after learning the science skill, while SDFT improved on science and held a steady 64.5 % score on previous tasks.
These results suggest companies could specialize a single model for specific departments (e.g., HR or Legal) without degrading its basic common‑sense or reasoning capabilities.
Knowledge Injection Experiment
The team simulated a knowledge‑injection scenario:
- Created a dataset of fictional “2025 Natural Disasters.”
- Trained the model to incorporate these new facts.
- Tested it on indirect reasoning questions such as: “Given the floods in 2025, which countries are most likely to experience severe agricultural losses?”
The SDFT‑trained model successfully answered these queries, demonstrating that new factual knowledge can be injected without a reward function and without harming existing competencies.
Takeaways
- SDFT merges the strengths of on‑policy learning with the practicality of using only pre‑recorded demonstrations.
- It outperforms traditional supervised fine‑tuning on new tasks while preserving performance on previously learned tasks.
- For enterprises, SDFT offers a single‑model strategy to continuously acquire new skills and proprietary knowledge, reducing the need for multiple specialized models and costly retraining cycles.
Standard SFT produced a model that memorized the facts but struggled to use them in reasoning scenarios; the SDFT model, having internalized the logic during training, scored 98 % on the same questions.
Sequential Learning Experiment
- Setup: The model was trained sequentially on science, tool use, and medical tasks.
- Standard Model: Performance oscillated, losing previous skills as it learned new ones.
- SDFT Model: Successfully accumulated all three skills without regression.
“We offer the ability to maintain only a single model for all the company’s needs,” Shenfeld said.
This consolidation can lead to a substantial reduction in inference costs because organizations don’t need to host multiple models simultaneously.
SDFT Limitations and Availability
- Code: Available on GitHub and ready for integration into existing training pipelines.
- RL‑like Pipeline: “The SDFT pipeline is more similar to the RL pipeline in that it requires online response generation during training,” Shenfeld noted.
- Integration: A pull request is open to add SDFT support to Hugging Face’s Transformer Reinforcement Learning (TRL) library.
Practical Trade‑offs
| Factor | Details |
|---|---|
| Model Size | Requires models with strong in‑context learning (≈ 4 B parameters, e.g., Qwen 3); the researchers expect ~1 B‑parameter models to work soon. |
| Compute Cost | Roughly 2.5 × the compute of standard fine‑tuning. |
| Speed | Approximately four times slower than standard fine‑tuning because the model must generate its own answers (“rollouts”) during training. |
| Knowledge Retention | Better retention reduces the need for costly multi‑stage retraining to fix catastrophic forgetting. |
| Small‑Model Performance | Smaller models currently lack the in‑context learning strength SDFT relies on, though the researchers expect this to improve. |

Looking further ahead, Shenfeld sees SDFT as a step toward continually improving models: “Lifelong learning, together with the ability to extract learning signal from unstructured user interactions… will bring models that just keep and keep improving with time.”

“Think about the fact that already the majority of compute around the world goes into inference instead of training. We have to find ways to harness this compute to improve our models.”
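The consolidation argument can be made concrete with a back-of-the-envelope comparison. All numbers below are illustrative assumptions except the 2.5× training-compute multiplier, which comes from the article; hosting and per-run costs are normalized placeholders:

```python
# Illustrative cost comparison: N specialized SFT models vs. one SDFT model.
SKILLS = 3                                # departments needing a specialized skill
TRAIN_COST_SFT = 1.0                      # normalized cost of one fine-tuning run
TRAIN_COST_SDFT = 2.5 * TRAIN_COST_SFT    # article: roughly 2.5x the compute
HOSTING_PER_MODEL = 10.0                  # normalized cost of hosting one model

# Separate SFT models: one training run and one hosted model per skill.
sft_total = SKILLS * (TRAIN_COST_SFT + HOSTING_PER_MODEL)

# Single SDFT model: pricier training per skill, but only one hosted model.
sdft_total = SKILLS * TRAIN_COST_SDFT + HOSTING_PER_MODEL

print(sft_total, sdft_total)
```

Under these assumed numbers the single-model route is cheaper overall because hosting dominates; the break-even point shifts with the ratio of inference to training costs, which is exactly the dynamic Shenfeld points to.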
Takeaway
SDFT offers a single‑model, multi‑skill solution that can lower inference costs and mitigate catastrophic forgetting, at the expense of higher training compute and slower iteration speed. As model architectures become more capable at smaller scales, the technique is expected to become accessible to a broader range of organizations.