New technique makes AI models leaner and faster while they’re still learning

Published: April 9, 2026 at 09:00 AM EDT
5 min read
Source: MIT News - AI

Overview

Training a large artificial‑intelligence model is expensive—not only in dollars, but also in time, energy, and computational resources. Traditionally, obtaining a smaller, faster model either requires training a massive one first and then trimming it down, or training a small one from scratch and accepting weaker performance.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems (ELLIS), ETH Zurich, and Liquid AI have now developed a method that sidesteps this trade‑off entirely: compressing models during training, rather than after.

The Technique: CompreSSM

  • Paper: CompreSSM (arXiv 2510.02823)
  • Target architectures: State‑space models (SSMs), which power language processing, audio generation, robotics, and more.
  • Core idea: Borrow mathematical tools from control theory to identify “heavy‑weight” versus “dead‑weight” components early in training, then surgically remove the unnecessary parts.

“It’s essentially a technique to make models grow smaller and faster as they are training,” says Makram Chahine, a PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author. “During learning, they’re also getting rid of parts that are not useful to their development.”

How It Works

  1. Early‑stage ranking – After only ~10 % of the training process, the relative importance of each internal state stabilizes.
  2. Hankel singular values – These quantify how much each state contributes to the model’s overall behavior.
  3. Pruning – Dimensions with low singular values are discarded; the remaining 90 % of training proceeds with a much smaller model.
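
To make steps 2 and 3 concrete, here is a minimal NumPy/SciPy sketch (not the authors’ code) of how Hankel singular values can be computed for a discrete‑time linear state‑space model and how the lowest‑ranked state dimensions can then be truncated. In the CompreSSM setting, something like this would be applied to the learned state matrices after the warm‑up phase, with training then continuing on the reduced model.

```python
import numpy as np
from scipy.linalg import cholesky, solve_discrete_lyapunov, svd

def hankel_singular_values(A, B, C):
    """Hankel singular values of the discrete-time LTI system
    x[t+1] = A x[t] + B u[t],  y[t] = C x[t]  (A must be stable)."""
    # Controllability/observability Gramians from the discrete Lyapunov equations
    #   P = A P A^T + B B^T   and   Q = A^T Q A + C^T C.
    P = solve_discrete_lyapunov(A, B @ B.T)
    Q = solve_discrete_lyapunov(A.T, C.T @ C)
    # The HSVs are the square roots of the eigenvalues of P Q.
    return np.sort(np.sqrt(np.abs(np.linalg.eigvals(P @ Q).real)))[::-1]

def balanced_truncation(A, B, C, keep):
    """Drop the lowest-HSV state dimensions via square-root balanced truncation."""
    P = solve_discrete_lyapunov(A, B @ B.T)
    Q = solve_discrete_lyapunov(A.T, C.T @ C)
    Lp = cholesky(P, lower=True)      # assumes the Gramians are positive definite;
    Lq = cholesky(Q, lower=True)      # in practice a small ridge may be needed
    U, s, Vt = svd(Lq.T @ Lp)         # the singular values s are the HSVs
    S = np.diag(s[:keep] ** -0.5)
    T = Lp @ Vt[:keep].T @ S          # truncated balancing transform
    Ti = S @ U[:, :keep].T @ Lq.T     # left inverse of T
    return Ti @ A @ T, Ti @ B, C @ T  # reduced system with `keep` states

# Toy usage: a random stable 16-state system reduced to its 4 dominant states.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # rescale for stability
B, C = rng.standard_normal((16, 2)), rng.standard_normal((2, 16))
print(hankel_singular_values(A, B, C))            # per-state importance scores
A_r, B_r, C_r = balanced_truncation(A, B, C, keep=4)
```

Balanced truncation is used here only as a standard, well‑understood way to discard low‑HSV dimensions; the paper’s exact pruning procedure inside a training loop may differ.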

“What’s exciting about this work is that it turns compression from an afterthought into part of the learning process itself,” says senior author Daniela Rus, MIT professor and director of CSAIL.

Empirical Results

  • CIFAR‑10 (image classification): the compressed model (≈¼ of the full size) reaches 85.7 % accuracy versus 81.8 % for the full‑size baseline, while training up to 1.5× faster.
  • Mamba (state‑space architecture): a 128‑dimensional model compressed to 12 dimensions trains roughly 4× faster with competitive performance.
  • Key observation: The compressed model retains the performance of the larger model because most of the complex dynamics are already captured during the warm‑up phase.
  • Comparison to alternatives:
    • Hankel nuclear‑norm regularization: more than 40× slower and less accurate, because it requires an eigenvalue computation at every gradient step.
    • Knowledge distillation: Requires a full‑size “teacher” model plus a “student” model, effectively doubling training cost; distilled small models train slower than the full‑size baseline and suffer larger accuracy drops at high compression ratios.
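
The teacher‑student overhead of distillation is easy to see in a generic sketch. The snippet below is the standard soft‑label (Hinton‑style) distillation loss, not the specific setup compared in the paper: producing `teacher_logits` requires a forward pass through the full‑size teacher on every batch, which is what roughly doubles the training cost.

```python
import numpy as np

def softened(logits, T):
    """Temperature-scaled softmax."""
    z = logits / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between the teacher's and the student's
    temperature-softened output distributions, scaled by T^2."""
    p_t = softened(teacher_logits, T)
    p_s = softened(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (T * T) * kl.mean()

# Toy usage with random logits for a batch of 8 examples and 10 classes.
rng = np.random.default_rng(0)
print(distillation_loss(rng.standard_normal((8, 10)), rng.standard_normal((8, 10))))
```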

Theoretical Foundations

  • Smoothness of importance: Using Weyl’s theorem, the authors prove that the importance of individual states changes smoothly during training.
  • Stability of rankings: Empirical evidence shows that state rankings remain stable after the early warm‑up, giving confidence that early‑pruned dimensions will not become critical later.
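
For reference, the standard Weyl perturbation bound for singular values is stated below; we assume the paper’s smoothness argument rests on an inequality of essentially this form, with E playing the role of the small change a single gradient update makes to the relevant matrices.

```latex
% Weyl's perturbation inequality for singular values: perturbing a matrix A
% by E moves each singular value by at most the spectral norm of E.
\[
  \bigl| \sigma_i(A + E) - \sigma_i(A) \bigr| \;\le\; \| E \|_2
  \qquad \text{for all } i,
\]
% so a small gradient update (small E) can only change each importance
% score slightly from one step to the next.
```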

Practical Safety Net

If a compression step unexpectedly degrades performance, practitioners can revert to a previously saved checkpoint. This gives users fine‑grained control over the trade‑off between performance and resource savings, without relying on opaque energy thresholds.
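
A rollback of this kind is ordinary checkpointing logic. The sketch below uses hypothetical names (`prune_states`, `val_metric`, `keep_dims`, `max_drop`) purely for illustration; it is not the CompreSSM code.

```python
import copy

def compress_with_rollback(model, val_metric, keep_dims, max_drop=0.01):
    """Hypothetical safety net around one compression step: keep a full copy of
    the model, prune, and fall back to the copy if validation performance drops
    by more than `max_drop`."""
    checkpoint = copy.deepcopy(model)        # saved full-size model (the checkpoint)
    before = val_metric(model)
    model.prune_states(keep_dims)            # e.g. drop low-HSV state dimensions
    if before - val_metric(model) > max_drop:
        return checkpoint                    # degradation too large: revert
    return model                             # keep training the smaller model
```

In practice the checkpoint would live on disk rather than in memory, but the control flow is the same: the user picks the acceptable drop and keeps the smaller model only when it clears that bar.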

Limitations

  • Model‑dependence: CompreSSM works best on architectures where internal state dimension strongly correlates with overall performance.
  • Task suitability: The method shines on multi‑input, multi‑output (MIMO) models, where the link between state size and expressivity is strongest.

For per‑channel, single‑input, single‑output architectures, the gains are more modest, since those models are less sensitive to state‑dimension changes in the first place. The theory applies most cleanly to linear time‑invariant systems, although the team has developed extensions for the increasingly popular input‑dependent, time‑varying architectures. Because the family of state‑space models extends to architectures like linear attention—a growing alternative to traditional transformers—the potential scope of application is broad.

Future Directions

Chahine and his collaborators see the work as a stepping stone. The team has already demonstrated an extension to linear time‑varying systems like Mamba, and future work includes pushing CompreSSM further into matrix‑valued dynamical systems used in linear attention mechanisms, bringing the technique closer to the transformer architectures that underpin most of today’s largest AI systems.

“This had to be the first step, because this is where the theory is neat and the approach can stay principled,” Chahine says. “It’s the stepping stone to then extend to other architectures that people are using in industry today.”

“The work of Chahine and his colleagues provides an intriguing, theoretically grounded perspective on compression for modern state‑space models (SSMs),” says Antonio Orvieto, ELLIS Institute Tübingen principal investigator and MPI for Intelligent Systems independent group leader, who wasn’t involved in the research. “The method provides evidence that the state dimension of these models can be effectively reduced during training and that a control‑theoretic perspective can successfully guide this procedure. The work opens new avenues for future research, and the proposed algorithm has the potential to become a standard approach when pre‑training large SSM‑based models.”

The work, which was accepted as a conference paper at the International Conference on Learning Representations 2026, will be presented later this month. It was supported, in part, by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.
