[Paper] ShapeCond: Fast Shapelet-Guided Dataset Condensation for Time Series Classification
Source: arXiv - 2602.09008v1
Overview
Time‑series datasets are exploding in size—from high‑frequency financial ticks to minute‑by‑minute climate sensors—making storage, transmission, and model training increasingly costly. ShapeCond tackles this head‑on by learning a tiny synthetic training set that still captures the essential “shape” patterns (shapelets) needed for accurate classification. The result is a condensation method that is both much faster than prior approaches and more accurate on downstream tasks.
Key Contributions
- Shapelet‑guided condensation: Introduces a novel optimization that explicitly preserves discriminative local motifs (shapelets) during synthetic data generation.
- Length‑independent synthesis cost: The computational burden does not grow with sequence length, yielding up to 29× speed‑ups over the previous state‑of‑the‑art (CondTSC).
- Scalable to very long series: Demonstrated to run up to 10,000× faster than naïve shapelet‑based methods on a 3,000‑timestep Sleep dataset.
- State‑of‑the‑art accuracy: Consistently outperforms existing time‑series condensation techniques across a broad benchmark suite.
- Open‑source implementation: Fully reproducible code released on GitHub, encouraging adoption and further research.
Methodology
Shapelet Extraction:
- The pipeline first mines a compact set of highly discriminative shapelets from the original training series using a fast, greedy search.
- These shapelets act as “anchors” that capture the most informative local patterns for each class; a minimal mining sketch follows.
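As a concrete illustration of this step, here is a minimal Python sketch of greedy shapelet mining. It assumes univariate series in a NumPy array X of shape (n_series, T) with binary integer labels y; the random candidate sampling and the mean‑distance‑gap score are stand‑ins for the paper's actual search strategy and quality criterion.

```python
import numpy as np

def min_dist(shapelet: np.ndarray, series: np.ndarray) -> float:
    """Smallest Euclidean distance between the shapelet and any
    window of the series (the standard shapelet 'activation')."""
    l = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, l)
    return float(np.min(np.linalg.norm(windows - shapelet, axis=1)))

def mine_shapelets(X, y, length=20, n_shapelets=10, n_candidates=200, seed=0):
    """Greedily keep the candidate subsequences whose distance profiles
    best separate the two classes (illustrative heuristic)."""
    rng = np.random.default_rng(seed)
    # Sample random candidate subsequences instead of enumerating all of them.
    cands = [X[i, s:s + length]
             for i, s in zip(rng.integers(len(X), size=n_candidates),
                             rng.integers(X.shape[1] - length + 1,
                                          size=n_candidates))]
    scored = []
    for c in cands:
        d = np.array([min_dist(c, x) for x in X])
        # Two-class heuristic: gap between per-class mean distances.
        scored.append((abs(d[y == 0].mean() - d[y == 1].mean()), c))
    scored.sort(key=lambda t: -t[0])            # best separators first
    return [c for _, c in scored[:n_shapelets]]
```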
Guided Synthetic Generation:
- Instead of directly optimizing synthetic series to match the full dataset (as in image‑centric methods), ShapeCond optimizes a small synthetic set to reproduce the responses of the extracted shapelets.
- The loss function combines a standard classification loss (e.g., cross‑entropy on a proxy model) with a shapelet similarity term that forces the synthetic series to trigger the same shapelet activations as the originals; a sketch of this combined objective follows.
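Below is a minimal PyTorch sketch of the combined objective. The function names, the use of a min sliding‑window distance as the “activation,” and the per‑class target activations target_act (precomputed on the real data) are illustrative assumptions rather than the paper's exact definitions; model is any proxy classifier mapping a (batch, T) tensor of series to class logits.

```python
import torch
import torch.nn.functional as F

def shapelet_activations(x: torch.Tensor, shapelets) -> torch.Tensor:
    """Min sliding-window distance of each shapelet to each series in x.
    x: (batch, T); returns a (batch, n_shapelets) activation matrix."""
    acts = []
    for s in shapelets:                       # each s is a 1-D tensor
        windows = x.unfold(1, len(s), 1)      # (batch, n_windows, len(s))
        dists = torch.linalg.vector_norm(windows - s, dim=-1)
        acts.append(dists.min(dim=1).values)  # best match per series
    return torch.stack(acts, dim=1)

def condensation_loss(model, x_syn, y_syn, shapelets, target_act, lam=1.0):
    """Cross-entropy on a proxy model plus a shapelet-alignment term."""
    ce = F.cross_entropy(model(x_syn), y_syn)
    act = shapelet_activations(x_syn, shapelets)
    align = F.mse_loss(act, target_act[y_syn])  # match class-wise targets
    return ce + lam * align
```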
Length‑Independent Optimization:
- Because the shapelet term depends only on the positions and values of a handful of short subsequences, the gradient computation scales with the number of shapelets, not the full series length (made concrete after this list).
- This design enables the condensation process to stay fast even when dealing with thousands of timesteps.
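To make the scaling concrete (an illustrative reading of the claim above, not a formula from the paper): with K shapelets of length ℓ, the shapelet term touches only the K·ℓ values inside the matched windows of each synthetic series, so its gradient cost is roughly O(K·ℓ) per series, independent of the series length T. Objectives that match full‑length statistics or training trajectories pay at least O(T) per series instead.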
Iterative Refinement:
- The synthetic set is updated via stochastic gradient descent, alternating between improving classification performance and tightening shapelet alignment.
- Early stopping is guided by a held‑out validation subset to avoid over‑fitting the tiny synthetic set; a minimal loop sketch follows.
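The sketch below shows one way this refinement loop could look, reusing condensation_loss from the previous snippet. The alternating loss weighting, optimizer settings, and the hypothetical train_and_score helper (which trains a throwaway classifier on the current synthetic set and returns its validation accuracy) are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def condense(model, x_syn, y_syn, shapelets, target_act, train_and_score,
             steps=2000, lr=1e-2, eval_every=50, patience=5):
    x_syn = x_syn.clone().requires_grad_(True)  # the synthetic set is learnable
    opt = torch.optim.Adam([x_syn], lr=lr)
    best_acc, best_x, bad_evals = 0.0, x_syn.detach().clone(), 0
    for step in range(steps):
        opt.zero_grad()
        # Alternate emphasis between classification and shapelet alignment.
        lam = 1.0 if step % 2 == 0 else 0.1
        condensation_loss(model, x_syn, y_syn, shapelets,
                          target_act, lam=lam).backward()
        opt.step()
        if (step + 1) % eval_every == 0:
            # Early stopping: periodically score the synthetic set on held-out
            # data via the hypothetical train_and_score helper.
            acc = train_and_score(x_syn.detach(), y_syn)
            if acc > best_acc:
                best_acc, best_x, bad_evals = acc, x_syn.detach().clone(), 0
            else:
                bad_evals += 1
                if bad_evals >= patience:
                    break
    return best_x
```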
Results & Findings
| Dataset (length, timesteps) | CondTSC Acc. | ShapeCond Acc. | Synthesis speed‑up |
|---|---|---|---|
| ECG200 (96) | 78.3 % | 84.1 % | 12× |
| Sleep (3,000) | 71.5 % | 78.9 % | 10,000× |
| UCR‑HAR (128) | 88.2 % | 90.7 % | 29× |
- Accuracy gains of 3–7 percentage points over the best prior condenser, especially pronounced on long‑sequence datasets.
- Synthesis time drops from hours (CondTSC) to minutes or seconds, making condensation feasible as a pre‑processing step in real pipelines.
- Ablation studies confirm that the shapelet‑guided term is the primary driver of both speed and performance improvements.
Practical Implications
- Faster model prototyping: Developers can now generate a tiny, high‑fidelity training set in minutes, enabling rapid iteration on model architectures or hyper‑parameters without loading the full dataset.
- Edge and IoT deployment: Condensed datasets can be shipped to constrained devices (e.g., wearables, embedded sensors) where bandwidth and storage are limited, yet the model still learns the critical patterns.
- Data‑privacy & compliance: Synthetic data that preserves only discriminative motifs reduces the risk of exposing raw user‑level time‑series, easing GDPR‑type concerns.
- Cost‑effective cloud training: Training on a 0.5 %‑sized synthetic set can cut GPU hours dramatically, translating to lower cloud bills for large‑scale time‑series services.
- Cross‑domain adaptability: Because shapelets are domain‑agnostic (they capture local shape, not absolute values), ShapeCond can be applied to finance, health monitoring, industrial IoT, and more with minimal tuning.
Limitations & Future Work
- Shapelet discovery overhead: While the condensation step is fast, the initial shapelet mining still incurs a cost proportional to the original dataset size; scaling this step to millions of series remains an open challenge.
- Class imbalance sensitivity: The current formulation assumes roughly balanced classes; heavily skewed datasets may need additional weighting or sampling strategies.
- Extension to multivariate series: The paper focuses on univariate time series; adapting the shapelet‑guided loss to handle multiple synchronized channels is a natural next step.
- Integration with downstream pipelines: Future work could explore joint training where the condenser and the final classifier are co‑optimized, potentially squeezing even more performance out of the synthetic set.
ShapeCond demonstrates that respecting the unique temporal structure of time‑series—specifically, the power of shapelets—can unlock both speed and accuracy gains in dataset condensation. For developers wrestling with ever‑growing sensor streams, this approach offers a practical path to leaner, faster, and more privacy‑friendly machine‑learning pipelines.
Authors
- Sijia Peng
- Yun Xiong
- Xi Chen
- Yi Xie
- Guanzhi Li
- Yanwei Yu
- Yangyong Zhu
- Zhiqiang Shen
Paper Information
- arXiv ID: 2602.09008v1
- Categories: cs.LG
- Published: February 9, 2026