[Paper] VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
Source: arXiv - 2602.16681v1
Overview
The paper introduces VETime, a novel framework that fuses raw time‑series data with visual representations to achieve zero‑shot anomaly detection. By aligning fine‑grained temporal cues with image‑based context, VETime resolves the long‑standing trade‑off between point‑level precision and global pattern awareness, delivering strong detection performance without any task‑specific training.
Key Contributions
- First multimodal time‑series anomaly detection (TSAD) architecture that jointly exploits 1‑D temporal signals and 2‑D visual patterns through a reversible image conversion pipeline.
- Patch‑Level Temporal Alignment (PTA) module that creates a shared visual‑temporal timeline, preserving per‑timestamp detail while enabling global context modeling.
- Anomaly Window Contrastive Learning (AWCL) that teaches the model to distinguish normal from anomalous windows without labeled anomalies.
- Task‑Adaptive Multi‑Modal Fusion (TAMF) that dynamically weights temporal vs. visual features based on the characteristics of each input segment.
- Zero‑shot capability: the system works out‑of‑the‑box on unseen datasets, outperforming state‑of‑the‑art baselines while using less compute than pure vision‑based methods.
Methodology
- Reversible Image Conversion – The raw series is reshaped into a 2‑D “image” (e.g., via Gramian Angular Field or Recurrence Plot) and can be converted back without loss, ensuring that visual processing never discards temporal fidelity.
- Patch‑Level Temporal Alignment – The image is split into patches; each patch is linked to its original timestamp via a lightweight alignment network, producing a joint embedding that respects both spatial and temporal ordering.
- Dual‑Branch Backbone –
- Temporal branch: a lightweight 1‑D transformer or CNN that excels at point‑wise anomaly scoring.
- Visual branch: a pre‑trained vision transformer (ViT) that captures long‑range patterns across the whole series.
- Anomaly Window Contrastive Learning – During pre‑training, randomly sampled windows are labeled as “normal” or “synthetic anomaly” (created by perturbations). The model learns to pull together embeddings of normal windows while pushing apart those containing anomalies.
- Task‑Adaptive Multi‑Modal Fusion – A gating mechanism evaluates the confidence of each branch for a given window and blends their anomaly scores, allowing the system to lean on the temporal branch for sharp spikes and on the visual branch for subtle drifts.
All components are trained once on generic time‑series corpora; a new dataset can then be handled zero‑shot, with no fine‑tuning required.
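To make the reversibility claim concrete, here is a minimal NumPy sketch of a Gramian Angular Summation Field (GASF) encoding, one of the transforms the paper mentions. The `[0, 1]` scaling is an implementation choice made here (not specified by the paper) so that the diagonal‑based inverse is exact:

```python
import numpy as np

def series_to_gasf(x):
    """Encode a 1-D series as a Gramian Angular Summation Field image.

    Scaling to [0, 1] keeps arccos in [0, pi/2], so cos(phi) >= 0 and the
    diagonal-based inverse below recovers the series exactly.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    x_scaled = (x - lo) / (hi - lo)              # assumes a non-constant series
    phi = np.arccos(np.clip(x_scaled, 0.0, 1.0))  # polar angle per timestamp
    gasf = np.cos(phi[:, None] + phi[None, :])    # image pixel (i, j) = cos(phi_i + phi_j)
    return gasf, (lo, hi)

def gasf_to_series(gasf, bounds):
    """Invert the encoding from the image diagonal: G_ii = cos(2 * phi_i)."""
    lo, hi = bounds
    # cos(2*phi) = 2*cos(phi)**2 - 1, and cos(phi) >= 0 here, so:
    x_scaled = np.sqrt((np.diag(gasf) + 1.0) / 2.0)
    return x_scaled * (hi - lo) + lo
```

Round‑tripping a window through `series_to_gasf` and `gasf_to_series` reproduces it to floating‑point precision, which is the property the paper relies on to guarantee that visual processing discards no temporal information.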
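The paper's alignment network is learned, but the index bookkeeping it must respect can be sketched directly: in a Gramian‑style image, pixel `(i, j)` is a function of timestamps `i` and `j`, so a patch maps back to the union of its row and column index ranges. This helper is purely illustrative, not the paper's code:

```python
def patch_timestamps(patch_row, patch_col, patch_size):
    """Map an image patch back to the series timestamps it depends on.

    Pixel (i, j) of a Gramian-style image mixes timestamps i and j, so a
    patch covers the union of its row-index and column-index ranges.
    """
    rows = range(patch_row * patch_size, (patch_row + 1) * patch_size)
    cols = range(patch_col * patch_size, (patch_col + 1) * patch_size)
    return sorted(set(rows) | set(cols))
```

For example, with 4×4 patches, the off‑diagonal patch `(0, 1)` touches timestamps 0–7, while the diagonal patch `(1, 1)` touches only timestamps 4–7; it is this many‑to‑many mapping that the alignment module must encode to keep per‑timestamp detail.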
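The paper does not give the AWCL loss in closed form; the following is a plausible NT‑Xent‑style sketch in NumPy, with a hypothetical spike perturbation standing in for the paper's synthetic‑anomaly generation:

```python
import numpy as np

def synthesize_anomaly(window, rng, spike_scale=5.0):
    """Create a synthetic anomaly by injecting a single spike.

    A hypothetical perturbation; the paper uses handcrafted perturbations
    without specifying their exact form.
    """
    w = window.copy()
    i = rng.integers(len(w))
    w[i] += spike_scale * w.std()
    return w

def window_contrastive_loss(normal_emb, anomaly_emb, temperature=0.1):
    """Pull normal-window embeddings together, push anomalous ones away.

    For each normal anchor, the other normal windows are positives and the
    synthetic-anomaly windows are negatives (NT-Xent-style log-softmax).
    """
    def normalize(e):
        return e / np.linalg.norm(e, axis=1, keepdims=True)
    z_n, z_a = normalize(normal_emb), normalize(anomaly_emb)
    loss = 0.0
    for i in range(len(z_n)):
        pos = np.delete(z_n, i, axis=0) @ z_n[i] / temperature  # other normals
        neg = z_a @ z_n[i] / temperature                        # anomalies
        log_denom = np.log(np.exp(np.concatenate([pos, neg])).sum())
        loss += -(pos - log_denom).mean()
    return loss / len(z_n)
```

By construction, the loss is lower when anomaly embeddings sit far from the normal cluster than when they overlap it, which is exactly the separation the pre‑training objective is meant to induce.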
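The TAMF gating idea can be illustrated with a softmax over per‑branch confidences. In the paper the confidence estimates would come from the branches themselves; this scalar version is a deliberate simplification:

```python
import numpy as np

def fuse_scores(temporal_score, visual_score, temporal_conf, visual_conf):
    """Confidence-weighted fusion of per-window anomaly scores.

    A softmax gate over the two confidences blends the branch scores, so the
    fused score leans on the temporal branch for sharp spikes and on the
    visual branch for subtle drifts.
    """
    logits = np.array([temporal_conf, visual_conf])
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return w[0] * temporal_score + w[1] * visual_score
```

With equal confidences the gate averages the two scores; as one branch's confidence grows, the fused score converges to that branch's output, which is the behavior described for spike‑ versus drift‑dominated windows.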
Results & Findings
| Dataset (Zero‑Shot) | F1‑Score (Point) | F1‑Score (Window) | Avg. Inference Time (ms) |
|---|---|---|---|
| NAB (real‑world) | 0.84 | 0.78 | 12 |
| UCR Anomaly Suite | 0.81 | 0.74 | 15 |
| Yahoo S5 | 0.79 | 0.71 | 13 |
- VETime outperforms the best 1‑D baselines (e.g., LSTM‑AD, TCN) by 7–12 % in F1 while matching or beating vision‑only models (e.g., TimeGAN‑ViT) that require heavy fine‑tuning.
- The dynamic fusion reduces false positives on noisy point anomalies by ~30 % compared to using either branch alone.
- Computationally, VETime runs ~2× faster than pure vision‑based pipelines because the visual branch processes a compact image (often 64 × 64) and the temporal branch handles only short patches.
Practical Implications
- Plug‑and‑play anomaly monitoring: DevOps teams can drop VETime into existing telemetry pipelines (e.g., Prometheus, Grafana) and immediately start detecting both spikes and gradual drifts without labeling data.
- Edge deployment: The lightweight temporal branch and modest image size keep memory footprints low, making it feasible for IoT gateways or on‑device health monitoring.
- Cross‑domain reuse: Because the model is trained in a zero‑shot fashion, the same checkpoint can be applied to logs, sensor streams, financial tick data, or even user‑behavior metrics, saving the cost of domain‑specific model training.
- Improved alert precision: The fine‑grained alignment means alerts can be pinpointed to the exact timestamp, which is crucial for automated remediation scripts that need to know when an anomaly started.
Limitations & Future Work
- The reversible image conversion currently relies on fixed transforms (e.g., Gramian Angular Field); more expressive, learnable encodings could capture richer dynamics.
- While zero‑shot performance is strong, the authors note a modest drop on ultra‑high‑frequency data where the visual resolution becomes a bottleneck.
- Future research directions include extending the framework to multivariate series with heterogeneous sampling rates and exploring self‑supervised anomaly synthesis techniques to further reduce reliance on handcrafted perturbations.
Authors
- Yingyuan Yang
- Tian Lan
- Yifei Gao
- Yimeng Lu
- Wenjun He
- Meng Wang
- Chenghao Liu
- Chen Zhang
Paper Information
- arXiv ID: 2602.16681v1
- Categories: cs.CV
- Published: February 18, 2026