[Paper] Large Language Models for Limited Noisy Data: A Gravitational Wave Identification Study

Published: December 3, 2025 at 01:13 PM EST
4 min read
Source: arXiv - 2512.04031v1

Overview

The paper explores whether large language models (LLMs) can outperform conventional neural networks when the data are scarce, noisy, and non‑Gaussian—a common situation in astrophysics. Using only 90 real LIGO gravitational‑wave (GW) events, the authors fine‑tune LLMs and achieve 97.4 % accuracy in distinguishing true GW signals from noise, suggesting that LLMs can learn directly from limited observational data without massive simulated training sets.

Key Contributions

  • LLM‑centric pipeline for GW signal identification that works with a tiny, real‑world dataset (90 events).
  • Empirical demonstration that adding more simulated GW samples does not improve LLM performance, unlike traditional convolutional or recurrent networks.
  • Scaling analysis showing predictable accuracy gains as model size and genuine data volume increase.
  • Cross‑domain insight: the same approach could be transferred to other noisy astronomical domains (e.g., radio transients, pulsar timing).
  • Open‑source baseline (code & fine‑tuned checkpoints) released for reproducibility and rapid adoption.

Methodology

  1. Data preparation – The authors collect 90 publicly released LIGO events (both confirmed GW signals and noise triggers). Each event is represented as a time‑frequency spectrogram, which is then tokenized into a sequence of visual “patch” tokens compatible with transformer architectures.
  2. Model selection – Several pre‑trained LLMs (e.g., GPT‑Neo, LLaMA‑7B) are repurposed as multimodal encoders. The language‑model weights remain largely intact; only a lightweight classification head is added.
  3. Fine‑tuning – Using a standard cross‑entropy loss, the models are trained for a few epochs on the 90‑sample set, employing data augmentation (time‑shifts, slight frequency scaling) to mitigate over‑fitting; a minimal code sketch of steps 1–3 follows this list.
  4. Baselines – Classical CNNs and RNNs are trained on the same 90 real events and on enlarged synthetic datasets (thousands of simulated waveforms) to provide a fair comparison.
  5. Scaling experiments – The authors systematically vary model size (from 1 B to 13 B parameters) and the number of real training samples (30, 60, 90) to chart performance trends.
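
The following is a minimal sketch of steps 1–3 in PyTorch and Hugging Face Transformers, assuming a linear projection of spectrogram patches into the embedding space of a frozen pre‑trained LLM and a small classification head on top. The backbone name, patch size, and the class `SpectrogramPatchClassifier` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SpectrogramPatchClassifier(nn.Module):
    """Frozen LLM backbone fed with spectrogram patch embeddings (illustrative)."""
    def __init__(self, backbone_name="EleutherAI/gpt-neo-125m",
                 patch_h=16, patch_w=16, n_classes=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        for p in self.backbone.parameters():      # keep language-model weights intact
            p.requires_grad = False
        d_model = self.backbone.config.hidden_size
        self.patch_h, self.patch_w = patch_h, patch_w
        self.patch_proj = nn.Linear(patch_h * patch_w, d_model)  # patch -> token embedding
        self.head = nn.Linear(d_model, n_classes)                # lightweight classification head

    def forward(self, spec):                      # spec: (batch, freq_bins, time_bins)
        b = spec.shape[0]
        patches = (spec
                   .unfold(1, self.patch_h, self.patch_h)        # tile the frequency axis
                   .unfold(2, self.patch_w, self.patch_w)        # tile the time axis
                   .reshape(b, -1, self.patch_h * self.patch_w))
        tokens = self.patch_proj(patches)                        # (batch, n_patches, d_model)
        hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(hidden.mean(dim=1))                     # logits: noise vs. GW signal

# One illustrative fine-tuning step with the standard cross-entropy loss.
model = SpectrogramPatchClassifier()
optim = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
spec_batch = torch.rand(4, 128, 128)              # stand-in spectrograms (freq x time)
labels = torch.tensor([1, 0, 1, 0])               # 1 = GW signal, 0 = noise trigger
loss = nn.functional.cross_entropy(model(spec_batch), labels)
loss.backward()
optim.step()
```

Freezing the backbone keeps the language‑model weights intact, so only the patch projection and the classification head are updated during the short fine‑tuning run described above.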

Results & Findings

| Approach | Training data | Accuracy | Comment |
| --- | --- | --- | --- |
| Fine‑tuned LLM (13 B) | 90 real LIGO events | 97.4 % | Highest score; stable across runs |
| Fine‑tuned LLM (7 B) | 90 real events | 95.8 % | Slight drop, still superior |
| CNN | 90 real events | 84.2 % | Over‑fits quickly |
| CNN | 5 k simulated + 90 real | 88.5 % | Gains from simulation, but still behind LLM |
| RNN | 5 k simulated + 90 real | 86.9 % | Similar trend |
  • No benefit from extra simulated data for LLMs: performance plateaus after the 90 real samples.
  • Predictable scaling: each doubling of model parameters yields ~1–2 % accuracy gain when data are limited.
  • Robustness to noise: LLMs maintain high precision even when non‑Gaussian, non‑stationary noise is injected into the test spectrograms (an illustrative perturbation is sketched below).
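
As an illustration of the kind of stress test this bullet describes (the paper's exact noise model is not specified in this summary), the sketch below adds heavy‑tailed, time‑varying noise to held‑out spectrograms before re‑scoring a trained classifier; the Student‑t distribution and the linear amplitude envelope are assumptions.

```python
import torch

def perturb(spec: torch.Tensor, df: float = 3.0, scale: float = 0.1) -> torch.Tensor:
    """Add Student-t (non-Gaussian) noise whose amplitude drifts over time (non-stationary)."""
    t_bins = spec.shape[-1]
    heavy_tailed = torch.distributions.StudentT(df).sample(spec.shape)
    envelope = torch.linspace(0.5, 1.5, t_bins)   # noise amplitude grows across the time axis
    return spec + scale * heavy_tailed * envelope

clean = torch.rand(4, 128, 128)                   # stand-in held-out spectrograms
noisy = perturb(clean)
# acc = (classifier(noisy).argmax(dim=1) == labels).float().mean()  # re-score a trained model
```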

Practical Implications

  • Rapid prototyping: Researchers can fine‑tune an off‑the‑shelf LLM on a handful of real observations and obtain a production‑ready classifier, cutting down the need for costly simulation pipelines.
  • Resource efficiency: Since LLMs don’t require massive synthetic datasets, storage and compute budgets are lowered—especially valuable for smaller observatories or citizen‑science projects.
  • Cross‑modal extensions: The tokenization strategy works for any time‑frequency data (e.g., Fast Radio Bursts, pulsar timing arrays), opening a path to unified LLM‑based pipelines across multi‑messenger astronomy.
  • Real‑time alerts: A lightweight classification head on a pre‑trained LLM can be deployed at LIGO‑Virgo data centers to flag candidate events within seconds, improving multi‑messenger follow‑up coordination (see the triage sketch after this list).
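
A hypothetical triage loop for such a deployment is sketched below, reusing the `SpectrogramPatchClassifier` from the Methodology sketch; the checkpoint file name and alert threshold are placeholders, not artifacts released with the paper.

```python
import time
import torch

model = SpectrogramPatchClassifier()              # class from the Methodology sketch
# model.load_state_dict(torch.load("gw_llm_classifier.pt", map_location="cpu"))  # hypothetical checkpoint
model.eval()

def flag_candidate(spectrogram: torch.Tensor, threshold: float = 0.9) -> bool:
    """Return True when the estimated GW-signal probability exceeds the alert threshold."""
    with torch.no_grad():
        prob_signal = torch.softmax(model(spectrogram.unsqueeze(0)), dim=1)[0, 1]
    return prob_signal.item() > threshold

start = time.perf_counter()
is_candidate = flag_candidate(torch.rand(128, 128))   # stand-in live spectrogram segment
latency_ms = (time.perf_counter() - start) * 1e3
print(f"candidate: {is_candidate}, end-to-end latency: {latency_ms:.1f} ms")
```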

Limitations & Future Work

  • Model size vs. latency: The best‑performing 13 B‑parameter model still incurs non‑trivial inference latency; pruning or distillation will be needed for real‑time pipelines.
  • Generalization to unseen sources: The study focuses on binary black‑hole mergers; performance on neutron‑star or exotic waveforms remains untested.
  • Interpretability: While attention maps hint at which spectrogram regions drive decisions, a systematic explainability analysis is still missing.
  • Broader validation: Future work should benchmark the approach on other observatories (e.g., KAGRA, LISA) and on truly heterogeneous datasets (radio, X‑ray).

Bottom line: This research shows that large language models, when fine‑tuned on a modest set of real gravitational‑wave observations, can outshine traditional neural networks even in the toughest noise environments—potentially reshaping how astronomers build data‑driven detectors across the spectrum.

Authors

  • Yixuan Li
  • Yuhao Lu
  • Yang Liu
  • Liang Li
  • R. Ruffini
  • Di Li
  • Rong-Gen Cai
  • Xiaoyan Zhu
  • Wenbin Lin
  • Yu Wang

Paper Information

  • arXiv ID: 2512.04031v1
  • Categories: astro-ph.IM, astro-ph.HE, cs.AI
  • Published: December 3, 2025
