[Paper] Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning

Published: December 18, 2025 at 12:54 PM EST
4 min read
Source: arXiv - 2512.16813v1

Overview

This paper tackles a pressing problem for autonomous robot swarms: reactive jammers that sense the network’s activity and selectively jam communications, disrupting formation coordination and jeopardizing mission objectives. By framing the anti‑jamming problem as a multi‑agent reinforcement learning (MARL) task, the authors show how a swarm can learn to pick frequencies and transmit powers in a coordinated way that stays one step ahead of an adaptive jammer.

Key Contributions

  • MARL‑based anti‑jamming framework: Introduces a decentralized yet coordinated learning solution using the QMIX algorithm, which learns a joint action‑value function that can be factorized for individual agents.
  • Realistic jammer model: Models a reactive jammer with Markovian threshold dynamics that senses aggregate power and decides when/where to jam, reflecting practical adversarial behavior.
  • Comprehensive benchmarking: Evaluates QMIX against a genie‑aided optimal policy, a local Upper Confidence Bound (UCB) bandit approach, and a stateless reactive policy, covering both no‑reuse and channel‑reuse fading scenarios.
  • Performance close to optimal: Demonstrates that QMIX converges quickly to policies that achieve throughput within a few percent of the genie‑aided bound while drastically reducing successful jamming events.
  • Scalable to larger swarms: Shows that the factorized value function enables decentralized execution, making the approach viable for swarms with many agents and limited on‑board compute.

Methodology

  1. System model

    • A swarm consists of multiple transmitter‑receiver pairs sharing a set of frequency channels.
    • Each agent decides (channel, power) jointly at every time step.
    • The reactive jammer monitors the total received power; if it exceeds a hidden threshold, it jams the most interfered channel for the next slot (Markovian dynamics); a minimal sketch of these dynamics follows this list.
  2. Learning formulation

    • The problem is cast as a cooperative Dec‑POMDP: agents share a common reward (e.g., successful packet delivery, low interference).
    • QMIX learns a centralized action‑value function Q_tot that is monotonic in each agent’s local Q‑value, so the greedy joint action under Q_tot can be recovered by each agent acting greedily on its own Q‑function (see the mixing‑network sketch after this list).
  3. Training pipeline

    • Simulated episodes generate state‑action‑reward tuples.
    • Experience replay buffers store transitions for off‑policy updates.
    • The network architecture uses a recurrent encoder for each agent (to handle partial observability) and a mixing network that enforces the monotonicity constraint.
  4. Baselines

    • Genie‑aided optimal: exhaustive search over all joint actions (only feasible for small networks).
    • Local UCB: each agent treats each (channel, power) pair as a bandit arm and selects via an Upper Confidence Bound rule (see the bandit sketch after this list).
    • Stateless reactive: a heuristic that switches channels when jamming is detected, without learning.
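
The Python sketch below illustrates the reactive jammer dynamics from step 1. The threshold value and the example per‑channel powers are illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch of the reactive jammer described in step 1.
# The threshold value and the example powers below are illustrative assumptions.
import numpy as np


class ReactiveJammer:
    def __init__(self, power_threshold: float):
        self.power_threshold = power_threshold  # hidden from the agents
        self.next_jammed = None                 # channel scheduled for jamming next slot

    def step(self, per_channel_power: np.ndarray):
        """Return the channel jammed in the current slot and update the Markovian state.

        If the total sensed power this slot exceeds the hidden threshold, the
        channel carrying the most power is jammed in the *next* slot.
        """
        jammed_now = self.next_jammed
        if per_channel_power.sum() > self.power_threshold:
            self.next_jammed = int(np.argmax(per_channel_power))
        else:
            self.next_jammed = None
        return jammed_now


jammer = ReactiveJammer(power_threshold=5.0)
# Total sensed power 5.5 exceeds the threshold, so channel 1 is scheduled for
# jamming next slot; nothing is jammed in the current slot, so this returns None.
print(jammer.step(np.array([1.2, 3.9, 0.4])))
```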
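Next, a minimal mixing‑network sketch in the spirit of QMIX (step 2). The hypernetwork layer sizes are assumptions; the key point is that taking the absolute value of the generated weights keeps Q_tot monotonic in each agent's local Q‑value, so per‑agent greedy actions remain consistent with the centralized value.

```python
# Mixing-network sketch in the spirit of QMIX (PyTorch). Layer sizes are assumptions;
# the abs() on the hypernetwork outputs is what enforces monotonicity of Q_tot
# in every agent's local Q-value.
import torch
import torch.nn as nn


class QMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks map the global state (available only during training)
        # to the weights and biases of the mixing network.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) local Q-values of the chosen actions
        # state:    (batch, state_dim) global state
        b = agent_qs.size(0)
        qs = agent_qs.view(b, 1, self.n_agents)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(qs, w1) + b1)               # (batch, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)            # Q_tot
```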
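Finally, a sketch of the local bandit baseline from step 4, using a standard UCB1‑style index over (channel, power) arms; the exploration constant and arm indexing are illustrative assumptions.

```python
# Sketch of the local bandit baseline: each (channel, power) pair is one arm,
# selected with a UCB1-style index. The exploration constant and arm indexing
# are illustrative assumptions.
import math
import numpy as np


class LocalUCB:
    def __init__(self, n_channels: int, n_power_levels: int, c: float = 2.0):
        self.n_power_levels = n_power_levels
        self.n_arms = n_channels * n_power_levels
        self.c = c
        self.counts = np.zeros(self.n_arms)
        self.means = np.zeros(self.n_arms)  # running mean reward per arm
        self.t = 0

    def select_arm(self) -> int:
        self.t += 1
        untried = np.flatnonzero(self.counts == 0)
        if untried.size:                      # play every arm once first
            return int(untried[0])
        ucb = self.means + np.sqrt(self.c * math.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


agent = LocalUCB(n_channels=4, n_power_levels=3)
arm = agent.select_arm()
channel, power_level = divmod(arm, agent.n_power_levels)  # decode the joint action
agent.update(arm, reward=1.0)                             # e.g. packet delivered
```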

Results & Findings

| Metric | QMIX | Genie‑aided optimal | Local UCB | Stateless reactive |
| --- | --- | --- | --- | --- |
| Throughput (packets/slot) | 0.92 × optimal | 1.00 | 0.68 × optimal | 0.55 × optimal |
| Jamming success rate | 8 % | 0 % | 31 % | 44 % |
| Convergence time | ≈ 2 k episodes | N/A (offline) | > 10 k episodes | N/A (rule‑based) |

  • Rapid convergence: QMIX reaches > 90 % of optimal throughput within a few thousand training episodes, far faster than the UCB baseline.
  • Robustness to fading & channel reuse: Even when multiple agents share the same channel under realistic fading, QMIX maintains a clear advantage, adapting power levels to mitigate interference.
  • Scalability: Experiments with up to 12 agents show only modest degradation, confirming the factorized value function’s ability to handle larger swarms without exponential blow‑up.

Practical Implications

  • Secure swarm deployments: Developers building UAV, ground‑robot, or IoT swarms can embed a lightweight QMIX‑derived policy to autonomously avoid jamming without needing a central controller.
  • Dynamic spectrum access: The joint channel‑power selection can be repurposed for civilian spectrum‑sharing scenarios (e.g., industrial IoT in congested ISM bands) where interference is unpredictable.
  • Edge‑friendly inference: Once trained, each agent only runs a compact local Q‑network to evaluate its own action values, fitting within typical embedded compute budgets (e.g., ARM Cortex‑M or low‑power GPUs); see the execution sketch after this list.
  • Rapid adaptation: Because the policy is learned offline but executed online, swarms can be pre‑trained against a family of jammer behaviors and then fine‑tuned on‑site with minimal data, enabling continuous resilience.
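
As a rough illustration of the on‑board inference path, the sketch below assumes a small per‑agent recurrent Q‑network (matching the recurrent encoder used during training) that is evaluated once per slot and acted on greedily; the layer sizes, channel count, and power levels are placeholders.

```python
# Sketch of decentralized execution: a small per-agent recurrent Q-network
# evaluated once per slot, acted on greedily. Sizes, channel count, and power
# levels are placeholders, not the paper's configuration.
import torch
import torch.nn as nn


class AgentQNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.fc = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)    # carries history under partial observability
        self.out = nn.Linear(hidden, n_actions)  # one Q-value per (channel, power) pair

    def forward(self, obs, h):
        x = torch.relu(self.fc(obs))
        h = self.gru(x, h)
        return self.out(h), h


N_CHANNELS, N_POWER_LEVELS = 4, 3
net = AgentQNet(obs_dim=16, n_actions=N_CHANNELS * N_POWER_LEVELS)
h = torch.zeros(1, 64)                 # recurrent state, reset at episode start
obs = torch.zeros(1, 16)               # placeholder local observation
with torch.no_grad():
    q, h = net(obs, h)
    action = int(q.argmax(dim=-1))
    channel, power_level = divmod(action, N_POWER_LEVELS)
```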

Limitations & Future Work

  • Training overhead: The current approach relies on extensive simulated episodes; transferring to real hardware may require domain‑randomization or sim‑to‑real techniques.
  • Assumed shared reward: The cooperative reward structure presumes all agents have aligned objectives; future work could explore mixed‑cooperation/competition settings (e.g., heterogeneous missions).
  • Static jammer model: The jammer follows a Markovian threshold rule; more sophisticated adversaries (e.g., learning jammers) remain an open challenge.
  • Scalability beyond dozens of agents: While factorization helps, extremely large swarms may need hierarchical MARL or communication‑efficient approximations.

Overall, the paper demonstrates that modern MARL—specifically QMIX—can give autonomous swarms a practical, data‑driven shield against adaptive jamming, opening the door to more robust field deployments.

Authors

  • Bahman Abolhassani
  • Tugba Erpek
  • Kemal Davaslioglu
  • Yalin E. Sagduyu
  • Sastry Kompella

Paper Information

  • arXiv ID: 2512.16813v1
  • Categories: cs.NI, cs.AI, cs.DC, cs.LG, eess.SP
  • Published: December 18, 2025