[Paper] Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning
Source: arXiv - 2512.16813v1
Overview
This paper tackles a pressing problem for autonomous robot swarms: reactive jammers that sense the network’s activity and selectively jam communications, disrupting formation coordination and jeopardizing mission goals. By framing the anti‑jamming problem as a multi‑agent reinforcement learning (MARL) task, the authors show how a swarm can learn to pick frequencies and transmit powers in a coordinated way that stays one step ahead of an adaptive jammer.
Key Contributions
- MARL‑based anti‑jamming framework: Introduces a decentralized yet coordinated learning solution using the QMIX algorithm, which learns a joint action‑value function that can be factorized across individual agents (the factorization condition is recalled after this list).
- Realistic jammer model: Models a reactive jammer with Markovian threshold dynamics that senses aggregate power and decides when/where to jam, reflecting practical adversarial behavior.
- Comprehensive benchmarking: Evaluates QMIX against a genie‑aided optimal policy, a local Upper Confidence Bound (UCB) bandit approach, and a stateless reactive policy, covering both no‑reuse and channel‑reuse fading scenarios.
- Performance close to optimal: Demonstrates that QMIX converges quickly to policies whose throughput comes close to the genie‑aided bound while drastically reducing successful jamming events.
- Scalable to larger swarms: Shows that the factorized value function enables decentralized execution, making the approach viable for swarms with many agents and limited on‑board compute.
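For reference, the factorization behind QMIX is the standard monotonic‑mixing condition from the original QMIX work (Rashid et al.); the notation below is generic QMIX shorthand rather than this paper's exact formulation.

```latex
% Monotonic factorization: per-agent greedy actions recover the joint greedy action
\arg\max_{\mathbf{u}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u}) =
\begin{pmatrix}
  \arg\max_{u^{1}} Q_{1}(\tau^{1}, u^{1}) \\
  \vdots \\
  \arg\max_{u^{N}} Q_{N}(\tau^{N}, u^{N})
\end{pmatrix},
\qquad
\frac{\partial Q_{\mathrm{tot}}}{\partial Q_{i}} \ge 0 \quad \forall i \in \{1, \dots, N\}.
```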
Methodology
System model
- A swarm consists of multiple transmitter‑receiver pairs sharing a set of frequency channels.
- Each agent decides (channel, power) jointly at every time step.
- The reactive jammer monitors the total received power; if it exceeds a hidden threshold, it jams the most interfered channel for the next slot (Markovian dynamics).
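As a concrete illustration of these dynamics, here is a minimal Python sketch of a threshold‑triggered reactive jammer. The class name, the single hidden threshold, and the way the "most interfered" channel is chosen are illustrative assumptions, not the paper's exact model or parameters.

```python
from typing import Optional

import numpy as np


class ReactiveJammer:
    """Threshold-triggered reactive jammer with one-slot (Markovian) memory.

    Illustrative sketch: each slot the jammer senses the aggregate transmit
    power; if it exceeds a hidden threshold, it jams the most heavily loaded
    channel in the next slot.
    """

    def __init__(self, num_channels: int, threshold: float):
        self.num_channels = num_channels
        self.threshold = threshold          # hidden from the swarm
        self.jammed_channel: Optional[int] = None

    def step(self, per_channel_power: np.ndarray) -> Optional[int]:
        """Observe this slot's per-channel powers; return the channel jammed next slot."""
        assert per_channel_power.shape == (self.num_channels,)
        if per_channel_power.sum() > self.threshold:
            # Jam the channel carrying the most power (the "most interfered" one).
            self.jammed_channel = int(np.argmax(per_channel_power))
        else:
            self.jammed_channel = None      # stay silent below the sensing threshold
        return self.jammed_channel
```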
Learning formulation
- The problem is cast as a cooperative Dec‑POMDP: agents share a common reward (e.g., successful packet delivery, low interference).
- QMIX learns a centralized action‑value function Q_tot that is monotonic in each agent’s local Q‑value, allowing the global optimum to be recovered by each agent acting greedily on its own Q‑function.
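The monotonicity constraint is typically enforced with state‑conditioned hypernetworks whose outputs pass through an absolute value, as in the original QMIX design. The PyTorch sketch below is a generic QMIX mixer with illustrative layer sizes, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QMixer(nn.Module):
    """Monotonic mixing network: Q_tot is non-decreasing in each agent's Q_i.

    Hypernetworks map the global state to mixing weights; taking their absolute
    value keeps the weights non-negative, so per-agent greedy actions remain
    consistent with maximizing Q_tot.
    """

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (batch, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                          # (batch, 1, 1)
        return q_tot.squeeze(-1).squeeze(-1)                        # (batch,)
```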
Training pipeline
- Simulated episodes generate state‑action‑reward tuples.
- Experience replay buffers store transitions for off‑policy updates.
- The network architecture uses a recurrent encoder for each agent (to handle partial observability) and a mixing network that enforces the monotonicity constraint.
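A minimal sketch of two of the ingredients named above, a recurrent per‑agent encoder and an experience replay buffer, assuming PyTorch; the class names, layer sizes, and episode storage format are illustrative, not the paper's implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class RecurrentAgentQNet(nn.Module):
    """Per-agent Q-network with a GRU cell that summarizes the observation
    history (handles partial observability); outputs one Q-value per
    (channel, power) action."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        x = torch.relu(self.encoder(obs))
        h_next = self.rnn(x, h)
        return self.q_head(h_next), h_next   # local Q-values and updated hidden state


class EpisodeReplayBuffer:
    """Stores whole episodes of (obs, state, actions, reward) transitions
    for off-policy QMIX updates."""

    def __init__(self, capacity: int = 5000):
        self.buffer = deque(maxlen=capacity)

    def push(self, episode):                 # episode: list of per-step transition dicts
        self.buffer.append(episode)

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), batch_size)
```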
Baselines
- Genie‑aided optimal: exhaustive search over all joint actions (only feasible for small networks).
- Local UCB: each agent treats each (channel, power) pair as a bandit arm and selects via an Upper Confidence Bound rule (a minimal sketch follows this list).
- Stateless reactive: a heuristic that switches channels when jamming is detected, without learning.
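For the local UCB baseline, each agent can be sketched as an independent UCB1 learner over its (channel, power) arms. The exploration constant and incremental-mean update below are standard UCB1 choices assumed for illustration, not values taken from the paper.

```python
import math

import numpy as np


class LocalUCBAgent:
    """Per-agent UCB1 bandit over (channel, power) arms.

    Each agent ignores the other agents and the jammer state, treating every
    (channel, power) pair as an independent arm with an unknown mean reward.
    """

    def __init__(self, num_channels: int, num_power_levels: int, c: float = 2.0):
        self.n_arms = num_channels * num_power_levels
        self.c = c                                  # exploration constant (illustrative)
        self.counts = np.zeros(self.n_arms)
        self.mean_reward = np.zeros(self.n_arms)
        self.t = 0

    def select_arm(self) -> int:
        self.t += 1
        untried = np.where(self.counts == 0)[0]
        if untried.size:                            # play each arm once before using UCB
            return int(untried[0])
        ucb = self.mean_reward + np.sqrt(self.c * math.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]
```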
Results & Findings
| Metric | QMIX | Genie‑aided optimal | Local UCB | Stateless reactive |
|---|---|---|---|---|
| Throughput (fraction of genie‑aided optimal) | 0.92 | 1.00 | 0.68 | 0.55 |
| Jamming success rate | 8 % | 0 % | 31 % | 44 % |
| Convergence time | ≈ 2 k episodes | N/A (offline) | > 10 k episodes | N/A (rule‑based) |
- Rapid convergence: QMIX reaches > 90 % of optimal throughput within a few thousand training episodes, far faster than the UCB baseline.
- Robustness to fading & channel reuse: Even when multiple agents share the same channel under realistic fading, QMIX maintains a clear advantage, adapting power levels to mitigate interference.
- Scalability: Experiments with up to 12 agents show only modest degradation, confirming the factorized value function’s ability to handle larger swarms without exponential blow‑up.
Practical Implications
- Secure swarm deployments: Developers building UAV, ground‑robot, or IoT swarms can embed a lightweight QMIX‑derived policy to autonomously avoid jamming without needing a central controller.
- Dynamic spectrum access: The joint channel‑power selection can be repurposed for civilian spectrum‑sharing scenarios (e.g., industrial IoT in congested ISM bands) where interference is unpredictable.
- Edge‑friendly inference: Once trained, each agent only runs its small local Q‑network to evaluate its Q‑values (the mixing network is needed only during training), fitting within typical embedded compute budgets (e.g., ARM Cortex‑M or low‑power GPUs); see the sketch after this list.
- Rapid adaptation: Because the policy is learned offline but executed online, swarms can be pre‑trained against a family of jammer behaviors and then fine‑tuned on‑site with minimal data, enabling continuous resilience.
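A sketch of the decentralized execution path referenced above: at inference time an agent only forwards its own observation through its local Q‑network and acts greedily. The function signature, hidden‑state handling, and the optional jammed‑action mask are illustrative assumptions, not the paper's API.

```python
import torch


@torch.no_grad()
def act_greedy(agent_qnet: torch.nn.Module, obs: torch.Tensor,
               hidden: torch.Tensor, jammed_mask: torch.Tensor):
    """Decentralized greedy execution for one agent.

    Assumes `agent_qnet(obs, hidden)` returns (local Q-values, next hidden state),
    as in the recurrent agent sketch above. `jammed_mask` is a boolean tensor
    marking (channel, power) actions currently believed unusable -- an
    illustrative convenience, not part of the paper's interface.
    """
    q_values, hidden = agent_qnet(obs, hidden)           # local Q-values only
    q_values = q_values.masked_fill(jammed_mask, -1e9)   # optional action masking
    action = int(torch.argmax(q_values, dim=-1))         # greedy local action
    return action, hidden
```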
Limitations & Future Work
- Training overhead: The current approach relies on extensive simulated episodes; transferring to real hardware may require domain‑randomization or sim‑to‑real techniques.
- Assumed shared reward: The cooperative reward structure presumes all agents have aligned objectives; future work could explore mixed‑cooperation/competition settings (e.g., heterogeneous missions).
- Static jammer model: The jammer follows a Markovian threshold rule; more sophisticated adversaries (e.g., learning jammers) remain an open challenge.
- Scalability beyond dozens of agents: While factorization helps, extremely large swarms may need hierarchical MARL or communication‑efficient approximations.
Overall, the paper demonstrates that modern MARL—specifically QMIX—can give autonomous swarms a practical, data‑driven shield against adaptive jamming, opening the door to more robust field deployments.
Authors
- Bahman Abolhassani
- Tugba Erpek
- Kemal Davaslioglu
- Yalin E. Sagduyu
- Sastry Kompella
Paper Information
- arXiv ID: 2512.16813v1
- Categories: cs.NI, cs.AI, cs.DC, cs.LG, eess.SP
- Published: December 18, 2025