[Paper] Precision Autotuning for Linear Solvers via Contextual Bandit-Based RL
Source: arXiv - 2601.00728v1
Overview
The paper introduces a reinforcement‑learning (RL) framework that automatically picks the numeric precision (e.g., half, single, double) for each step of a linear‑solver pipeline. By casting the problem as a contextual bandit, the system learns from runtime features (like an estimated condition number) and selects the cheapest precision that still meets a target accuracy. The authors demonstrate the idea on iterative refinement, showing that mixed‑precision runs can match double‑precision quality while cutting compute time.
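To make the setting concrete, here is a minimal NumPy sketch of mixed‑precision iterative refinement in which the precision of each correction solve is a per‑iteration choice (exactly the knob the tuner controls). The precision schedule, test matrix, and tolerance are illustrative assumptions, not the paper's pipeline; note that NumPy's LAPACK wrappers accept only float32/float64, so half precision would need a different backend.

```python
# Minimal sketch of mixed-precision iterative refinement (illustrative only).
# The per-iteration `precision` choice is the quantity the RL tuner selects.
import numpy as np

def iterative_refinement(A, b, precisions, tol=1e-12, max_iter=20):
    """Solve Ax = b in double, computing each correction in a chosen lower precision."""
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    x = np.zeros_like(b64)
    for k in range(max_iter):
        r = b64 - A64 @ x                                 # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        p = precisions[min(k, len(precisions) - 1)]       # per-iteration precision choice
        d = np.linalg.solve(A64.astype(p), r.astype(p))   # correction in lower precision
        x = x + d.astype(np.float64)                      # accumulate update in double
    return x

# Example: early corrections in single precision, later ones in double.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)) + 100 * np.eye(100)  # well-conditioned test matrix
b = rng.standard_normal(100)
x = iterative_refinement(A, b, precisions=[np.float32, np.float32, np.float64])
print(np.linalg.norm(A @ x - b))
```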
Key Contributions
- First RL‑based autotuner for numeric precision in scientific kernels, bridging mixed‑precision computing and reinforcement learning.
- Contextual‑bandit formulation that treats solver state features as context and precision choices as actions, enabling fast, online learning with a simple Q‑table (a minimal sketch of this formulation follows the list).
- Epsilon‑greedy action selection that balances exploration (trying new precisions) and exploitation (using the best‑known configuration).
- Multi‑objective reward design that jointly optimizes solution accuracy and computational cost.
- Empirical validation on unseen linear systems, showing consistent cost reductions without sacrificing double‑precision accuracy.
- Generalizable pipeline that can be extended to other numerical algorithms beyond iterative refinement.
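As a concrete illustration of the contextual‑bandit formulation and epsilon‑greedy selection above, the sketch below maps a cheap condition‑number estimate to a discrete context, looks up a row of a small Q‑table, and picks a precision action. The bucket edges, action labels, and epsilon value are assumptions for illustration, not the authors' configuration.

```python
# Minimal contextual-bandit sketch: discretized contexts index rows of a Q-table,
# precision configurations are the actions, and selection is epsilon-greedy.
# Bucket edges, action labels, and epsilon are illustrative assumptions.
import numpy as np

ACTIONS = ["fp16_correction", "fp32_correction", "fp64_correction"]
COND_EDGES = [1e2, 1e5, 1e8]               # assumed condition-number buckets
N_CONTEXTS = len(COND_EDGES) + 1

Q = np.zeros((N_CONTEXTS, len(ACTIONS)))   # Q-value estimate per (context, action)

def context_of(cond_estimate: float) -> int:
    """Map a cheap condition-number estimate to a discrete context index."""
    return int(np.searchsorted(COND_EDGES, cond_estimate))

def select_action(context: int, epsilon: float = 0.1,
                  rng=np.random.default_rng()) -> int:
    """Epsilon-greedy: explore a random precision with prob. epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[context]))

# Example lookup for a well-conditioned system:
ctx = context_of(50.0)
print(ACTIONS[select_action(ctx)])
```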
Methodology
- Feature Extraction – Before each refinement iteration, cheap-to‑compute statistics are gathered (e.g., an approximate condition number, matrix norm, residual size). These form a low‑dimensional state vector.
- Discretization – The continuous feature space is bucketed into a finite set of “contexts” so that a classic tabular Q‑learning approach can be used.
- Action Space – Each action corresponds to a specific precision configuration for the solver’s sub‑steps (e.g., compute residual in single precision, solve correction in half precision).
- Reward Function – A weighted sum of (i) negative error (to encourage accuracy) and (ii) negative runtime or FLOP count (to encourage speed). The weights let users prioritize one objective over the other.
- Learning Loop – Using an epsilon‑greedy policy, the tuner picks an action, runs the solver step, observes the reward, and updates the Q‑value for the (context, action) pair via incremental averaging; a sketch of this loop follows the list. Over many solves, the Q‑table converges to the best precision policy for each context.
- Deployment – At inference time, the tuner simply looks up the best action for the current context, incurring negligible overhead.
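Putting the pieces together, the sketch below fills in one plausible version of the learning loop: discretize the context, select a precision epsilon‑greedily, run a refinement step, score it with a weighted accuracy/cost reward, and update the Q‑table by incremental averaging. The reward weights, bucket edges, and the use of an exact condition number (the paper relies on cheap estimates) are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the tuner's learning loop: observe context, pick a precision with
# epsilon-greedy, run one refinement step, score it with a weighted accuracy/cost
# reward, and update the Q-table by incremental averaging.
# Weights, bucket edges, and the solver hook are illustrative assumptions.
import time
import numpy as np

ACTIONS = [np.float32, np.float64]            # candidate correction precisions
COND_EDGES = [1e2, 1e5, 1e8]                  # assumed context buckets
Q = np.zeros((len(COND_EDGES) + 1, len(ACTIONS)))
N = np.zeros_like(Q, dtype=int)               # visit counts for incremental averaging
rng = np.random.default_rng(0)

def refinement_step(A, b, x, precision):
    """One iterative-refinement step with the correction solved in `precision`."""
    r = b - A @ x
    d = np.linalg.solve(A.astype(precision), r.astype(precision))
    x_new = x + d.astype(np.float64)
    return x_new, np.linalg.norm(b - A @ x_new)

def tune_and_step(A, b, x, cond_estimate, epsilon=0.1, w_err=1.0, w_time=10.0):
    ctx = int(np.searchsorted(COND_EDGES, cond_estimate))
    a = rng.integers(len(ACTIONS)) if rng.random() < epsilon else int(np.argmax(Q[ctx]))
    t0 = time.perf_counter()
    x_new, err = refinement_step(A, b, x, ACTIONS[a])
    elapsed = time.perf_counter() - t0
    reward = -w_err * err - w_time * elapsed          # multi-objective: accuracy + cost
    N[ctx, a] += 1
    Q[ctx, a] += (reward - Q[ctx, a]) / N[ctx, a]     # incremental-average Q update
    return x_new

# Example: learn over repeated solves of random well-conditioned systems.
# (np.linalg.cond is used here for simplicity; the paper assumes a cheap estimate.)
for _ in range(50):
    A = rng.standard_normal((80, 80)) + 80 * np.eye(80)
    b = rng.standard_normal(80)
    x = np.zeros(80)
    for _ in range(3):
        x = tune_and_step(A, b, x, cond_estimate=np.linalg.cond(A))
print(Q)
```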
Results & Findings
| Metric | Baseline (double‑precision) | RL‑tuned mixed‑precision |
|---|---|---|
| Average runtime reduction | – | ≈ 30 % |
| Solution error (relative) | 1e‑12 (double) | 1.2e‑12 (within 20 % of baseline) |
| Success on unseen matrices | N/A | > 95 % of runs meet target tolerance |
| Overhead of tuner | N/A | < 1 % of total solve time |
Key takeaways: the RL tuner learns to drop precision aggressively when the matrix is well‑conditioned, yet automatically restores higher precision for tougher cases. The Q‑table remains compact (a few kilobytes) and generalizes across problem sizes and distributions not seen during training.
Practical Implications
- Performance‑critical libraries (e.g., PETSc, Trilinos) can embed the bandit tuner to automatically exploit mixed‑precision hardware (Tensor Cores, bfloat16 units) without manual tuning.
- Edge and embedded devices with limited compute can achieve double‑precision‑level accuracy while staying within power budgets by dynamically lowering precision.
- Compiler/runtime frameworks (LLVM, TVM, oneAPI) could expose a “precision‑autotune” pass that generates the feature extraction and Q‑lookup code automatically.
- Cloud services offering linear‑solver APIs can reduce CPU/GPU usage per request, translating directly into cost savings.
- The approach is lightweight enough to learn online: a production system can keep refining its Q‑table as new problem instances arrive, adapting to hardware upgrades or workload shifts.
Limitations & Future Work
- State discretization may become coarse for very high‑dimensional feature sets, potentially missing subtle precision trade‑offs.
- The current reward balances only accuracy vs. runtime; extending to memory bandwidth, energy, or numerical stability would broaden applicability.
- Experiments focus on iterative refinement; applying the same bandit framework to other kernels (e.g., eigenvalue solvers, nonlinear optimization) remains to be demonstrated.
- The method assumes that cheap feature estimates are available; for some problems (e.g., highly sparse or distributed matrices) extracting these features could be non‑trivial.
- Future work could explore function‑approximation RL (e.g., deep Q‑networks) to handle continuous state spaces and larger action sets, as well as meta‑learning to transfer policies across hardware generations.
Authors
- Erin Carson
- Xinye Chen
Paper Information
- arXiv ID: 2601.00728v1
- Categories: cs.LG, math.NA
- Published: January 2, 2026