[Paper] Optimizing LLM Code Suggestions: Feedback-Driven Timing with Lightweight State Bounds
Source: arXiv - 2511.18842v1
Overview
Large Language Models (LLMs) have become the backbone of modern code‑completion tools, but suggestions often arrive at the wrong moment: they either interrupt a developer's flow or go unnoticed entirely. This paper introduces a feedback‑driven timing system that learns when to surface LLM‑generated snippets, cutting wasted inference calls and boosting acceptance rates in real development environments.
Key Contributions
- Adaptive timing algorithm that adjusts suggestion latency based on a developer’s recent acceptance/rejection behavior.
- Lightweight state estimator: a binary predictor of the developer’s cognitive “readiness” that bounds the delay range without heavy instrumentation.
- Logistic transformation of acceptance rates to smoothly map recent feedback into a delay decision.
- Field study with professional developers over two months, demonstrating a 3‑fold increase in suggestion acceptance and a 75 % reduction in unnecessary inference calls.
- Open‑source reference implementation (lightweight enough to embed in existing IDE plugins).
Methodology
- Feedback Loop – Each time a suggestion is shown, the system records whether the developer accepts it, rejects it after reading, or never looks at it (blind rejection); the first sketch after this list illustrates how these outcomes might be tracked.
- State Bounding – A simple binary classifier (trained on lightweight signals such as cursor activity, typing speed, and recent IDE events) predicts whether the developer is “cognitively ready” for a suggestion, and this prediction caps the maximum delay (second sketch below).
- Logistic Delay Mapping – Recent acceptance rates are fed through a logistic function, producing a smooth probability curve that translates into a concrete delay (e.g., 200 ms – 2 s). Higher acceptance pushes the delay lower, making suggestions appear sooner (first sketch below).
- Deployment – The algorithm runs inside the IDE plugin, adjusting the delay before each inference request to the LLM; no heavyweight profiling or server‑side changes are required (third sketch below).
- Evaluation – Over two months, the authors compared three configurations: (a) no delay, (b) a static delay (fixed at 500 ms), and (c) the adaptive timing system. Metrics included suggestion acceptance, blind rejection rate, and total inference calls.
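The feedback loop and the logistic delay mapping can be illustrated with a minimal sketch. The names and constants below (FeedbackTracker, the window size, the steepness k, and the midpoint) are assumptions chosen for illustration; the summary specifies only that recent acceptance feedback is passed through a logistic function to pick a delay in roughly the 200 ms – 2 s range.

```python
import math
from collections import deque
from enum import Enum


class Outcome(Enum):
    """Possible fates of a shown suggestion, as described in the feedback loop."""
    ACCEPTED = "accepted"        # developer took the suggestion
    REJECTED = "rejected"        # developer read it, then dismissed it
    BLIND_REJECTED = "blind"     # suggestion was never looked at


class FeedbackTracker:
    """Rolling window of recent suggestion outcomes (window size is an assumption)."""

    def __init__(self, window: int = 50):
        self.events = deque(maxlen=window)

    def record(self, outcome: Outcome) -> None:
        self.events.append(outcome)

    def acceptance_rate(self) -> float:
        if not self.events:
            return 0.0
        accepted = sum(1 for e in self.events if e is Outcome.ACCEPTED)
        return accepted / len(self.events)


def logistic_delay(acceptance_rate: float,
                   min_delay_ms: float = 200.0,
                   max_delay_ms: float = 2000.0,
                   k: float = 10.0,
                   midpoint: float = 0.15) -> float:
    """Map the recent acceptance rate to a delay: higher acceptance, shorter delay."""
    # Logistic curve in (0, 1); steepness k and midpoint are illustrative tuning values.
    p = 1.0 / (1.0 + math.exp(-k * (acceptance_rate - midpoint)))
    # Invert so that a high acceptance rate (p close to 1) yields a delay near min_delay_ms.
    return max_delay_ms - p * (max_delay_ms - min_delay_ms)
```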
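The state bound is described only as a binary predictor built from lightweight signals. The feature set, the linear weights, and the cap values in this second sketch are purely illustrative assumptions; in the paper the predictor is trained, not hand‑tuned.

```python
import math
from dataclasses import dataclass


@dataclass
class EditorSignals:
    """Lightweight IDE signals of the kind the paper mentions (exact features assumed)."""
    idle_ms: float            # time since the last keystroke
    typing_speed_cps: float   # recent typing speed, characters per second
    recent_ide_events: int    # e.g. file switches or undo operations in the last minute


def readiness_score(signals: EditorSignals) -> float:
    """Hypothetical linear model standing in for the learned binary classifier."""
    z = 0.002 * signals.idle_ms - 0.5 * signals.typing_speed_cps - 0.3 * signals.recent_ide_events
    return 1.0 / (1.0 + math.exp(-z))


def bound_delay(delay_ms: float, signals: EditorSignals,
                ready_cap_ms: float = 500.0,
                not_ready_cap_ms: float = 2000.0) -> float:
    """Cap the logistic delay with the binary readiness state (cap values assumed)."""
    cap = ready_cap_ms if readiness_score(signals) >= 0.5 else not_ready_cap_ms
    return min(delay_ms, cap)
```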
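Finally, a third sketch of how the pieces might combine inside the plugin before each inference request. It reuses FeedbackTracker, EditorSignals, logistic_delay, and bound_delay from the sketches above; the request_llm and show_suggestion callbacks are placeholders, not a real IDE or LLM API.

```python
import time


def suggest_once(tracker: FeedbackTracker,
                 signals: EditorSignals,
                 request_llm,        # placeholder: callable that issues one inference call
                 show_suggestion):   # placeholder: shows the snippet, returns an Outcome
    """Wait an adaptively chosen delay, then issue one LLM request and log the outcome."""
    delay_ms = bound_delay(logistic_delay(tracker.acceptance_rate()), signals)

    # A real plugin would schedule a non-blocking timer (and cancel it if the
    # developer keeps typing) instead of sleeping on the UI thread.
    time.sleep(delay_ms / 1000.0)

    suggestion = request_llm()                 # the only inference call for this cycle
    outcome = show_suggestion(suggestion)      # observed developer reaction
    tracker.record(outcome)                    # feeds the next delay decision
```

Because the request is only issued after the wait, a cancelled timer (for instance, when the developer resumes typing) skips the inference call entirely, which is presumably where the reported reduction in calls comes from.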
Results & Findings
| Configuration | Acceptance Rate | Blind Rejection | Inference Calls Saved |
|---|---|---|---|
| No delay | 4.9 % | 8.3 % | – (baseline) |
| Static delay (500 ms) | 15.4 % | 2.1 % | ≈45 % |
| Adaptive timing (this work) | 18.6 % | 0.36 % | ≈75 % |
- Higher acceptance: Adaptive timing nudged suggestions to moments when developers were more receptive, lifting acceptance from under 5 % to nearly 19 %.
- Drastic drop in blind rejections: By postponing suggestions until the developer was likely to notice them, the system cut unread dismissals by more than 20×.
- Cost efficiency: Because each LLM inference can be expensive (especially for large models), the 75 % reduction in unnecessary calls translates directly into lower cloud spend and faster IDE responsiveness.
Practical Implications
- IDE Plugin Developers – The adaptive timing logic can be dropped into existing autocomplete extensions with minimal overhead, improving user experience without redesigning the underlying LLM.
- Enterprise Tooling – Companies that host internal LLM services can reap immediate cost savings by avoiding needless inference calls, especially in large teams where autocomplete traffic is high.
- Developer Productivity – Fewer interruptions and more relevant suggestions mean smoother coding sessions, potentially reducing context‑switching fatigue.
- LLM Service Providers – Offering “smart timing” as a service‑side feature could become a differentiator, letting providers keep per‑call pricing while delivering higher ROI to customers.
Limitations & Future Work
- Binary cognitive state – The current predictor only distinguishes “ready” vs. “not ready,” which may oversimplify nuanced developer states.
- Signal set – The model relies on generic IDE events; richer telemetry (e.g., eye‑tracking, voice commands) could improve accuracy but raises privacy concerns.
- Generalizability – The field study focused on a specific set of professional developers and languages; broader evaluations across diverse tech stacks are needed.
- Model agnosticism – While the approach is lightweight, integrating it with extremely low‑latency on‑device LLMs may require further tuning of the delay bounds.
Future research could explore multi‑class cognitive modeling, incorporate user‑customizable timing preferences, and evaluate long‑term effects on code quality and developer satisfaction.
Authors
- Mohammad Nour Al Awad
- Sergey Ivanov
- Olga Tikhonova
Paper Information
- arXiv ID: 2511.18842v1
- Categories: cs.SE, cs.AI, cs.HC
- Published: November 24, 2025