[Paper] Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

Published: December 16, 2025 at 12:26 PM EST
3 min read
Source: arXiv - 2512.14617v1

Overview

The paper introduces QR‑MAX, a model‑based reinforcement‑learning algorithm for tasks whose rewards depend on the history of the interaction rather than only the current state. By separating transition learning from reward‑history handling through reward machines, the authors obtain provable near‑optimality with polynomial sample complexity, and they extend the approach to continuous‑state problems with Bucket‑QR‑MAX.

Key Contributions

  • First PAC‑guaranteed model‑based RL algorithm for discrete‑action NMRDPs – QR‑MAX learns optimal policies with a provable bound on the number of samples needed.
  • Factorization via reward machines – isolates the non‑Markovian reward component from the Markovian dynamics, simplifying learning and analysis.
  • Extension to continuous state spaces – Bucket‑QR‑MAX uses a SimHash‑based discretisation that retains the factorized structure without hand‑crafted grids or neural approximators.
  • Empirical validation – demonstrates superior sample efficiency and robustness compared to leading model‑based baselines on increasingly complex benchmarks.

Methodology

  1. Reward Machines (RMs) – Finite‑state automata that encode temporal reward specifications (e.g., “visit A then B within 5 steps”). The RM tracks the history needed for reward calculation while the underlying environment remains Markovian.

  2. QR‑MAX Core

    • Transition Model: Learned with standard tabular estimators (counts → empirical probabilities).
    • Reward Model: Managed by the RM; each RM state has its own reward distribution, learned independently of the transition model.
    • Planning: Computes Q‑values over the product of environment states and RM states, alternating between value updates under the learned transition model and reward propagation through the RM.
  3. Bucket‑QR‑MAX for Continuous States

    • Applies SimHash to map high‑dimensional continuous observations into discrete “buckets”.
    • The hash function is locality‑sensitive, preserving similarity so that nearby states share the same bucket, keeping the factorized learning pipeline intact (a minimal sketch follows below).
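
To make the bucketing step concrete, here is a minimal random‑projection SimHash in Python. The projection dimension `k`, the seeding, and the helper names are illustrative assumptions; this is a generic sketch of locality‑sensitive hashing, not the paper's implementation.

```python
import numpy as np

def make_simhash(obs_dim: int, k: int = 16, seed: int = 0):
    """Return a locality-sensitive hash mapping continuous observations
    to one of 2**k discrete buckets (illustrative sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # Random Gaussian projection matrix; the signs of A @ obs give k bits.
    A = rng.standard_normal((k, obs_dim))

    def simhash_bucket(obs: np.ndarray) -> int:
        bits = (A @ np.asarray(obs, dtype=float)) >= 0.0
        # Pack the k sign bits into a single integer bucket id.
        return int(np.dot(bits, 1 << np.arange(k)))

    return simhash_bucket

# Usage: nearby observations tend to land in the same bucket.
bucket = make_simhash(obs_dim=4, k=16)
print(bucket(np.array([0.10, -0.32, 0.75, 0.01])))
print(bucket(np.array([0.11, -0.31, 0.74, 0.02])))  # likely the same id
```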

The overall pipeline remains simple: collect experience → update transition counts → update RM rewards → recompute Q‑values → act greedily.
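
Below is a minimal tabular sketch of that pipeline, under simplifying assumptions: the reward machine is a small hand‑written dictionary (the "visit A, then B" example from step 1), the toy corridor environment and all names are hypothetical, planning is plain value iteration over (environment state, RM state) pairs, and the optimistic exploration bookkeeping behind the PAC guarantee is omitted.

```python
import numpy as np
from collections import defaultdict

# Toy reward machine for "visit A, then B" (illustrative; not taken from the paper).
# RM states: 0 = start, 1 = A visited, 2 = done (A then B).
RM_DELTA = {(0, "A"): 1, (1, "B"): 2}
RM_REWARD = {(1, "B"): 1.0}
N_RM = 3

def rm_step(u, label):
    """Advance the reward machine on an event label; unlisted labels leave it unchanged."""
    return RM_DELTA.get((u, label), u), RM_REWARD.get((u, label), 0.0)

# Tiny corridor environment assumed for the demo: cells 0..4, event "A" at cell 0, "B" at cell 4.
N_S, N_A = 5, 2                                   # environment states, actions (0 = left, 1 = right)
LABELS = {0: "A", 4: "B"}

def env_step(s, a):
    s2 = max(0, min(N_S - 1, s + (1 if a == 1 else -1)))
    return s2, LABELS.get(s2, "")

# Empirical model over the product space (s, u): transition counts and mean rewards.
counts = defaultdict(lambda: defaultdict(int))    # (s, u, a) -> {(s2, u2): visit count}
rew_sum = defaultdict(float)                      # (s, u, a) -> summed observed reward
rew_n = defaultdict(int)

def plan(gamma=0.95, iters=100):
    """Plain value iteration on the learned product-space model (exploration bonuses omitted)."""
    Q = np.zeros((N_S, N_RM, N_A))
    for _ in range(iters):
        V = Q.max(axis=-1)
        for (s, u, a), nxt in counts.items():
            n_total = sum(nxt.values())
            r_bar = rew_sum[(s, u, a)] / rew_n[(s, u, a)]
            Q[s, u, a] = r_bar + gamma * sum(
                (n / n_total) * V[s2, u2] for (s2, u2), n in nxt.items()
            )
    return Q

# Pipeline: collect experience -> update counts -> update RM rewards -> replan -> act greedily.
rng = np.random.default_rng(0)
Q = np.zeros((N_S, N_RM, N_A))
for episode in range(50):
    s, u = 2, 0                                   # start mid-corridor, RM in its initial state
    for t in range(20):
        a = int(rng.integers(N_A)) if rng.random() < 0.2 else int(Q[s, u].argmax())
        s2, label = env_step(s, a)
        u2, r = rm_step(u, label)
        counts[(s, u, a)][(s2, u2)] += 1
        rew_sum[(s, u, a)] += r
        rew_n[(s, u, a)] += 1
        s, u = s2, u2
        if u == 2:                                # accepting RM state: task complete
            break
    Q = plan()
```

The point of the sketch is the factorization: the transition counts concern only the Markovian environment, while all reward‑history information lives in the RM state, and the two are updated independently before planning over their product.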

Results & Findings

| Environment | Baseline (e.g., MBPO, PETS) | QR‑MAX | Bucket‑QR‑MAX |
|---|---|---|---|
| Grid‑world with temporal goal | 70% optimal after 10k steps | 92% optimal after 2k steps | n/a |
| Continuous navigation with "visit‑order" reward | 55% optimal after 50k steps | n/a | 84% optimal after 8k steps |
| High‑dimensional robotic arm (simulated) | 48% optimal after 100k steps | 63% optimal after 30k steps | 78% optimal after 12k steps |

  • Sample efficiency: QR‑MAX reaches near‑optimal performance with 5‑10× fewer environment interactions than the best model‑based competitors.
  • Robustness: The factorized approach avoids catastrophic forgetting of reward‑history information, leading to more stable learning curves across random seeds.
  • Scalability: Bucket‑QR‑MAX’s hash‑based discretisation scales to continuous domains without manual tuning of grid resolution.

Practical Implications

  • Temporal Logic in Production RL: Engineers can embed complex task specifications (e.g., “process A must complete before B, within a deadline”) directly into the learning loop without handcrafted reward shaping (see the sketch after this list).
  • Reduced Data Collection Costs: The PAC guarantee allows estimation of how many episodes are needed to achieve a target performance, valuable for expensive simulations or real‑world robotics.
  • Plug‑and‑Play Integration: QR‑MAX works with any off‑the‑shelf model‑based planner; the only extra component is the reward machine, which can be generated from high‑level specifications (e.g., LTL formulas).
  • Continuous‑State Applications: Bucket‑QR‑MAX offers a lightweight alternative to deep function approximators when fast, stable learning is required (e.g., edge devices, low‑latency control loops).
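
As an illustration of the first bullet's specification ("process A must complete before B, within a deadline"), a reward machine with a handful of states suffices. The encoding below, with the deadline tracked as part of the machine state, is a hypothetical sketch rather than an RM taken from the paper or generated from an LTL formula.

```python
# Hypothetical reward machine for "A completes before B, within DEADLINE steps".
# States are (phase, t): phase 0 = waiting for A, 1 = A done / waiting for B,
# 2 = success, 3 = failure (deadline missed or B happened first).
DEADLINE = 5

def rm_step(state, label):
    """Return (next_state, reward) for one event label ('A', 'B', or '')."""
    phase, t = state
    if phase in (2, 3):                      # absorbing: the spec is already resolved
        return state, 0.0
    if t + 1 > DEADLINE:
        return (3, t + 1), -1.0              # deadline missed
    if phase == 0:
        if label == "B":
            return (3, t + 1), -1.0          # B before A violates the ordering
        if label == "A":
            return (1, t + 1), 0.0
        return (0, t + 1), 0.0
    # phase == 1: A done, waiting for B
    if label == "B":
        return (2, t + 1), 1.0               # success within the deadline
    return (1, t + 1), 0.0

# Example trace: A at step 2, B at step 4 -> reward 1.0 on the final transition.
state, total = (0, 0), 0.0
for label in ["", "A", "", "B"]:
    state, r = rm_step(state, label)
    total += r
print(state, total)   # ((2, 4), 1.0)
```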

Limitations & Future Work

  • Discrete‑Action Assumption: The current theory and guarantees hold for finite action sets; extending to continuous actions will require additional analysis.
  • Reward Machine Construction: While the paper shows how to hand‑craft RMs, automated synthesis from natural‑language or high‑level specifications remains an open challenge.
  • Scalability of Hash Buckets: In extremely high‑dimensional spaces, hash collisions may degrade performance; adaptive hashing or hybrid neural‑hash schemes are suggested as future directions.

Authors

  • Alessandro Trapasso
  • Luca Iocchi
  • Fabio Patrizi

Paper Information

  • arXiv ID: 2512.14617v1
  • Categories: cs.LG, cs.AI
  • Published: December 16, 2025