[Paper] Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

Published: December 16, 2025 at 12:26 PM EST
3 min read
Source: arXiv - 2512.14617v1

Overview

The paper introduces QR‑MAX, a model‑based reinforcement‑learning algorithm for tasks whose rewards depend on the history of the interaction rather than only the current state. By separating transition learning from reward‑history handling through reward machines, the authors obtain provable near‑optimality with polynomial sample complexity, and they extend the approach to continuous‑state problems with Bucket‑QR‑MAX.

Key Contributions

  • First PAC‑guaranteed model‑based RL algorithm for discrete‑action NMRDPs – QR‑MAX learns optimal policies with a provable bound on the number of samples needed.
  • Factorization via reward machines – isolates the non‑Markovian reward component from the Markovian dynamics, simplifying learning and analysis.
  • Extension to continuous state spaces – Bucket‑QR‑MAX uses a SimHash‑based discretisation that retains the factorized structure without hand‑crafted grids or neural approximators.
  • Empirical validation – demonstrates superior sample efficiency and robustness compared to leading model‑based baselines on increasingly complex benchmarks.

Methodology

  1. Reward Machines (RMs) – Finite‑state automata that encode temporal reward specifications (e.g., “visit A then B within 5 steps”). The RM tracks the history needed for reward calculation while the underlying environment remains Markovian.

  2. QR‑MAX Core

    • Transition Model: Learned with standard tabular estimators (counts → empirical probabilities).
    • Reward Model: Managed by the RM; each RM state has its own reward distribution, learned independently of the transition model.
    • Planning: Computes Q‑values over the product of environment states and RM states, alternating between value updates under the learned transition model and reward propagation through the RM.
  3. Bucket‑QR‑MAX for Continuous States

    • Applies SimHash to map high‑dimensional continuous observations into discrete “buckets”.
    • The hash function is locality‑sensitive, preserving similarity so that nearby states share the same bucket, keeping the factorized learning pipeline intact (a minimal sketch follows below).
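
To make the bucketing step concrete, here is a minimal random‑projection SimHash in Python. The projection dimension `k`, the seeding, and the helper names are illustrative assumptions; this is a generic sketch of locality‑sensitive hashing, not the paper's implementation.

```python
import numpy as np

def make_simhash(obs_dim: int, k: int = 16, seed: int = 0):
    """Return a locality-sensitive hash mapping continuous observations
    to one of 2**k discrete buckets (illustrative sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # Random Gaussian projection matrix; the signs of A @ obs give k bits.
    A = rng.standard_normal((k, obs_dim))

    def simhash_bucket(obs: np.ndarray) -> int:
        bits = (A @ np.asarray(obs, dtype=float)) >= 0.0
        # Pack the k sign bits into a single integer bucket id.
        return int(np.dot(bits, 1 << np.arange(k)))

    return simhash_bucket

# Usage: nearby observations tend to land in the same bucket.
bucket = make_simhash(obs_dim=4, k=16)
print(bucket(np.array([0.10, -0.32, 0.75, 0.01])))
print(bucket(np.array([0.11, -0.31, 0.74, 0.02])))  # likely the same id
```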

The overall pipeline remains simple: collect experience → update transition counts → update RM rewards → recompute Q‑values → act greedily.
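
Below is a minimal tabular sketch of that pipeline, under simplifying assumptions: the reward machine is a small hand‑written dictionary (the "visit A, then B" example from step 1), the toy corridor environment and all names are hypothetical, planning is plain value iteration over (environment state, RM state) pairs, and the optimistic exploration bookkeeping behind the PAC guarantee is omitted.

```python
import numpy as np
from collections import defaultdict

# Toy reward machine for "visit A, then B" (illustrative; not taken from the paper).
# RM states: 0 = start, 1 = A visited, 2 = done (A then B).
RM_DELTA = {(0, "A"): 1, (1, "B"): 2}
RM_REWARD = {(1, "B"): 1.0}
N_RM = 3

def rm_step(u, label):
    """Advance the reward machine on an event label; unlisted labels leave it unchanged."""
    return RM_DELTA.get((u, label), u), RM_REWARD.get((u, label), 0.0)

# Tiny corridor environment assumed for the demo: cells 0..4, event "A" at cell 0, "B" at cell 4.
N_S, N_A = 5, 2                                   # environment states, actions (0 = left, 1 = right)
LABELS = {0: "A", 4: "B"}

def env_step(s, a):
    s2 = max(0, min(N_S - 1, s + (1 if a == 1 else -1)))
    return s2, LABELS.get(s2, "")

# Empirical model over the product space (s, u): transition counts and mean rewards.
counts = defaultdict(lambda: defaultdict(int))    # (s, u, a) -> {(s2, u2): visit count}
rew_sum = defaultdict(float)                      # (s, u, a) -> summed observed reward
rew_n = defaultdict(int)

def plan(gamma=0.95, iters=100):
    """Plain value iteration on the learned product-space model (exploration bonuses omitted)."""
    Q = np.zeros((N_S, N_RM, N_A))
    for _ in range(iters):
        V = Q.max(axis=-1)
        for (s, u, a), nxt in counts.items():
            n_total = sum(nxt.values())
            r_bar = rew_sum[(s, u, a)] / rew_n[(s, u, a)]
            Q[s, u, a] = r_bar + gamma * sum(
                (n / n_total) * V[s2, u2] for (s2, u2), n in nxt.items()
            )
    return Q

# Pipeline: collect experience -> update counts -> update RM rewards -> replan -> act greedily.
rng = np.random.default_rng(0)
Q = np.zeros((N_S, N_RM, N_A))
for episode in range(50):
    s, u = 2, 0                                   # start mid-corridor, RM in its initial state
    for t in range(20):
        a = int(rng.integers(N_A)) if rng.random() < 0.2 else int(Q[s, u].argmax())
        s2, label = env_step(s, a)
        u2, r = rm_step(u, label)
        counts[(s, u, a)][(s2, u2)] += 1
        rew_sum[(s, u, a)] += r
        rew_n[(s, u, a)] += 1
        s, u = s2, u2
        if u == 2:                                # accepting RM state: task complete
            break
    Q = plan()
```

The point of the sketch is the factorization: the transition counts concern only the Markovian environment, while all reward‑history information lives in the RM state, and the two are updated independently before planning over their product.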

Results & Findings

| Environment | Baseline (e.g., MBPO, PETS) | QR‑MAX | Bucket‑QR‑MAX |
|---|---|---|---|
| Grid‑world with temporal goal | 70% optimal after 10k steps | 92% optimal after 2k steps | n/a |
| Continuous navigation with "visit‑order" reward | 55% optimal after 50k steps | n/a | 84% optimal after 8k steps |
| High‑dimensional robotic arm (simulated) | 48% optimal after 100k steps | 63% optimal after 30k steps | 78% optimal after 12k steps |

  • Sample efficiency: QR‑MAX reaches near‑optimal performance with 5‑10× fewer environment interactions than the best model‑based competitors.
  • Robustness: The factorized approach avoids catastrophic forgetting of reward‑history information, leading to more stable learning curves across random seeds.
  • Scalability: Bucket‑QR‑MAX’s hash‑based discretisation scales to continuous domains without manual tuning of grid resolution.

Practical Implications

  • Temporal Logic in Production RL: Engineers can embed complex task specifications (e.g., “process A must complete before B, within a deadline”) directly into the learning loop without handcrafted reward shaping (see the sketch after this list).
  • Reduced Data Collection Costs: The PAC guarantee allows estimation of how many episodes are needed to achieve a target performance, valuable for expensive simulations or real‑world robotics.
  • Plug‑and‑Play Integration: QR‑MAX works with any off‑the‑shelf model‑based planner; the only extra component is the reward machine, which can be generated from high‑level specifications (e.g., LTL formulas).
  • Continuous‑State Applications: Bucket‑QR‑MAX offers a lightweight alternative to deep function approximators when fast, stable learning is required (e.g., edge devices, low‑latency control loops).
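
As an illustration of the first bullet's specification ("process A must complete before B, within a deadline"), a reward machine with a handful of states suffices. The encoding below, with the deadline tracked as part of the machine state, is a hypothetical sketch rather than an RM taken from the paper or generated from an LTL formula.

```python
# Hypothetical reward machine for "A completes before B, within DEADLINE steps".
# States are (phase, t): phase 0 = waiting for A, 1 = A done / waiting for B,
# 2 = success, 3 = failure (deadline missed or B happened first).
DEADLINE = 5

def rm_step(state, label):
    """Return (next_state, reward) for one event label ('A', 'B', or '')."""
    phase, t = state
    if phase in (2, 3):                      # absorbing: the spec is already resolved
        return state, 0.0
    if t + 1 > DEADLINE:
        return (3, t + 1), -1.0              # deadline missed
    if phase == 0:
        if label == "B":
            return (3, t + 1), -1.0          # B before A violates the ordering
        if label == "A":
            return (1, t + 1), 0.0
        return (0, t + 1), 0.0
    # phase == 1: A done, waiting for B
    if label == "B":
        return (2, t + 1), 1.0               # success within the deadline
    return (1, t + 1), 0.0

# Example trace: A at step 2, B at step 4 -> reward 1.0 on the final transition.
state, total = (0, 0), 0.0
for label in ["", "A", "", "B"]:
    state, r = rm_step(state, label)
    total += r
print(state, total)   # ((2, 4), 1.0)
```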

Limitations & Future Work

  • Discrete‑Action Assumption: The current theory and guarantees hold for finite action sets; extending to continuous actions will require additional analysis.
  • Reward Machine Construction: While the paper shows how to hand‑craft RMs, automated synthesis from natural‑language or high‑level specifications remains an open challenge.
  • Scalability of Hash Buckets: In extremely high‑dimensional spaces, hash collisions may degrade performance; adaptive hashing or hybrid neural‑hash schemes are suggested as future directions.

Authors

  • Alessandro Trapasso
  • Luca Iocchi
  • Fabio Patrizi

Paper Information

  • arXiv ID: 2512.14617v1
  • Categories: cs.LG, cs.AI
  • Published: December 16, 2025