[Paper] Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Published: April 16, 2026 at 12:23 PM EDT

Source: arXiv - 2604.15190v1

Overview

The paper introduces Policy‑Guided Hybrid Simulation (PGHS), a framework for modeling how groups of users react to merchant‑level policy changes on Meituan’s platform. By combining large‑language‑model (LLM) reasoning with traditional machine‑learning fitting, the authors build a “what‑if” simulator that reaches a group‑level simulation error of 8.80 % and can stand in for costly online A/B tests; the evaluation covers 101 merchants.

Key Contributions

Dual‑process simulation framework that merges a reasoning‑oriented LLM branch with a data‑driven ML branch, each handling different aspects of user behavior.
Policy‑guided alignment layer that extracts reusable decision policies from historical trajectories and uses them to synchronize the two branches, preventing the LLM from over‑rationalizing missing context.
Fusion mechanism that blends predictions from both branches, delivering complementary corrections and higher overall fidelity.
Large‑scale deployment on Meituan’s live system covering 101 merchants and more than 26 k user‑merchant interaction trajectories.
Empirical gains: overall group‑level simulation error drops to 8.80 %, a 45.8 % relative reduction versus the best reasoning‑only baseline and a 40.9 % relative reduction versus the best fitting‑only baseline.

Methodology

Data collection – The authors gather sequential interaction logs (e.g., search → click → purchase) for each merchant, forming “trajectories” that capture how users behaved under existing policies.
Policy extraction – From these trajectories they learn decision policies (e.g., “if discount > 10 % and rating > 4.5, probability of purchase ≈ 0.7”). These policies are lightweight, interpretable rules that can be shared across models.
Dual‑process architecture
- Reasoning branch (LLM) – A large language model is prompted with the extracted policies and the current context (merchant attributes, time of day, etc.). It generates a rational prediction of user actions, filling in gaps where the data are sparse.
- Fitting branch (ML) – A conventional supervised model (e.g., gradient‑boosted trees) is trained directly on the raw trajectories, capturing statistical regularities and implicit habits that the LLM may miss.
Alignment via the policy layer – Both branches receive the same policy cues, ensuring they stay grounded in observed decision patterns and reducing the LLM’s tendency to hallucinate.
Fusion – The two predictions are combined (weighted averaging with learned confidence scores) to produce a final group‑level estimate of user behavior under a hypothetical merchant policy.
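
The policy extraction step can be illustrated with a small sketch. It encodes the example rule quoted above as an interpretable object and renders it as a text cue that could be passed to the reasoning branch. The names `PolicyRule`, `match_policy`, and `policy_cue` are hypothetical and are not taken from the paper, which does not describe its rule format at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, float]


@dataclass(frozen=True)
class PolicyRule:
    """One interpretable rule: a condition on the context plus a purchase probability."""
    description: str
    condition: Callable[[Context], bool]
    purchase_prob: float


def match_policy(rules: List[PolicyRule], context: Context) -> Optional[PolicyRule]:
    """Return the first rule whose condition holds, or None if no rule applies."""
    for rule in rules:
        if rule.condition(context):
            return rule
    return None


def policy_cue(rule: Optional[PolicyRule]) -> str:
    """Render a matched rule as a short text cue (hypothetical prompt format)."""
    if rule is None:
        return "No extracted policy applies to this context."
    return f"Policy: {rule.description} -> P(purchase) ~ {rule.purchase_prob:.2f}"


# The example rule from the text; the thresholds are illustrative, not learned here.
RULES = [
    PolicyRule(
        description="discount > 10% and rating > 4.5",
        condition=lambda c: c["discount"] > 0.10 and c["rating"] > 4.5,
        purchase_prob=0.7,
    ),
]

context = {"discount": 0.15, "rating": 4.7}
matched = match_policy(RULES, context)
print(policy_cue(matched))
```

Keeping each rule as a plain condition plus a probability is what makes it shareable: the same object can be printed into an LLM prompt or turned into a feature for the fitting branch.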

The whole pipeline runs offline, enabling rapid counterfactual analysis without exposing real users to experimental changes.
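
The fusion step is described only as weighted averaging with learned confidence scores, so the following is a minimal sketch under that reading rather than the authors' implementation. The function names, the fixed confidence values, and the per‑user probabilities are all assumptions made for illustration; in the paper the confidences are learned.

```python
from typing import Sequence


def fuse(p_reason: float, p_fit: float, c_reason: float, c_fit: float) -> float:
    """Confidence-weighted average of the two branch predictions."""
    total = c_reason + c_fit
    if total <= 0:
        # No usable confidence signal: fall back to a plain average.
        return 0.5 * (p_reason + p_fit)
    return (c_reason * p_reason + c_fit * p_fit) / total


def group_rate(probs: Sequence[float]) -> float:
    """Aggregate per-user probabilities into one group-level rate."""
    return sum(probs) / len(probs)


# Hypothetical purchase probabilities for three users from each branch.
reasoning_preds = [0.70, 0.40, 0.55]  # reasoning (LLM) branch
fitting_preds = [0.60, 0.50, 0.45]    # fitting (ML) branch

# Fixed confidences stand in for the learned scores (assumption).
C_REASON, C_FIT = 0.6, 0.4

fused = [fuse(r, f, C_REASON, C_FIT) for r, f in zip(reasoning_preds, fitting_preds)]
print(f"group-level purchase rate: {group_rate(fused):.3f}")
```

Normalizing by the sum of the confidences keeps the fused value between the two branch predictions, so neither branch can push the estimate outside the range they jointly span.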

Results & Findings

Metric	PGHS	Best Reasoning‑Only	Best Fitting‑Only
Group simulation error (↓)	8.80 %	16.30 %	14.85 %
Relative error reduction achieved by PGHS (↑)	n/a	45.8 %	40.9 %

Error reduction is consistent across merchants of varying sizes and across different policy levers (discount rates, recommendation slots, etc.).
Ablation studies show that removing the policy‑guided alignment inflates LLM error by ~12 %, confirming its stabilizing role.
Fusion benefits: using only one branch yields errors >12 %; the combined output consistently outperforms either component alone.
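
The relative improvements are ratios of error reductions to baseline errors. The sketch below recomputes them from the rounded error values in the table; `relative_reduction` is a hypothetical helper name. With those rounded inputs the ratios come out at roughly 46.0 % and 40.7 %, close to the reported 45.8 % and 40.9 %. The small gap presumably reflects rounding or a different aggregation of the baseline figures, which is an assumption and not something the summary states.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error removed by the new method."""
    return (baseline_error - new_error) / baseline_error


# Error values (in percent) as listed in the results table.
pghs_error = 8.80
baselines = {
    "best reasoning-only": 16.30,
    "best fitting-only": 14.85,
}

reductions = {
    name: relative_reduction(err, pghs_error) for name, err in baselines.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} relative reduction")
```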

Practical Implications

Cost‑effective experimentation – Companies can evaluate dozens of merchant‑level tweaks in a sandbox environment, cutting down on expensive, time‑consuming A/B tests.
Faster product cycles – Product managers get near‑real‑time feedback on policy proposals, enabling rapid iteration on pricing, promotion, or UI changes.
Risk mitigation – Simulating worst‑case scenarios before rollout helps avoid revenue drops or user churn caused by poorly calibrated incentives.
Transferability – The policy‑guided dual‑process design is platform‑agnostic; it can be adapted to other marketplaces (e‑commerce, ride‑hailing, streaming) where group‑level user simulation is valuable.
Developer‑friendly tooling – The authors expose the policy extraction and fusion logic as modular components, making integration into existing data pipelines straightforward.

Limitations & Future Work

Contextual blind spots – While the policy layer curbs over‑rationalization, the LLM still depends on the quality of prompts; rare or novel contexts may be mis‑predicted.
Scalability of policy mining – Extracting interpretable policies from very large or highly heterogeneous datasets can become computationally heavy; the paper suggests approximate rule mining as a possible remedy.
Evaluation scope – Experiments focus on group‑level metrics; individual‑user personalization effects remain unexplored.
Future directions – Extending PGHS to incorporate reinforcement‑learning‑based policy updates, testing on cross‑domain datasets, and automating the confidence‑weight learning for fusion are highlighted as next steps.

Authors

Ziyang Chen
Renbing Chen
Daowei Li
Jinzhi Liao
Jiashen Sun
Ke Zeng
Xiang Zhao

Paper Information

arXiv ID: 2604.15190v1
Categories: cs.AI, cs.CL
Published: April 16, 2026