[Paper] Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation
Source: arXiv - 2604.15190v1
Overview
The paper introduces Policy‑Guided Hybrid Simulation (PGHS), a framework for modeling how groups of users will react to merchant‑level policy changes on Meituan’s platform. By combining large‑language‑model (LLM) reasoning with traditional machine‑learning fitting, the authors build a “what‑if” simulator that reaches a group‑level simulation error of 8.80 % across 101 merchants, which they present as an offline alternative to costly online A/B tests.
Key Contributions
- Dual‑process simulation framework that merges a reasoning‑oriented LLM branch with a data‑driven ML branch, each handling different aspects of user behavior.
- Policy‑guided alignment layer that extracts reusable decision policies from historical trajectories and uses them to synchronize the two branches, preventing the LLM from over‑rationalizing missing context.
- Fusion mechanism that blends predictions from both branches, delivering complementary corrections and higher overall fidelity.
- Large‑scale deployment on Meituan’s live system covering 101 merchants and more than 26 k user‑merchant interaction trajectories.
- Empirical gains: overall group‑level simulation error drops to 8.80 %, a 45.8 % relative reduction versus the best reasoning‑only baseline and a 40.9 % relative reduction versus the best fitting‑only baseline.
Methodology
- Data collection – The authors gather sequential interaction logs (e.g., search → click → purchase) for each merchant, forming “trajectories” that capture how users behaved under existing policies.
- Policy extraction – From these trajectories they learn decision policies (e.g., “if discount > 10 % and rating > 4.5, probability of purchase ≈ 0.7”). These policies are lightweight, interpretable rules that can be shared across models.
- Dual‑process architecture
  - Reasoning branch (LLM) – A large language model is prompted with the extracted policies and the current context (merchant attributes, time of day, etc.). It generates a rational prediction of user actions, filling in gaps where the data are sparse.
  - Fitting branch (ML) – A conventional supervised model (e.g., gradient‑boosted trees) is trained directly on the raw trajectories, capturing statistical regularities and implicit habits that the LLM may miss.
- Alignment via the policy layer – Both branches receive the same policy cues, ensuring they stay grounded in observed decision patterns and reducing the LLM’s tendency to hallucinate.
- Fusion – The two predictions are combined (weighted averaging with learned confidence scores) to produce a final group‑level estimate of user behavior under a hypothetical merchant policy.
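The policy‑extraction step can be illustrated with a minimal sketch. It assumes a toy `Trajectory` record with `discount`, `rating`, and `purchased` fields and fixed thresholds borrowed from the example rule above; these names and the bucketing approach are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One user-merchant interaction; all fields are hypothetical."""
    discount: float   # fractional discount offered, e.g. 0.15 for 15 %
    rating: float     # merchant rating on a 0-5 scale
    purchased: bool   # whether the trajectory ended in a purchase


def extract_policies(trajectories, discount_cut=0.10, rating_cut=4.5):
    """Estimate purchase probability per (high_discount, high_rating) bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [purchases, total]
    for t in trajectories:
        bucket = (t.discount > discount_cut, t.rating > rating_cut)
        counts[bucket][0] += int(t.purchased)
        counts[bucket][1] += 1
    return {bucket: bought / total for bucket, (bought, total) in counts.items()}


def render_policy(bucket, prob, discount_cut=0.10, rating_cut=4.5):
    """Turn one bucket into a readable rule that could be placed in an LLM prompt."""
    d_op = ">" if bucket[0] else "<="
    r_op = ">" if bucket[1] else "<="
    return (f"if discount {d_op} {discount_cut:.0%} and rating {r_op} {rating_cut}, "
            f"probability of purchase is about {prob:.2f}")


# Toy logs invented for illustration only.
toy_logs = [
    Trajectory(0.15, 4.8, True),
    Trajectory(0.20, 4.7, True),
    Trajectory(0.12, 4.6, False),
    Trajectory(0.05, 4.9, False),
    Trajectory(0.00, 3.9, False),
    Trajectory(0.30, 4.0, True),
]
policies = extract_policies(toy_logs)
for bucket, prob in sorted(policies.items(), reverse=True):
    print(render_policy(bucket, prob))
```

Because the rules are plain text plus a probability, the same output could in principle be fed to both branches: as prompt context for the LLM and as extra features for the fitted model.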
The whole pipeline runs offline, enabling rapid counterfactual analysis without exposing real users to experimental changes.
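A confidence‑weighted fusion of the two branches might look like the sketch below. It uses inverse validation error as a simple stand‑in for the learned confidence scores; that weighting rule, the function names, and all numbers are assumptions for illustration and are not taken from the paper.

```python
def inverse_error_weights(val_errors):
    """Map each branch's validation error to a normalized weight.

    Lower error gives a higher weight. This is a stand-in for the
    learned confidence scores described in the summary.
    """
    inv = {name: 1.0 / max(err, 1e-9) for name, err in val_errors.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}


def fuse(predictions, weights):
    """Weighted average of per-branch group-level predictions."""
    return sum(weights[name] * predictions[name] for name in predictions)


# Hypothetical validation errors and predictions, not taken from the paper.
val_errors = {"reasoning": 0.20, "fitting": 0.10}
weights = inverse_error_weights(val_errors)

# e.g. predicted group purchase rate under a hypothetical merchant policy
predictions = {"reasoning": 0.30, "fitting": 0.24}
fused = fuse(predictions, weights)
print(weights, round(fused, 4))
```

With these toy inputs the fitting branch receives twice the weight of the reasoning branch, so the fused estimate sits closer to its prediction.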
Results & Findings
| Metric | PGHS | Best Reasoning‑Only | Best Fitting‑Only |
|---|---|---|---|
| Group simulation error (↓) | 8.80 % | 16.30 % | 14.85 % |
| PGHS error reduction relative to baseline | n/a | 45.8 % | 40.9 % |
- Error reduction is consistent across merchants of varying sizes and across different policy levers (discount rates, recommendation slots, etc.).
- Ablation studies show that removing the policy‑guided alignment inflates LLM error by ~12 %, confirming its stabilizing role.
- Fusion benefits: using only one branch yields errors >12 %; the combined output consistently outperforms either component alone.
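The relative reductions can be checked with a short calculation. The helper name below is an assumption; the inputs are the error values from the table. Computed this way, the tabulated baselines give roughly 46.0 % and 40.7 %, within a few tenths of a percentage point of the stated 45.8 % and 40.9 %; the small gap may reflect rounding in the tabulated figures.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


pghs_error = 8.80  # percent, from the table
baselines = [("reasoning-only", 16.30), ("fitting-only", 14.85)]

for name, baseline in baselines:
    print(f"{name}: {relative_reduction(baseline, pghs_error):.1%}")
```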
Practical Implications
- Cost‑effective experimentation – Companies can evaluate dozens of merchant‑level tweaks in a sandbox environment, cutting down on expensive, time‑consuming A/B tests.
- Faster product cycles – Product managers get near‑real‑time feedback on policy proposals, enabling rapid iteration on pricing, promotion, or UI changes.
- Risk mitigation – Simulating worst‑case scenarios before rollout helps avoid revenue drops or user churn caused by poorly calibrated incentives.
- Transferability – The policy‑guided dual‑process design is platform‑agnostic; it can be adapted to other marketplaces (e‑commerce, ride‑hailing, streaming) where group‑level user simulation is valuable.
- Developer‑friendly tooling – The authors expose the policy extraction and fusion logic as modular components, making integration into existing data pipelines straightforward.
Limitations & Future Work
- Contextual blind spots – While the policy layer curbs over‑rationalization, the LLM still depends on the quality of prompts; rare or novel contexts may be mis‑predicted.
- Scalability of policy mining – Extracting interpretable policies from very large or highly heterogeneous datasets can become computationally heavy; the paper suggests approximate rule mining as a possible remedy.
- Evaluation scope – Experiments focus on group‑level metrics; individual‑user personalization effects remain unexplored.
- Future directions – Extending PGHS to incorporate reinforcement‑learning‑based policy updates, testing on cross‑domain datasets, and automating the confidence‑weight learning for fusion are highlighted as next steps.
Authors
- Ziyang Chen
- Renbing Chen
- Daowei Li
- Jinzhi Liao
- Jiashen Sun
- Ke Zeng
- Xiang Zhao
Paper Information
- arXiv ID: 2604.15190v1
- Categories: cs.AI, cs.CL
- Published: April 16, 2026