[Paper] Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Published: April 16, 2026 at 12:23 PM EDT

Source: arXiv - 2604.15190v1

Overview

The paper introduces Policy‑Guided Hybrid Simulation (PGHS), a method for modeling how groups of users react to merchant‑level policy changes on Meituan’s platform. By combining large‑language‑model (LLM) reasoning with traditional machine‑learning fitting, the authors build an offline “what‑if” simulator that reaches a group‑level simulation error of 8.80 % across 101 merchants and can stand in for costly online A/B tests.

Key Contributions

  • Dual‑process simulation framework that merges a reasoning‑oriented LLM branch with a data‑driven ML branch, each handling different aspects of user behavior.
  • Policy‑guided alignment layer that extracts reusable decision policies from historical trajectories and uses them to synchronize the two branches, preventing the LLM from over‑rationalizing missing context.
  • Fusion mechanism that blends predictions from both branches, delivering complementary corrections and higher overall fidelity.
  • Large‑scale deployment on Meituan’s live system covering 101 merchants and more than 26 k user‑merchant interaction trajectories.
  • Empirical gains: overall group‑level simulation error drops to 8.80 %, a 45.8 % relative reduction against the best reasoning‑only baseline and a 40.9 % relative reduction against the best fitting‑only baseline.

Methodology

  1. Data collection – The authors gather sequential interaction logs (e.g., search → click → purchase) for each merchant, forming “trajectories” that capture how users behaved under existing policies.
  2. Policy extraction – From these trajectories they learn decision policies (e.g., “if discount > 10 % and rating > 4.5, probability of purchase ≈ 0.7”). These policies are lightweight, interpretable rules that can be shared across models.
  3. Dual‑process architecture
    • Reasoning branch (LLM) – A large language model is prompted with the extracted policies and the current context (merchant attributes, time of day, etc.). It generates a reasoned prediction of user actions, filling in gaps where the data are sparse.
    • Fitting branch (ML) – A conventional supervised model (e.g., gradient‑boosted trees) is trained directly on the raw trajectories, capturing statistical regularities and implicit habits that the LLM may miss.
  4. Alignment via the policy layer – Both branches receive the same policy cues, ensuring they stay grounded in observed decision patterns and reducing the LLM’s tendency to hallucinate.
  5. Fusion – The two predictions are combined (weighted averaging with learned confidence scores) to produce a final group‑level estimate of user behavior under a hypothetical merchant policy.
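
To make steps 2 and 4 more concrete, here is a minimal Python sketch of how one extracted policy could be stored and handed to both branches. It uses the example rule quoted in step 2. The `PolicyRule` class, the `Context` dictionary, and the helper functions are hypothetical names chosen for illustration; the paper's actual policy representation is not described in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical flat feature dict describing one user-merchant context.
Context = Dict[str, float]


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical container for one extracted decision policy."""
    description: str                      # human-readable form, usable in an LLM prompt
    condition: Callable[[Context], bool]  # when the rule applies
    purchase_prob: float                  # estimated purchase probability when it applies


# The example rule from step 2, written out explicitly.
DISCOUNT_RULE = PolicyRule(
    description="discount > 10% and rating > 4.5 -> purchase probability ~ 0.7",
    condition=lambda ctx: ctx["discount"] > 0.10 and ctx["rating"] > 4.5,
    purchase_prob=0.7,
)


def matching_rules(rules: List[PolicyRule], ctx: Context) -> List[PolicyRule]:
    """Return the rules whose condition holds in this context."""
    return [rule for rule in rules if rule.condition(ctx)]


def prompt_cues(rules: List[PolicyRule], ctx: Context) -> str:
    """Text cues that could be placed in the reasoning (LLM) branch's prompt."""
    matched = matching_rules(rules, ctx)
    return "\n".join(f"- {rule.description}" for rule in matched) or "- (no policy applies)"


def policy_feature(rules: List[PolicyRule], ctx: Context, default: float = 0.0) -> float:
    """Numeric cue for the fitting (ML) branch: mean probability over matching rules."""
    matched = matching_rules(rules, ctx)
    if not matched:
        return default
    return sum(rule.purchase_prob for rule in matched) / len(matched)


ctx = {"discount": 0.15, "rating": 4.7}
print(prompt_cues([DISCOUNT_RULE], ctx))
print(policy_feature([DISCOUNT_RULE], ctx))
```

The design point is that a single rule object yields both a text cue and a numeric feature, so the two branches see the same policy signal in the form each one can consume.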

The whole pipeline runs offline, enabling rapid counterfactual analysis without exposing real users to experimental changes.
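
The fusion in step 5 is described as weighted averaging with learned confidence scores. The sketch below shows only the averaging itself, assuming one scalar prediction and one non-negative confidence per branch. The function name `fuse`, its parameters, and the numbers are illustrative assumptions, and the confidences are fixed by hand here rather than learned.

```python
def fuse(pred_reasoning: float, pred_fitting: float,
         conf_reasoning: float, conf_fitting: float) -> float:
    """Confidence-weighted average of the two branch predictions."""
    if conf_reasoning < 0 or conf_fitting < 0:
        raise ValueError("confidence scores must be non-negative")
    total = conf_reasoning + conf_fitting
    if total == 0:
        raise ValueError("at least one confidence score must be positive")
    return (conf_reasoning * pred_reasoning + conf_fitting * pred_fitting) / total


# Hypothetical group-level purchase rates predicted by each branch.
llm_rate, ml_rate = 0.20, 0.40

# Hand-picked confidences standing in for learned ones (assumption).
fused = fuse(llm_rate, ml_rate, conf_reasoning=1.0, conf_fitting=3.0)
print(f"fused purchase rate: {fused:.2f}")
```

With this form the fused value always lies between the two branch predictions, so it pulls the estimate toward the more trusted branch without extrapolating beyond either one.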

Results & Findings

| Metric | PGHS | Best Reasoning‑Only | Best Fitting‑Only |
|---|---|---|---|
| Group simulation error (↓) | 8.80 % | 16.30 % | 14.85 % |
| Relative improvement (PGHS vs. baseline) | n/a | 45.8 % reduction | 40.9 % reduction |

  • Error reduction is consistent across merchants of varying sizes and across different policy levers (discount rates, recommendation slots, etc.).
  • Ablation studies show that removing the policy‑guided alignment inflates LLM error by ~12 %, confirming its stabilizing role.
  • Fusion benefits: using only one branch yields errors >12 %; the combined output consistently outperforms either component alone.
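
The relative improvements quoted above follow the usual relative-reduction formula, (baseline error − PGHS error) / baseline error. The short sketch below applies it to the error values listed in the table. The function name is an illustrative choice. Because the table values are rounded, the recomputed figures come out at roughly 46.0 % and 40.7 %, within a few tenths of a point of the quoted 45.8 % and 40.9 %; the small gap may simply reflect rounding in the baseline errors.

```python
def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Fraction by which the model lowers the baseline's error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - model_error) / baseline_error


pghs_error = 8.80  # group simulation error in percent, as listed in the table

baselines = {
    "best reasoning-only": 16.30,
    "best fitting-only": 14.85,
}

for name, baseline_error in baselines.items():
    reduction = 100 * relative_reduction(baseline_error, pghs_error)
    print(f"{name}: {reduction:.1f} % relative reduction")
```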

Practical Implications

  • Cost‑effective experimentation – Companies can evaluate dozens of merchant‑level tweaks in a sandbox environment, cutting down on expensive, time‑consuming A/B tests.
  • Faster product cycles – Product managers get near‑real‑time feedback on policy proposals, enabling rapid iteration on pricing, promotion, or UI changes.
  • Risk mitigation – Simulating worst‑case scenarios before rollout helps avoid revenue drops or user churn caused by poorly calibrated incentives.
  • Transferability – The policy‑guided dual‑process design is platform‑agnostic; it can be adapted to other marketplaces (e‑commerce, ride‑hailing, streaming) where group‑level user simulation is valuable.
  • Developer‑friendly tooling – The authors expose the policy extraction and fusion logic as modular components, making integration into existing data pipelines straightforward.

Limitations & Future Work

  • Contextual blind spots – While the policy layer curbs over‑rationalization, the LLM still depends on the quality of prompts; rare or novel contexts may be mis‑predicted.
  • Scalability of policy mining – Extracting interpretable policies from very large or highly heterogeneous datasets can become computationally heavy; the paper suggests approximate rule mining as a possible remedy.
  • Evaluation scope – Experiments focus on group‑level metrics; individual‑user personalization effects remain unexplored.
  • Future directions – Extending PGHS to incorporate reinforcement‑learning‑based policy updates, testing on cross‑domain datasets, and automating the confidence‑weight learning for fusion are highlighted as next steps.

Authors

  • Ziyang Chen
  • Renbing Chen
  • Daowei Li
  • Jinzhi Liao
  • Jiashen Sun
  • Ke Zeng
  • Xiang Zhao

Paper Information

  • arXiv ID: 2604.15190v1
  • Categories: cs.AI, cs.CL
  • Published: April 16, 2026