[Paper] PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Published: April 14, 2026 at 01:27 PM EDT

Source: arXiv - 2604.12995v1

Overview

The paper introduces PolicyBench, which it presents as the first large‑scale benchmark for measuring how well large language models (LLMs) understand and reason about public policy in the United States and China. Building on this benchmark, the authors also present PolicyMoE, a Mixture‑of‑Experts (MoE) architecture whose specialist “experts” are aligned with three cognitive levels (memorization, understanding, application). The work exposes gaps in current LLMs on real‑world policy questions and proposes a concrete path toward more reliable, policy‑aware AI assistants.

Key Contributions

PolicyBench dataset: 21K carefully curated policy cases spanning more than 10 domains (health, finance, environment, etc.) and two jurisdictions (US and China).
Three‑tier evaluation based on Bloom’s taxonomy:
1. Memorization – factual recall of statutes, regulations, and key figures.
2. Understanding – conceptual reasoning and contextual interpretation.
3. Application – solving concrete policy‑driven scenarios (e.g., compliance checks, impact analysis).
PolicyMoE model: an MoE LLM in which each specialist expert is fine‑tuned on data specific to one Bloom level, so that a query can be routed to the most appropriate specialist.
Comprehensive analysis of several state‑of‑the‑art LLMs (GPT‑4, Claude, LLaMA‑2, etc.) on PolicyBench, revealing systematic weaknesses in higher‑order reasoning.
Open‑source release of the benchmark, evaluation scripts, and the PolicyMoE checkpoints to foster community research.

Methodology

Data Collection & Curation
- Extracted policy documents, legislative texts, and regulatory guidelines from official US and Chinese government portals.
- Engaged policy analysts to annotate each case with a Bloom‑level label and to write multiple-choice and open‑ended questions.
Benchmark Construction
- Split the 21K cases into train/validation/test sets while preserving domain and jurisdiction balance.
- Designed three task formats: factual recall (multiple‑choice), conceptual explanation (short answer), and scenario‑based problem solving (structured reasoning).
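
A minimal sketch of how a split that preserves domain and jurisdiction balance could be built. The field names (`domain`, `jurisdiction`), the 80/10/10 ratios, and the toy data are illustrative assumptions; the summary does not state the authors' actual procedure or ratios.

```python
import random
from collections import defaultdict


def stratified_split(cases, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split cases into train/val/test, stratified on (domain, jurisdiction).

    Field names and ratios are assumptions made for illustration.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case in cases:
        groups[(case["domain"], case["jurisdiction"])].append(case)

    train, val, test = [], [], []
    for key in sorted(groups):  # sorted for reproducibility
        bucket = groups[key]
        rng.shuffle(bucket)
        n_train = int(len(bucket) * ratios[0])
        n_val = int(len(bucket) * ratios[1])
        train.extend(bucket[:n_train])
        val.extend(bucket[n_train:n_train + n_val])
        test.extend(bucket[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy data: 2 domains x 2 jurisdictions, 10 cases per stratum.
cases = [
    {"id": f"{d}-{j}-{i}", "domain": d, "jurisdiction": j}
    for d in ("health", "finance")
    for j in ("US", "CN")
    for i in range(10)
]
train, val, test = stratified_split(cases)
print(len(train), len(val), len(test))
```

Splitting inside each (domain, jurisdiction) stratum, rather than over the pooled cases, keeps every combination represented in all three sets in roughly the same proportion.
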
PolicyMoE Architecture
- Built on a base LLM (LLaMA‑2‑13B) and added four expert modules:
  - Memorization Expert – fine‑tuned on pure fact‑retrieval data.
  - Understanding Expert – fine‑tuned on conceptual Q&A.
  - Application Expert – fine‑tuned on scenario‑based reasoning.
  - Generalist Expert – retains the original base model capabilities.
- A lightweight router predicts the Bloom level of an incoming query and forwards it to the corresponding expert.
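
The routing step can be sketched as follows. This is only an illustration: the keyword cues stand in for the learned router, the expert callables are hypothetical placeholders for fine‑tuned models, and falling back to the generalist when no level matches is an assumption, since the summary does not say when the Generalist Expert is used.

```python
from typing import Callable, Dict

# Hypothetical cue phrases per Bloom level; a real router would be a
# trained classifier rather than a keyword lookup.
CUES = {
    "memorization": ("which statute", "what year", "who signed", "define"),
    "understanding": ("why", "explain", "intent", "interpret"),
    "application": ("comply", "scenario", "should the company", "impact of"),
}


def predict_level(query: str) -> str:
    """Return the Bloom level with the most cue matches, else 'generalist'."""
    q = query.lower()
    scores = {level: sum(cue in q for cue in cues) for level, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "generalist"


def make_expert(name: str) -> Callable[[str], str]:
    """Placeholder expert that just tags the query with its own name."""
    return lambda query: f"[{name}] {query}"


EXPERTS: Dict[str, Callable[[str], str]] = {
    name: make_expert(name)
    for name in ("memorization", "understanding", "application", "generalist")
}


def route(query: str) -> str:
    """Predict the Bloom level and forward the query to that expert."""
    return EXPERTS[predict_level(query)](query)


print(route("What year was the Clean Air Act passed?"))
print(route("Does this data pipeline comply with the new rule?"))
print(route("Hello there"))
```

Keeping the router separate from the experts means the classifier can be swapped or retrained without touching the expert modules.
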
Evaluation
- Measured accuracy for multiple‑choice, BLEU/ROUGE for short answers, and exact‑match/structured‑reasoning scores for application tasks.
- Compared PolicyMoE against vanilla LLMs and against a single‑expert fine‑tuned baseline.
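
Two of the metrics named above, accuracy and exact match, can be sketched in a few lines. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are an assumption; the released evaluation scripts may score answers differently, and BLEU/ROUGE are omitted here because they need external libraries.

```python
import string


def accuracy(preds, golds):
    """Fraction of predictions that equal the gold label."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred, gold):
    """True if prediction and gold answer agree after normalization."""
    return normalize(pred) == normalize(gold)


# Toy multiple-choice predictions: 3 of 4 correct.
print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))
print(exact_match("Not compliant.", "not  compliant"))
```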

Results & Findings

Model	Memorization (Acc.)	Understanding (Acc.)	Application (Acc.)
GPT‑4 (zero‑shot)	92%	78%	61%
LLaMA‑2‑13B (fine‑tuned)	88%	71%	55%
PolicyMoE	90%	77%	71%
Single‑expert fine‑tune	89%	73%	58%

PolicyMoE scores 71% on the hardest “Application” tier, 10 percentage points above zero‑shot GPT‑4 (61%) and 13 points above the single‑expert fine‑tune (58%). On the other two tiers it trails GPT‑4 slightly (90% vs. 92% on memorization, 77% vs. 78% on understanding).
All models perform well on pure memorization (88% to 92%), suggesting that LLMs already encode large amounts of policy text.
Understanding scores lag memorization by 13 to 17 points for every model, indicating that models struggle with nuanced interpretation (e.g., policy intent, trade‑off analysis).
Error analysis shows common failure modes: misidentifying jurisdiction, conflating similar statutes, and overlooking implicit constraints in scenario questions.
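
As a quick check on the gaps discussed above, the per‑tier drops can be computed directly from the table. The scores below are copied from the table; the dictionary layout and function name are illustrative.

```python
# (memorization, understanding, application) accuracy in percent, from the table.
RESULTS = {
    "GPT-4 (zero-shot)": (92, 78, 61),
    "LLaMA-2-13B (fine-tuned)": (88, 71, 55),
    "PolicyMoE": (90, 77, 71),
    "Single-expert fine-tune": (89, 73, 58),
}


def tier_drops(results):
    """Percentage-point drop between tiers for each model."""
    return {
        model: {
            "mem_to_und": mem - und,
            "und_to_app": und - app,
            "mem_to_app": mem - app,
        }
        for model, (mem, und, app) in results.items()
    }


drops = tier_drops(RESULTS)
for model, d in drops.items():
    print(f"{model}: {d}")

# Application-tier margin of PolicyMoE over zero-shot GPT-4.
print(RESULTS["PolicyMoE"][2] - RESULTS["GPT-4 (zero-shot)"][2])
```

On these figures PolicyMoE loses 19 points from memorization to application, against 31 to 33 points for the other three systems.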

Practical Implications

Compliance Assistants: Developers can embed PolicyMoE as a back end for tools that automatically check whether a product, service, or data pipeline complies with relevant regulations (e.g., data‑protection rules that differ between China and the US).
Policy Drafting Support: The model’s “Application” expertise can generate first‑draft impact assessments or suggest policy alternatives, accelerating the legislative research workflow.
Decision‑Support Dashboards: Enterprises can query the system for concise explanations of policy changes (e.g., new emissions standards) and receive structured recommendations on required actions.
Cross‑jurisdictional AI Governance: Because the benchmark already covers two distinct policy ecosystems (US and China), the approach could plausibly be extended to other regulatory regimes, helping multinational firms navigate a patchwork of rules with a single AI service.
Fine‑tuning Blueprint: The MoE routing strategy offers a reusable pattern for any domain where tasks span low‑level fact retrieval to high‑level problem solving (e.g., medical guidelines, financial regulations).

Limitations & Future Work

Jurisdiction Scope: The benchmark currently focuses on the US and China; other legal systems (EU, India, etc.) are not represented.
Static Knowledge: Policy texts evolve rapidly; the model does not incorporate real‑time updates or retrieval‑augmented mechanisms.
Explainability: While PolicyMoE improves performance, the internal reasoning of each expert remains a black box; future work could integrate chain‑of‑thought prompting or symbolic reasoning layers.
Evaluation Diversity: The current tasks are primarily multiple‑choice or short‑answer; richer interactive simulations (e.g., policy negotiation games) could further stress‑test LLMs.
Scalability of MoE: Adding more experts for finer granularity (e.g., sector‑specific experts) may increase latency; research into more efficient routing or sparsity techniques is needed.

Bottom line: PolicyBench and PolicyMoE offer a concrete yardstick and an architectural recipe for building LLMs that go beyond reciting statutes and can reason about policy in ways that matter to developers building compliance, governance, and decision‑support systems.

Authors

Han Bao
Penghao Zhang
Yue Huang
Zhengqing Yuan
Yanchi Ru
Rui Su
Yujun Zhou
Xiangqi Wang
Kehan Guo
Nitesh V Chawla
Yanfang Ye
Xiangliang Zhang

Paper Information

arXiv ID: 2604.12995v1
Categories: cs.CL, cs.CY
Published: April 14, 2026
