[Paper] Decision Quality Evaluation Framework at Pinterest
Source: arXiv - 2602.15809v1
Overview
Pinterest’s safety team needed a reliable way to gauge how well both human moderators and large language model (LLM) agents were enforcing ever‑changing content policies. The authors describe a Decision Quality Evaluation Framework that turns subjective “gut‑feel” checks into a data‑driven, quantitative process, using a high‑trust “Golden Set” as ground truth and intelligent sampling to keep evaluation costs low while preserving confidence.
Key Contributions
- Golden Set (GDS) benchmark: A curated, high‑trust dataset built by subject‑matter experts that serves as the gold standard for decision quality.
- Propensity‑score‑based sampling pipeline: An automated system that selects the most informative moderation cases, dramatically expanding coverage without a linear increase in labeling cost.
- Cost‑performance benchmarking for LLM agents: A systematic method to compare different LLM‑based moderation bots on both expense and accuracy.
- Data‑driven prompt optimization workflow: Quantitative feedback loop that tunes LLM prompts based on measured decision quality.
- Policy‑evolution management: Tools to detect and quantify drift when policies are updated, ensuring historic metrics stay comparable.
- Continuous validation of prevalence metrics: Ongoing checks that keep platform‑wide content‑safety statistics trustworthy.
Methodology
- Golden Set Creation – SMEs manually label a diverse set of content items (images, text, video snippets) according to the latest policy. These labels are treated as ground truth because they undergo multiple rounds of review and consensus building.
- Intelligent Sampling – For the massive daily stream of content, the system computes a propensity score that estimates how likely a piece of content is to be mis‑classified by the current moderation pipeline. Items with high scores are preferentially sent to the Golden Set labeling queue, while low‑risk items are sampled sparsely. This focuses human effort where it matters most.
- Evaluation Loop – Moderation decisions from humans and LLM agents are compared against the Golden Set. Standard metrics (precision, recall, F1) are calculated, and cost metrics (e.g., API calls, human‑hour spend) are logged.
- Benchmarking & Optimization – The framework aggregates results across multiple LLM configurations (different model sizes, temperature settings, prompt templates). Decision quality is plotted against cost, allowing product managers to pick the “sweet spot.” Prompt changes are rolled out, re‑evaluated, and the best‑performing version is promoted.
- Policy Drift Detection – When a policy is revised, the system re‑labels a subset of the Golden Set under the new rules and measures the shift in decision distributions, flagging any unexpected drops in quality.
- Prevalence Metric Validation – Daily prevalence numbers (e.g., % of posts flagged as “spam”) are cross‑checked against the Golden Set to catch systematic bias early.
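The intelligent-sampling step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `risk_score` stands in for whatever model estimates the propensity of misclassification, and `floor_rate` is an assumed parameter that keeps low-risk strata from being ignored entirely.

```python
import random

def propensity_sample(items, risk_score, floor_rate=0.01):
    """Select content items for expert (Golden Set) review.

    High-risk items (those the current pipeline is likely to
    misclassify) are queued with high probability; the rest are
    sampled sparsely at `floor_rate` so low-risk strata still
    receive some coverage.
    """
    queued = []
    for item in items:
        p = risk_score(item)  # estimated P(current moderation decision is wrong)
        # Sample with probability proportional to risk, never below the floor.
        if random.random() < max(p, floor_rate):
            queued.append(item)
    return queued
```

Sampling proportionally to estimated error probability is what concentrates the expert-labeling budget on the most informative cases, which is the mechanism behind the coverage-versus-cost numbers reported below.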
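The evaluation loop's core comparison against the Golden Set can be sketched like this; the dict-based interface and the `positive` label name are illustrative assumptions, not the paper's API.

```python
def evaluate_against_golden(decisions, golden, positive="violation"):
    """Score moderator/LLM decisions against Golden Set labels.

    `decisions` and `golden` both map content_id -> label.
    Only ids present in the Golden Set are scored.
    """
    tp = fp = fn = 0
    for cid, truth in golden.items():
        pred = decisions.get(cid)
        if pred == positive and truth == positive:
            tp += 1          # correctly flagged violation
        elif pred == positive and truth != positive:
            fp += 1          # over-enforcement
        elif pred != positive and truth == positive:
            fn += 1          # missed violation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

In the framework these quality metrics are logged alongside cost metrics (API calls, human-hours), which is what enables the cost-performance benchmarking described next.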
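The policy-drift check can likewise be reduced to a small sketch: relabel the same Golden Set subset under the new policy and measure how many decisions flip. The 5% threshold here is an assumed default, chosen only to mirror the scale of the recall dip reported in the results.

```python
def detect_policy_drift(old_labels, new_labels, threshold=0.05):
    """Flag drift after a policy revision.

    `old_labels` and `new_labels` hold labels for the same Golden
    Set items under the old and new policy versions. Drift is the
    fraction of items whose label changed; exceeding the threshold
    flags the revision for review.
    """
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    drift = changed / len(old_labels)
    return drift, drift > threshold
```

A distributional comparison like this is what lets the framework distinguish an intended policy shift from an unexpected quality drop, as in the ambiguous-rule rollback described in the results.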
Results & Findings
- Sampling Efficiency: Using propensity scores reduced the number of items needing expert review by ~70 % while preserving >95 % coverage of the most error‑prone cases.
- LLM Cost‑Performance Trade‑off: A 13‑billion‑parameter LLM achieved 92 % F1 at 3× the cost of a 6‑billion‑parameter model, but the smaller model’s 86 % F1 met the product’s SLA, leading to a 40 % cost saving.
- Prompt Optimization Gains: Iterative prompt tuning raised the LLM’s precision from 78 % to 88 % without any model size change, demonstrating the value of a data‑driven feedback loop.
- Policy Evolution Impact: After a major policy rewrite, the framework detected a 5 % dip in recall within 48 hours, prompting a rapid rollback of an ambiguous rule clause.
- Metric Integrity: Continuous validation caught a drift where prevalence of “misinformation” was under‑reported by 12 % due to a subtle labeling bias, leading to a corrective re‑training of the moderation classifier.
Practical Implications
- Scalable Quality Assurance: Teams can maintain high moderation standards without linearly scaling human review budgets, thanks to targeted sampling.
- Informed LLM Deployment: Product managers gain a clear ROI view when choosing between LLM providers or model sizes, balancing latency, cost, and safety.
- Rapid Policy Iteration: The framework’s drift detection lets policy teams experiment with new rules and instantly see real‑world impact, shortening the policy‑to‑production cycle.
- Trustworthy Platform Metrics: Continuous validation ensures that public safety dashboards (e.g., “X % of pins removed for hate speech”) remain accurate, which is critical for regulator reporting and user trust.
- Reusable Blueprint: The modular design (Golden Set, sampling engine, evaluation dashboard) can be transplanted to other content‑moderation pipelines—social networks, marketplaces, or comment sections—accelerating their safety engineering efforts.
Limitations & Future Work
- Golden Set Maintenance: Keeping the expert‑curated set up‑to‑date requires ongoing SME effort; the authors note a risk of the set lagging behind fast‑moving policy changes.
- Sampling Bias Risks: While propensity scoring focuses on high‑risk items, it may inadvertently overlook emerging abuse patterns that have low initial scores.
- LLM Explainability: The framework evaluates outcomes but does not yet provide insight into why an LLM made a particular decision, limiting debugging of model behavior.
- Cross‑Modal Extension: Current work focuses mainly on image and text; extending the pipeline to video and audio moderation is earmarked for future research.
- Open‑Source Tooling: The authors plan to release parts of the sampling and evaluation stack as open‑source components to foster community adoption and peer validation.
Authors
- Yuqi Tian
- Robert Paine
- Attila Dobi
- Kevin O’Sullivan
- Aravindh Manickavasagam
- Faisal Farooq
Paper Information
- arXiv ID: 2602.15809v1
- Categories: stat.AP, cs.AI
- Published: February 17, 2026