[Paper] Decision Quality Evaluation Framework at Pinterest
Source: arXiv - 2602.15809v1
Overview
Pinterest’s safety team needed a reliable way to gauge how well both human moderators and large‑language‑model (LLM) agents were enforcing ever‑changing content policies.
The authors describe a Decision Quality Evaluation Framework that:
- Turns subjective “gut‑feel” checks into a data‑driven, quantitative process.
- Uses a high‑trust Golden Set as ground truth.
- Employs intelligent sampling to keep evaluation costs low while preserving confidence.
Key Contributions
- Golden Set (GDS) benchmark – a curated, high‑trust dataset built by subject‑matter experts (SMEs) that serves as the gold standard for decision quality.
- Propensity‑score‑based sampling pipeline – an automated system that selects the most informative moderation cases, dramatically expanding coverage without a linear increase in labeling cost.
- Cost‑performance benchmarking for LLM agents – a systematic method to compare different LLM‑based moderation bots on both expense and accuracy.
- Data‑driven prompt optimization workflow – a quantitative feedback loop that tunes LLM prompts based on measured decision quality.
- Policy‑evolution management – tools to detect and quantify drift when policies are updated, ensuring historical metrics stay comparable.
- Continuous validation of prevalence metrics – ongoing checks that keep platform‑wide content‑safety statistics trustworthy.
Methodology
Golden Set Creation
- SMEs manually label a diverse set of content items (images, text, video snippets) according to the latest policy.
- These labels serve as ground truth because they undergo multiple rounds of review and consensus building, along the lines of the sketch below.
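The paper does not spell out the exact consensus rule, so the following is a minimal sketch of how multi‑round SME agreement might be aggregated; the label strings, the 0.75 agreement threshold, and the escalate‑on‑disagreement behavior are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def consensus_label(labels: list[str], threshold: float = 0.75) -> str | None:
    """Return the majority label if agreement meets the threshold.

    None signals escalation to another round of SME review.
    (The 0.75 threshold is an assumption, not from the paper.)
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= threshold else None

# Three of four reviewers agree, meeting the 0.75 threshold.
print(consensus_label(["spam", "spam", "spam", "benign"]))  # -> "spam"
# A 1-1 split falls short, so the item is escalated rather than trusted.
print(consensus_label(["spam", "benign"]))                  # -> None
```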
Intelligent Sampling
- For the massive daily stream of content, the system computes a propensity score that estimates how likely a piece of content is to be mis‑classified by the current moderation pipeline.
- Items with high scores are preferentially sent to the Golden Set labeling queue, while low‑risk items are sampled sparsely.
- This focuses human effort where it matters most; a minimal sampling sketch follows.
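A minimal sketch of what propensity‑weighted selection could look like, assuming the score is a misclassification likelihood in [0, 1]; the floor rate, the toy scorer, and the field names are assumptions rather than details from the paper.

```python
import random

def select_for_review(items, propensity, floor=0.01):
    """Sample each item for expert review with probability
    max(propensity(item), floor): high-risk items are queued preferentially,
    low-risk items are still sampled sparsely (the floor is an assumption)."""
    for item in items:
        if random.random() < max(propensity(item), floor):
            yield item

def toy_propensity(item):
    # Stand-in scorer: outputs near the decision boundary (score ~ 0.5)
    # are the most likely to be misclassified.
    return 1.0 - 2.0 * abs(item["model_score"] - 0.5)

stream = [{"id": i, "model_score": random.random()} for i in range(10_000)]
queued = list(select_for_review(stream, toy_propensity))
print(f"queued {len(queued)} of {len(stream)} items for expert labeling")
```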
Evaluation Loop
- Moderation decisions from humans and LLM agents are compared against the Golden Set.
- Standard metrics (precision, recall, F1) are calculated, and cost metrics (e.g., API calls, human review hours) are logged; a minimal evaluation sketch follows.
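A minimal sketch of that comparison, assuming binary decisions where 1 means "policy‑violating"; the metric formulas are standard, and the cost field is illustrative.

```python
def evaluate(decisions, golden, cost_usd):
    """Score one moderator (human or LLM agent) against Golden Set labels."""
    tp = sum(1 for d, g in zip(decisions, golden) if d == 1 and g == 1)
    fp = sum(1 for d, g in zip(decisions, golden) if d == 1 and g == 0)
    fn = sum(1 for d, g in zip(decisions, golden) if d == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "cost_usd": cost_usd}

# Example: five agent decisions vs. Golden Set ground truth.
print(evaluate([1, 1, 0, 1, 0], [1, 0, 0, 1, 1], cost_usd=4.20))
```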
Benchmarking & Optimization
- The framework aggregates results across multiple LLM configurations (different model sizes, temperature settings, prompt templates).
- Decision quality is plotted against cost, allowing product managers to pick the “sweet spot.”
- Prompt variants are rolled out and re‑evaluated, and the best‑performing version is promoted (see the selection sketch below).
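A minimal sketch of the sweet‑spot pick: among configurations that meet a quality SLA, choose the cheapest. The configurations loosely echo the numbers reported in the paper, while the SLA threshold and tie‑breaking rule are assumptions.

```python
# Illustrative configurations loosely echoing the paper's reported numbers.
configs = [
    {"name": "13B",              "f1": 0.92, "cost_per_1k_items": 3.0},
    {"name": "6B",               "f1": 0.86, "cost_per_1k_items": 1.0},
    {"name": "6B, tuned prompt", "f1": 0.88, "cost_per_1k_items": 1.0},
]

def sweet_spot(configs, f1_sla=0.85):
    """Cheapest configuration meeting the SLA; ties broken by higher F1."""
    eligible = [c for c in configs if c["f1"] >= f1_sla]
    return min(eligible, key=lambda c: (c["cost_per_1k_items"], -c["f1"]))

print(sweet_spot(configs))  # -> the tuned 6B config in this toy example
```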
Policy Drift Detection
- When a policy is revised, the system re‑labels a subset of the Golden Set under the new rules.
- It then measures the shift in decision distributions, flagging any unexpected drops in quality; a drift‑check sketch follows.
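A minimal sketch of the drift check, assuming old and new labels for the relabeled subset are aligned item by item; the flip‑rate alert threshold is an assumption.

```python
from collections import Counter

def policy_drift(old_labels, new_labels, alert_threshold=0.05):
    """Measure how many Golden Set decisions flip under the revised policy."""
    flips = sum(1 for o, n in zip(old_labels, new_labels) if o != n)
    flip_rate = flips / len(old_labels)
    return {"flip_rate": flip_rate,
            "old_dist": dict(Counter(old_labels)),
            "new_dist": dict(Counter(new_labels)),
            "alert": flip_rate > alert_threshold}

# One of four labels changes under the new rules -> 25% flip rate, alert fires.
print(policy_drift(["ok", "ok", "spam", "ok"], ["ok", "spam", "spam", "ok"]))
```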
Prevalence Metric Validation
- Daily prevalence numbers (e.g., % of posts flagged as “spam”) are cross‑checked against the Golden Set to catch systematic bias early; a validation sketch follows.
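A minimal sketch of that cross‑check, assuming an audit sample labeled against the Golden Set; the 10 % relative tolerance is an assumption (the paper reports catching a 12 % under‑report of “misinformation” prevalence).

```python
def validate_prevalence(pipeline_rate, audit_labels, tolerance=0.10):
    """Flag when the pipeline's reported prevalence deviates from the rate
    observed in a Golden-Set-audited sample (1 = item is violating)."""
    audited_rate = sum(audit_labels) / len(audit_labels)
    relative_gap = abs(pipeline_rate - audited_rate) / max(audited_rate, 1e-9)
    return {"audited_rate": audited_rate,
            "relative_gap": relative_gap,
            "flag": relative_gap > tolerance}

# Pipeline reports 1.0% prevalence; the audited sample shows ~1.14%,
# a ~12% relative gap, so the check fires.
print(validate_prevalence(0.010, [1] * 114 + [0] * 9886))
```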
Results & Findings
- Sampling Efficiency: Using propensity scores reduced the number of items needing expert review by ~70 % while preserving >95 % coverage of the most error‑prone cases.
- LLM Cost‑Performance Trade‑off: A 13‑billion‑parameter LLM achieved 92 % F1 at 3× the cost of a 6‑billion‑parameter model; the smaller model’s 86 % F1 met the product’s SLA, delivering a 40 % cost saving.
- Prompt Optimization Gains: Iterative prompt tuning raised the LLM’s precision from 78 % to 88 % without any model‑size change, demonstrating the value of a data‑driven feedback loop.
- Policy Evolution Impact: After a major policy rewrite, the framework detected a 5 % dip in recall within 48 hours, prompting a rapid rollback of an ambiguous rule clause.
- Metric Integrity: Continuous validation caught a drift where the prevalence of “misinformation” was under‑reported by 12 % due to a subtle labeling bias, leading to a corrective re‑training of the moderation classifier.
Practical Implications
- Scalable Quality Assurance – Teams can keep high moderation standards without linearly increasing human‑review budgets, thanks to targeted sampling.
- Informed LLM Deployment – Product managers obtain a clear ROI view when selecting LLM providers or model sizes, balancing latency, cost, and safety.
- Rapid Policy Iteration – The framework’s drift‑detection lets policy teams experiment with new rules and instantly see real‑world impact, shortening the policy‑to‑production cycle.
- Trustworthy Platform Metrics – Continuous validation keeps public‑safety dashboards (e.g., “X % of pins removed for hate speech”) accurate, which is essential for regulatory reporting and user trust.
- Reusable Blueprint – The modular design (Golden Set, sampling engine, evaluation dashboard) can be transplanted to other content‑moderation pipelines—social networks, marketplaces, or comment sections—accelerating their safety‑engineering efforts.
Limitations & Future Work
- Golden‑Set Maintenance – Keeping the expert‑curated set up‑to‑date requires ongoing SME effort; there is a risk that the set may lag behind fast‑moving policy changes.
- Sampling‑Bias Risks – Propensity scoring focuses on high‑risk items, which may inadvertently overlook emerging abuse patterns that initially receive low scores.
- LLM Explainability – The framework evaluates outcomes but does not yet reveal why an LLM made a particular decision, limiting the ability to debug model behavior.
- Cross‑Modal Extension – Current work concentrates on image and text moderation; extending the pipeline to video and audio is earmarked for future research.
- Open‑Source Tooling – The authors plan to release parts of the sampling and evaluation stack as open‑source components to foster community adoption and peer validation.
Authors
- Yuqi Tian
- Robert Paine
- Attila Dobi
- Kevin O’Sullivan
- Aravindh Manickavasagam
- Faisal Farooq
Paper Information
| Field | Details |
|---|---|
| arXiv ID | 2602.15809v1 |
| Categories | stat.AP, cs.AI |
| Published | February 17, 2026 |