[Paper] Decision Quality Evaluation Framework at Pinterest
Source: arXiv - 2602.15809v1
Overview
Pinterest’s safety team needed a reliable way to gauge how well both human moderators and large language model (LLM) agents were enforcing ever‑changing content policies. The authors describe a Decision Quality Evaluation Framework that turns subjective “gut‑feel” checks into a data‑driven, quantitative process, using a high‑trust “Golden Set” as ground truth and intelligent sampling to keep evaluation costs low while preserving confidence.
Key Contributions
- Golden Set (GDS) benchmark: A curated, high‑trust dataset built by subject‑matter experts that serves as the gold standard for decision quality.
- Propensity‑score‑based sampling pipeline: An automated system that selects the most informative moderation cases, dramatically expanding coverage without a linear increase in labeling cost.
- Cost‑performance benchmarking for LLM agents: A systematic method to compare different LLM‑based moderation bots on both expense and accuracy.
- Data‑driven prompt optimization workflow: Quantitative feedback loop that tunes LLM prompts based on measured decision quality.
- Policy‑evolution management: Tools to detect and quantify drift when policies are updated, ensuring historic metrics stay comparable.
- Continuous validation of prevalence metrics: Ongoing checks that keep platform‑wide content‑safety statistics trustworthy.
Methodology
- Golden Set Creation – SMEs manually label a diverse set of content items (images, text, video snippets) according to the latest policy. These labels are treated as ground truth because they undergo multiple rounds of review and consensus building.
- Intelligent Sampling – For the massive daily stream of content, the system computes a propensity score that estimates how likely a piece of content is to be mis‑classified by the current moderation pipeline. Items with high scores are preferentially sent to the Golden Set labeling queue, while low‑risk items are sampled sparsely. This focuses human effort where it matters most.
- Evaluation Loop – Moderation decisions from humans and LLM agents are compared against the Golden Set. Standard metrics (precision, recall, F1) are calculated, and cost metrics (e.g., API calls, human‑hour spend) are logged.
- Benchmarking & Optimization – The framework aggregates results across multiple LLM configurations (different model sizes, temperature settings, prompt templates). Decision quality is plotted against cost, allowing product managers to pick the “sweet spot.” Prompt changes are rolled out, re‑evaluated, and the best‑performing version is promoted.
- Policy Drift Detection – When a policy is revised, the system re‑labels a subset of the Golden Set under the new rules and measures the shift in decision distributions, flagging any unexpected drops in quality.
- Prevalence Metric Validation – Daily prevalence numbers (e.g., % of posts flagged as “spam”) are cross‑checked against the Golden Set to catch systematic bias early.
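The intelligent-sampling step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `risk_score` stands in for whatever model estimates the propensity of misclassification, and `floor_rate` is an assumed parameter that keeps low-risk strata from being ignored entirely.

```python
import random

def propensity_sample(items, risk_score, floor_rate=0.01):
    """Select content items for expert (Golden Set) review.

    High-risk items (those the current pipeline is likely to
    misclassify) are queued with high probability; the rest are
    sampled sparsely at `floor_rate` so low-risk strata still
    receive some coverage.
    """
    queued = []
    for item in items:
        p = risk_score(item)  # estimated P(current moderation decision is wrong)
        # Sample with probability proportional to risk, never below the floor.
        if random.random() < max(p, floor_rate):
            queued.append(item)
    return queued
```

Sampling proportionally to estimated error probability is what concentrates the expert-labeling budget on the most informative cases, which is the mechanism behind the coverage-versus-cost numbers reported below.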
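The evaluation loop's core comparison against the Golden Set can be sketched like this; the dict-based interface and the `positive` label name are illustrative assumptions, not the paper's API.

```python
def evaluate_against_golden(decisions, golden, positive="violation"):
    """Score moderator/LLM decisions against Golden Set labels.

    `decisions` and `golden` both map content_id -> label.
    Only ids present in the Golden Set are scored.
    """
    tp = fp = fn = 0
    for cid, truth in golden.items():
        pred = decisions.get(cid)
        if pred == positive and truth == positive:
            tp += 1          # correctly flagged violation
        elif pred == positive and truth != positive:
            fp += 1          # over-enforcement
        elif pred != positive and truth == positive:
            fn += 1          # missed violation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

In the framework these quality metrics are logged alongside cost metrics (API calls, human-hours), which is what enables the cost-performance benchmarking described next.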
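The policy-drift check can likewise be reduced to a small sketch: relabel the same Golden Set subset under the new policy and measure how many decisions flip. The 5% threshold here is an assumed default, chosen only to mirror the scale of the recall dip reported in the results.

```python
def detect_policy_drift(old_labels, new_labels, threshold=0.05):
    """Flag drift after a policy revision.

    `old_labels` and `new_labels` hold labels for the same Golden
    Set items under the old and new policy versions. Drift is the
    fraction of items whose label changed; exceeding the threshold
    flags the revision for review.
    """
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    drift = changed / len(old_labels)
    return drift, drift > threshold
```

A distributional comparison like this is what lets the framework distinguish an intended policy shift from an unexpected quality drop, as in the ambiguous-rule rollback described in the results.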
Results & Findings
- Sampling Efficiency: Using propensity scores reduced the number of items needing expert review by ~70 % while preserving >95 % coverage of the most error‑prone cases.
- LLM Cost‑Performance Trade‑off: A 13‑billion‑parameter LLM achieved 92 % F1 at 3× the cost of a 6‑billion‑parameter model, but the smaller model’s 86 % F1 met the product’s SLA, leading to a 40 % cost saving.
- Prompt Optimization Gains: Iterative prompt tuning raised the LLM’s precision from 78 % to 88 % without any model size change, demonstrating the value of a data‑driven feedback loop.
- Policy Evolution Impact: After a major policy rewrite, the framework detected a 5 % dip in recall within 48 hours, prompting a rapid rollback of an ambiguous rule clause.
- Metric Integrity: Continuous validation caught a drift where prevalence of “misinformation” was under‑reported by 12 % due to a subtle labeling bias, leading to a corrective re‑training of the moderation classifier.
Practical Implications
- Scalable Quality Assurance: Teams can maintain high moderation standards without linearly scaling human review budgets, thanks to targeted sampling.
- Informed LLM Deployment: Product managers gain a clear ROI view when choosing between LLM providers or model sizes, balancing latency, cost, and safety.
- Rapid Policy Iteration: The framework’s drift detection lets policy teams experiment with new rules and instantly see real‑world impact, shortening the policy‑to‑production cycle.
- Trustworthy Platform Metrics: Continuous validation ensures that public safety dashboards (e.g., “X % of pins removed for hate speech”) remain accurate, which is critical for regulator reporting and user trust.
- Reusable Blueprint: The modular design (Golden Set, sampling engine, evaluation dashboard) can be transplanted to other content‑moderation pipelines—social networks, marketplaces, or comment sections—accelerating their safety engineering efforts.
Limitations & Future Work
- Golden Set Maintenance: Keeping the expert‑curated set up‑to‑date requires ongoing SME effort; the authors note a risk of the set lagging behind fast‑moving policy changes.
- Sampling Bias Risks: While propensity scoring focuses on high‑risk items, it may inadvertently overlook emerging abuse patterns that have low initial scores.
- LLM Explainability: The framework evaluates outcomes but does not yet provide insight into why an LLM made a particular decision, limiting debugging of model behavior.
- Cross‑Modal Extension: Current work focuses mainly on image and text; extending the pipeline to video and audio moderation is earmarked for future research.
- Open‑Source Tooling: The authors plan to release parts of the sampling and evaluation stack as open‑source components to foster community adoption and peer validation.
Authors
- Yuqi Tian
- Robert Paine
- Attila Dobi
- Kevin O’Sullivan
- Aravindh Manickavasagam
- Faisal Farooq
Paper Information
- arXiv ID: 2602.15809v1
- Categories: stat.AP, cs.AI
- Published: February 17, 2026