[Paper] Decision Quality Evaluation Framework at Pinterest

Published: February 17, 2026
Source: arXiv - 2602.15809v1

Overview

Pinterest’s safety team needed a reliable way to gauge how well both human moderators and large‑language‑model (LLM) agents were enforcing ever‑changing content policies.

The authors describe a Decision Quality Evaluation Framework that:

  • Turns subjective “gut‑feel” checks into a data‑driven, quantitative process.
  • Uses a high‑trust Golden Set as ground truth.
  • Employs intelligent sampling to keep evaluation costs low while preserving confidence.

Key Contributions

  • Golden Set (GDS) benchmark – a curated, high‑trust dataset built by subject‑matter experts that serves as the gold standard for decision quality.
  • Propensity‑score‑based sampling pipeline – an automated system that selects the most informative moderation cases, dramatically expanding coverage without a linear increase in labeling cost.
  • Cost‑performance benchmarking for LLM agents – a systematic method to compare different LLM‑based moderation bots on both expense and accuracy.
  • Data‑driven prompt optimization workflow – a quantitative feedback loop that tunes LLM prompts based on measured decision quality.
  • Policy‑evolution management – tools to detect and quantify drift when policies are updated, ensuring historic metrics stay comparable.
  • Continuous validation of prevalence metrics – ongoing checks that keep platform‑wide content‑safety statistics trustworthy.

Methodology

  1. Golden Set Creation

    • SMEs manually label a diverse set of content items (images, text, video snippets) according to the latest policy.
    • These labels serve as ground truth because they undergo multiple rounds of review and consensus building.
  2. Intelligent Sampling

    • For the massive daily stream of content, the system computes a propensity score that estimates how likely a piece of content is to be mis‑classified by the current moderation pipeline.
    • Items with high scores are preferentially sent to the Golden Set labeling queue, while low‑risk items are sampled sparsely.
    • This focuses human effort where it matters most.
  3. Evaluation Loop

    • Moderation decisions from humans and LLM agents are compared against the Golden Set.
    • Standard metrics (precision, recall, F1) are calculated, and cost metrics (e.g., API calls, human‑hour spend) are logged.
  4. Benchmarking & Optimization

    • The framework aggregates results across multiple LLM configurations (different model sizes, temperature settings, prompt templates).
    • Decision quality is plotted against cost, allowing product managers to pick the “sweet spot.”
    • Prompt changes are rolled out, re‑evaluated, and the best‑performing version is promoted.
  5. Policy Drift Detection

    • When a policy is revised, the system re‑labels a subset of the Golden Set under the new rules.
    • It then measures the shift in decision distributions, flagging any unexpected drops in quality.
  6. Prevalence Metric Validation

    • Daily prevalence numbers (e.g., % of posts flagged as “spam”) are cross‑checked against the Golden Set to catch systematic bias early.
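The consensus building in step 1 can be sketched as a simple agreement rule over SME labels. The 2/3 agreement bar and the label vocabulary below are hypothetical illustrations, not the paper's actual review protocol:

```python
from collections import Counter

def consensus_label(expert_labels, min_agreement=2 / 3):
    """Accept a Golden Set label only when enough SMEs agree.

    `expert_labels` is the list of labels proposed by reviewers for one
    item. Returns the majority label, or None when agreement falls below
    the bar and the item needs another review round.
    """
    label, votes = Counter(expert_labels).most_common(1)[0]
    if votes / len(expert_labels) >= min_agreement:
        return label
    return None  # no consensus yet; escalate for further review
```

Items that return `None` would cycle back through the "multiple rounds of review" the paper describes until consensus is reached.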
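Step 2's propensity-based routing might look like the following sketch. The `propensity` scorer, the high-score threshold, and the background sampling rate are all assumed placeholders for Pinterest's internal model:

```python
import random

def route_for_review(items, propensity, high_threshold=0.7,
                     low_sample_rate=0.05, rng=None):
    """Send likely-misclassified items to expert labeling; sparsely
    sample the rest.

    `propensity(item)` is a hypothetical scorer estimating how likely
    the current moderation pipeline is to mis-classify `item`.
    """
    rng = rng or random.Random(0)
    queue, skipped = [], []
    for item in items:
        if propensity(item) >= high_threshold or rng.random() < low_sample_rate:
            queue.append(item)    # goes to the Golden Set labeling queue
        else:
            skipped.append(item)  # low-risk: sampled sparsely
    return queue, skipped
```

High-scoring items are always queued, which concentrates human effort on the error-prone cases, while the small background rate keeps some visibility into the low-risk stream.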
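The evaluation loop in step 3 reduces to standard classification metrics computed over the Golden Set. This minimal sketch assumes a two-label vocabulary ("violation"/"benign") purely for illustration:

```python
def decision_quality(decisions, golden):
    """Compare moderator or LLM-agent decisions against Golden Set labels.

    Both arguments map item_id -> label. Returns precision, recall,
    and F1 for the "violation" class.
    """
    tp = fp = fn = 0
    for item_id, truth in golden.items():
        pred = decisions.get(item_id)
        if pred == "violation" and truth == "violation":
            tp += 1
        elif pred == "violation" and truth != "violation":
            fp += 1
        elif pred != "violation" and truth == "violation":
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In the framework these scores would be logged alongside the cost metrics (API calls, human-hour spend) for each evaluated configuration.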
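The "sweet spot" selection in step 4 can be illustrated as picking the cheapest LLM configuration that still clears the quality SLA. The config fields and cost figures here are hypothetical:

```python
def pick_sweet_spot(configs, min_f1):
    """Among LLM configurations meeting the F1 SLA, pick the cheapest.

    `configs` is a list of dicts with hypothetical keys "name", "f1",
    and "cost_per_1k" (cost per thousand moderation calls).
    """
    eligible = [c for c in configs if c["f1"] >= min_f1]
    if not eligible:
        raise ValueError("no configuration meets the F1 SLA")
    return min(eligible, key=lambda c: c["cost_per_1k"])
```

With the paper's reported numbers, an SLA below 86 % F1 would select the smaller, cheaper model; tightening the SLA above that would force the larger one.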
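Steps 5 and 6 both boil down to comparing label distributions against the Golden Set. A minimal drift check, with a hypothetical five-point tolerance in place of whatever threshold the authors use, might look like:

```python
def detect_drift(old_labels, new_labels, tolerance=0.05):
    """Flag a policy revision whose re-labeled Golden Set subset shifts
    the violation rate by more than `tolerance`.

    `old_labels` and `new_labels` are label lists for the same items
    under the old and revised policy. Returns (flagged, shift).
    """
    def violation_rate(labels):
        return sum(label == "violation" for label in labels) / len(labels)

    shift = abs(violation_rate(new_labels) - violation_rate(old_labels))
    return shift > tolerance, shift
```

The same comparison, run between daily prevalence numbers and Golden Set prevalence, would serve as the continuous validation described in step 6.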

Results & Findings

  • Sampling Efficiency: Using propensity scores reduced the number of items needing expert review by ~70 % while preserving >95 % coverage of the most error‑prone cases.
  • LLM Cost‑Performance Trade‑off: A 13‑billion‑parameter LLM reached 92 % F1, but the cheaper 6‑billion‑parameter model’s 86 % F1 still met the product’s SLA, delivering a 40 % cost saving.
  • Prompt Optimization Gains: Iterative prompt tuning raised the LLM’s precision from 78 % to 88 % without any model‑size change, demonstrating the value of a data‑driven feedback loop.
  • Policy Evolution Impact: After a major policy rewrite, the framework detected a 5 % dip in recall within 48 hours, prompting a rapid rollback of an ambiguous rule clause.
  • Metric Integrity: Continuous validation caught a drift where the prevalence of “misinformation” was under‑reported by 12 % due to a subtle labeling bias, leading to a corrective re‑training of the moderation classifier.

Practical Implications

  • Scalable Quality Assurance – Teams can keep high moderation standards without linearly increasing human‑review budgets, thanks to targeted sampling.
  • Informed LLM Deployment – Product managers obtain a clear ROI view when selecting LLM providers or model sizes, balancing latency, cost, and safety.
  • Rapid Policy Iteration – The framework’s drift‑detection lets policy teams experiment with new rules and instantly see real‑world impact, shortening the policy‑to‑production cycle.
  • Trustworthy Platform Metrics – Continuous validation keeps public‑safety dashboards (e.g., “X % of pins removed for hate speech”) accurate, which is essential for regulator reporting and user trust.
  • Reusable Blueprint – The modular design (Golden Set, sampling engine, evaluation dashboard) can be transplanted to other content‑moderation pipelines—social networks, marketplaces, or comment sections—accelerating their safety‑engineering efforts.

Limitations & Future Work

  • Golden‑Set Maintenance – Keeping the expert‑curated set up‑to‑date requires ongoing SME effort; there is a risk that the set may lag behind fast‑moving policy changes.
  • Sampling‑Bias Risks – Propensity scoring focuses on high‑risk items, which may inadvertently overlook emerging abuse patterns that initially receive low scores.
  • LLM Explainability – The framework evaluates outcomes but does not yet reveal why an LLM made a particular decision, limiting the ability to debug model behavior.
  • Cross‑Modal Extension – Current work concentrates on image and text moderation; extending the pipeline to video and audio is earmarked for future research.
  • Open‑Source Tooling – The authors plan to release parts of the sampling and evaluation stack as open‑source components to foster community adoption and peer validation.

Authors

  • Yuqi Tian
  • Robert Paine
  • Attila Dobi
  • Kevin O’Sullivan
  • Aravindh Manickavasagam
  • Faisal Farooq

Paper Information

arXiv ID: 2602.15809v1
Categories: stat.AP, cs.AI
Published: February 17, 2026
