[Paper] GitSearch: Enhancing Community Notes Generation with Gap-Informed Targeted Search

Published: February 9, 2026
Source: arXiv - 2602.08945v1

Overview

The paper presents GitSearch, a “gap‑informed” system that automatically generates Community Notes, the crowd‑sourced fact‑checking annotations used on platforms such as X/Twitter. The system first identifies what information is missing from a post, then retrieves targeted evidence from the web, and finally drafts a platform‑compliant note. This approach dramatically expands coverage while producing notes that evaluators often preferred over human‑written ones.

Key Contributions

  • Gap‑Informed Retrieval Pipeline – Introduces a three‑stage workflow (gap detection → targeted web search → note synthesis) that treats human‑perceived information deficits as first‑class signals.
  • PolBench Dataset – Releases a large‑scale benchmark (78,698 U.S. political tweets paired with Community Notes) for evaluating automated note‑generation systems.
  • Coverage Boost – Achieves ~99 % coverage of tweets needing notes, nearly double the ~52 % achieved by the best prior AI baseline.
  • Quality Gains – Outperforms human‑authored helpful notes in head‑to‑head A/B tests (69 % win rate) and attains higher helpfulness scores (3.87 vs. 3.36 on a 5‑point scale).
  • Real‑Time Targeted Search – Demonstrates that on‑the‑fly web retrieval, guided by identified gaps, can supply reliable evidence without pre‑indexed corpora.

Methodology

  1. Gap Identification – A lightweight classifier scans a tweet and flags specific deficits (e.g., missing context, unsupported claim, ambiguous source). The output is a structured “gap schema” that the rest of the system consumes (a minimal sketch of this stage follows the list).
  2. Targeted Web Retrieval – For each gap, GitSearch formulates a concise, intent‑rich query (e.g., “official statement on X policy March 2024”) and sends it to a commercial search API. The top‑k results are re‑ranked by a relevance model that weighs factual alignment and source credibility.
  3. Note Synthesis – A language model fine‑tuned on existing Community Notes takes the original tweet, the gap schema, and the retrieved snippets, then generates a note that satisfies the platform’s style constraints (concise, neutral, citation‑rich). A post‑processing filter rejects notes with policy violations (e.g., hate speech, personal attacks). Stages 2 and 3 are sketched in the second example below.
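
The paper does not include an implementation, so the following Python sketch only illustrates what the gap‑identification stage could look like: the gap taxonomy, class names, and the `detect_gaps` interface are all assumptions, not the authors' code.

```python
from dataclasses import dataclass, field
from enum import Enum

class GapType(Enum):
    # Illustrative taxonomy; the paper's actual categories may differ.
    # Teams could extend it (e.g., MISSING_LEGAL_CONTEXT) without
    # retraining the rest of the system.
    MISSING_CONTEXT = "missing_context"
    UNSUPPORTED_CLAIM = "unsupported_claim"
    AMBIGUOUS_SOURCE = "ambiguous_source"

@dataclass
class Gap:
    gap_type: GapType   # which deficit was flagged
    span: str           # the tweet text the gap refers to
    rationale: str      # why the classifier flagged it

@dataclass
class GapSchema:
    # Structured output of stage 1, consumed downstream by retrieval
    # and synthesis.
    tweet_id: str
    gaps: list[Gap] = field(default_factory=list)

def detect_gaps(tweet_id: str, text: str, classifier) -> GapSchema:
    """Run a lightweight classifier over the tweet and collect flagged
    deficits. `classifier` is assumed to yield (type, span, rationale)
    triples; the real model's interface is not described in the paper."""
    schema = GapSchema(tweet_id=tweet_id)
    for gap_type, span, rationale in classifier(text):
        schema.gaps.append(Gap(GapType(gap_type), span, rationale))
    return schema
```

Because the schema is plain data, extending the taxonomy (as the Practical Implications section suggests) is an additive change rather than a retraining job.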
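Building on those types, stages 2 and 3 might be wired together as below; the query templates, the `search`, `rerank`, `llm`, and `policy_filter` callables, and the prompt format are likewise illustrative assumptions.

```python
def formulate_query(gap: Gap) -> str:
    """Turn one gap into a concise, intent-rich search query.
    These templates are invented for illustration."""
    templates = {
        GapType.MISSING_CONTEXT: "background and context on {span}",
        GapType.UNSUPPORTED_CLAIM: "evidence for the claim {span}",
        GapType.AMBIGUOUS_SOURCE: "original source of {span}",
    }
    return templates[gap.gap_type].format(span=gap.span)

def retrieve_evidence(schema: GapSchema, search, rerank, k: int = 5) -> list[dict]:
    """Stage 2: query a commercial search API per gap, then re-rank the
    top-k hits by factual alignment and source credibility. `search`
    and `rerank` stand in for the API client and relevance model."""
    evidence = []
    for gap in schema.gaps:
        hits = search(formulate_query(gap))[:k]
        evidence.extend(rerank(gap, hits))
    return evidence

def synthesize_note(tweet_text: str, schema: GapSchema,
                    evidence: list[dict], llm, policy_filter) -> str | None:
    """Stage 3: prompt a note-writing model with the tweet, the gaps,
    and the snippets; suppress the note if the policy filter flags it."""
    prompt = (
        f"Tweet: {tweet_text}\n"
        f"Gaps: {[g.gap_type.value for g in schema.gaps]}\n"
        f"Evidence: {[e['snippet'] for e in evidence]}\n"
        "Write a concise, neutral, citation-rich Community Note."
    )
    note = llm(prompt)
    return note if policy_filter(note) else None
```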

All components run in a streaming fashion, keeping end‑to‑end latency low enough for real‑time moderation pipelines.

Results & Findings

  • Coverage: GitSearch produces notes for 99 % of the 78,698 tweets in PolBench, compared to ~52 % for the previous state‑of‑the‑art model.
  • Helpfulness: In a crowd‑sourced evaluation, the system’s notes received an average helpfulness rating of 3.87/5, surpassing the human baseline of 3.36/5.
  • Win‑Rate Against Humans: In pairwise comparisons, GitSearch notes were preferred over human‑written notes 69 % of the time.
  • Retrieval Effectiveness: The targeted search module achieved a precision@5 of 0.78 for gap‑relevant evidence, indicating that most retrieved snippets were directly usable for note composition (a worked example of the metric follows this list).
  • Latency: End‑to‑end processing averaged 1.8 seconds per tweet, fitting comfortably within typical moderation time windows.
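
For reference, precision@5 is simply the fraction of the top five retrieved snippets judged relevant to the gap; a minimal computation (the sample judgments are made up):

```python
def precision_at_k(relevant: list[bool], k: int = 5) -> float:
    """Fraction of the top-k retrieved snippets judged relevant."""
    top_k = relevant[:k]
    return sum(top_k) / len(top_k)

# Hypothetical judgments for one query: 4 of the top 5 snippets are
# usable, giving 0.8 -- near the 0.78 average the paper reports.
print(precision_at_k([True, True, False, True, True]))  # 0.8
```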

Practical Implications

  • Scalable Fact‑Checking – Platforms can deploy GitSearch to automatically generate high‑quality Community Notes for the vast majority of posts, reducing reliance on volunteer moderators.
  • Developer Integration – The pipeline is API‑first: gap detection, search, and synthesis are exposed as micro‑services, making it straightforward to plug into existing moderation stacks or content‑creation tools (a hypothetical client sketch follows this list).
  • Improved User Trust – By delivering timely, evidence‑backed notes, platforms can curb misinformation spread faster, potentially lowering the virality of false claims.
  • Customizable Gap Schemas – Organizations can extend the gap taxonomy (e.g., “missing legal context”) to suit domain‑specific moderation policies without retraining the whole system.
  • Open Benchmark – PolBench gives engineers a concrete dataset to benchmark their own retrieval‑augmented generation (RAG) models against a real‑world moderation task.
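
The paper frames the pipeline as API‑first but does not document concrete endpoints, so the client sketch below is purely hypothetical: the base URL, routes, and payload fields are invented to show how the three services could be chained inside a moderation stack.

```python
import requests

BASE = "https://gitsearch.example.internal"  # hypothetical deployment

def generate_note(tweet_id: str, text: str) -> str | None:
    """Chain the three hypothetical services:
    gap detection -> targeted search -> note synthesis."""
    schema = requests.post(f"{BASE}/v1/gaps",
                           json={"tweet_id": tweet_id, "text": text},
                           timeout=5).json()
    if not schema["gaps"]:
        return None  # nothing worth annotating

    evidence = requests.post(f"{BASE}/v1/search",
                             json={"schema": schema}, timeout=5).json()

    note = requests.post(f"{BASE}/v1/notes",
                         json={"text": text, "schema": schema,
                               "evidence": evidence}, timeout=5).json()
    return note.get("note")  # absent if the policy filter rejected it
```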

Limitations & Future Work

  • Source Reliability – The current system trusts top‑ranked search results based on generic credibility signals; adversarial manipulation of search rankings could inject low‑quality evidence.
  • Domain Generalization – While evaluated on U.S. political tweets, performance on non‑political or non‑English content remains untested.
  • Human Oversight – The authors note that a fallback human review step is still advisable for high‑impact claims, especially where legal liability is a concern.
  • Future Directions – Planned extensions include incorporating multi‑modal evidence (images, videos), tighter integration with platform policy engines, and adversarial robustness testing against misinformation campaigns.

Authors

  • Sahajpreet Singh
  • Kokil Jaidka
  • Min-Yen Kan

Paper Information

  • arXiv ID: 2602.08945v1
  • Categories: cs.CL, cs.CY
  • Published: February 9, 2026
  • PDF: https://arxiv.org/pdf/2602.08945v1