[Paper] GitSearch: Enhancing Community Notes Generation with Gap-Informed Targeted Search
Source: arXiv - 2602.08945v1
Overview
The paper presents GitSearch, a “gap‑informed” system that automatically generates Community Notes (the crowd‑sourced fact‑checking annotations used on platforms such as X/Twitter). By first identifying what information a post is missing, then retrieving targeted evidence from the web, and finally drafting a platform‑compliant note, GitSearch greatly expands coverage while producing notes rated more helpful than many human‑written ones.
Key Contributions
- Gap‑Informed Retrieval Pipeline – Introduces a three‑stage workflow (gap detection → targeted web search → note synthesis) that treats human‑perceived information deficits as first‑class signals.
- PolBench Dataset – Releases a large‑scale benchmark (78,698 U.S. political tweets paired with Community Notes) for evaluating automated note‑generation systems.
- Coverage Boost – Achieves ~99 % coverage of tweets needing notes, nearly double the best prior AI baseline.
- Quality Gains – Outperforms human‑authored helpful notes in head‑to‑head A/B tests (69 % win rate) and attains higher helpfulness scores (3.87 vs. 3.36 on a 5‑point scale).
- Real‑Time Targeted Search – Demonstrates that on‑the‑fly web retrieval, guided by identified gaps, can supply reliable evidence without pre‑indexed corpora.
Methodology
- Gap Identification – A lightweight classifier scans a tweet and flags specific deficits (e.g., missing context, unsupported claim, ambiguous source). The output is a structured “gap schema” that the rest of the system consumes.
- Targeted Web Retrieval – For each gap, GitSearch formulates a concise, intent‑rich query (e.g., “official statement on X policy March 2024”) and sends it to a commercial search API. The top‑k results are re‑ranked using a relevance model that weighs factual alignment and source credibility.
- Note Synthesis – A language model (fine‑tuned on existing Community Notes) takes the original tweet, the gap schema, and the retrieved snippets, then generates a note that satisfies the platform’s style constraints (concise, neutral, citation‑rich). A post‑processing filter ensures no policy violations (e.g., hate speech, personal attacks).
All components run in a streaming fashion, keeping end‑to‑end latency low enough for real‑time moderation pipelines.
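The three stages could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Gap`/`GapSchema` structures, the query templates in `build_query`, and the toy relevance-plus-credibility scoring in `rerank` are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Gap:
    """One information deficit detected in a post."""
    kind: str   # e.g. "missing_context", "unsupported_claim", "ambiguous_source"
    claim: str  # the span of the tweet the gap refers to

@dataclass
class GapSchema:
    """Structured output of stage 1, consumed by the rest of the pipeline."""
    tweet_id: str
    gaps: list = field(default_factory=list)

def build_query(gap: Gap) -> str:
    """Stage 2a: turn a detected gap into a concise, intent-rich search query.
    The templates are illustrative placeholders."""
    templates = {
        "missing_context": "background on {claim}",
        "unsupported_claim": "evidence for {claim}",
        "ambiguous_source": "original source of {claim}",
    }
    return templates.get(gap.kind, "{claim} fact check").format(claim=gap.claim)

def rerank(snippets: list, gap: Gap, credibility: dict) -> list:
    """Stage 2b: re-rank top-k search snippets by a toy blend of
    token overlap with the claim (relevance) and a per-domain credibility prior."""
    claim_tokens = set(gap.claim.lower().split())
    def score(snippet):
        overlap = len(claim_tokens & set(snippet["text"].lower().split()))
        return overlap + 2.0 * credibility.get(snippet["domain"], 0.0)
    return sorted(snippets, key=score, reverse=True)
```

Stage 3 (note synthesis) would then prompt a fine-tuned language model with the original tweet, the `GapSchema`, and the top re-ranked snippets, followed by a policy-violation filter on the output.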
Results & Findings
- Coverage: GitSearch produces notes for 99 % of the 78,698 tweets in PolBench, compared to ~52 % for the previous state‑of‑the‑art model.
- Helpfulness: In a crowd‑sourced evaluation, the system’s notes received an average helpfulness rating of 3.87/5, surpassing the human baseline of 3.36/5.
- Win‑Rate Against Humans: In pairwise comparisons, GitSearch notes were preferred over human‑written notes 69 % of the time.
- Retrieval Effectiveness: The targeted search module achieved a precision@5 of 0.78 for gap‑relevant evidence, indicating that most retrieved snippets were directly usable for note composition.
- Latency: End‑to‑end processing averaged 1.8 seconds per tweet, fitting comfortably within typical moderation time windows.
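The precision@5 figure above is the standard top-k retrieval metric: the fraction of the top five retrieved snippets judged gap-relevant. A minimal sketch (the relevance labels here are made up for illustration):

```python
def precision_at_k(relevances: list, k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant.
    `relevances` is a ranked list of 0/1 relevance judgments."""
    top = relevances[:k]
    return sum(top) / len(top) if top else 0.0

# Example: 4 of the top 5 snippets relevant -> precision@5 = 0.8
print(precision_at_k([1, 1, 1, 0, 1, 0, 0], k=5))
```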
Practical Implications
- Scalable Fact‑Checking – Platforms can deploy GitSearch to automatically generate high‑quality Community Notes for the vast majority of posts, reducing reliance on volunteer moderators.
- Developer Integration – The pipeline is API‑first: gap detection, search, and synthesis are exposed as micro‑services, making it straightforward to plug into existing moderation stacks or content‑creation tools.
- Improved User Trust – By delivering timely, evidence‑backed notes, platforms can curb misinformation spread faster, potentially lowering the virality of false claims.
- Customizable Gap Schemas – Organizations can extend the gap taxonomy (e.g., “missing legal context”) to suit domain‑specific moderation policies without retraining the whole system.
- Open Benchmark – PolBench gives engineers a concrete dataset to benchmark their own retrieval‑augmented generation (RAG) models against a real‑world moderation task.
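Extending the gap taxonomy as described above could be handled as configuration rather than retraining. The taxonomy entries and `extend_taxonomy` helper below are hypothetical, added only to illustrate the idea:

```python
# A base taxonomy of gap kinds (entries are illustrative, not from the paper).
BASE_TAXONOMY = {
    "missing_context": "Post omits relevant background.",
    "unsupported_claim": "Assertion lacks cited evidence.",
    "ambiguous_source": "Origin of the claim is unclear.",
}

def extend_taxonomy(base: dict, custom: dict) -> dict:
    """Merge organization-specific gap kinds into the base taxonomy,
    rejecting collisions with existing kinds."""
    merged = dict(base)
    for kind, description in custom.items():
        if kind in merged:
            raise ValueError(f"gap kind already defined: {kind}")
        merged[kind] = description
    return merged

# A domain-specific extension, e.g. for legal-content moderation.
legal_taxonomy = extend_taxonomy(
    BASE_TAXONOMY,
    {"missing_legal_context": "Claim omits the governing statute or ruling."},
)
```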
Limitations & Future Work
- Source Reliability – The current system trusts top‑ranked search results based on generic credibility signals; adversarial manipulation of search rankings could inject low‑quality evidence.
- Domain Generalization – While evaluated on U.S. political tweets, performance on non‑political or non‑English content remains untested.
- Human Oversight – The authors note that a fallback human review step is still advisable for high‑impact claims, especially where legal liability is a concern.
- Future Directions – Planned extensions include incorporating multi‑modal evidence (images, videos), tighter integration with platform policy engines, and adversarial robustness testing against misinformation campaigns.
Authors
- Sahajpreet Singh
- Kokil Jaidka
- Min-Yen Kan
Paper Information
- arXiv ID: 2602.08945v1
- Categories: cs.CL, cs.CY
- Published: February 9, 2026