[Paper] GitSearch: Enhancing Community Notes Generation with Gap-Informed Targeted Search
Source: arXiv - 2602.08945v1
Overview
The paper presents GitSearch, a “gap‑informed” system that automatically generates Community Notes (the crowd‑sourced fact‑checking annotations used on platforms such as X/Twitter). By first identifying what information a post is missing, then retrieving targeted evidence from the web, and finally drafting a platform‑compliant note, GitSearch greatly expands coverage while producing notes rated more helpful than many human‑written ones.
Key Contributions
- Gap‑Informed Retrieval Pipeline – Introduces a three‑stage workflow (gap detection → targeted web search → note synthesis) that treats human‑perceived information deficits as first‑class signals.
- PolBench Dataset – Releases a large‑scale benchmark (78,698 U.S. political tweets paired with Community Notes) for evaluating automated note‑generation systems.
- Coverage Boost – Achieves ~99 % coverage of tweets needing notes, nearly double the best prior AI baseline.
- Quality Gains – Outperforms human‑authored helpful notes in head‑to‑head A/B tests (69 % win rate) and attains higher helpfulness scores (3.87 vs. 3.36 on a 5‑point scale).
- Real‑Time Targeted Search – Demonstrates that on‑the‑fly web retrieval, guided by identified gaps, can supply reliable evidence without pre‑indexed corpora.
Methodology
- Gap Identification – A lightweight classifier scans a tweet and flags specific deficits (e.g., missing context, unsupported claim, ambiguous source). The output is a structured “gap schema” that the rest of the system consumes.
- Targeted Web Retrieval – For each gap, GitSearch formulates a concise, intent‑rich query (e.g., “official statement on X policy March 2024”) and sends it to a commercial search API. The top‑k results are re‑ranked using a relevance model that weighs factual alignment and source credibility.
- Note Synthesis – A language model (fine‑tuned on existing Community Notes) takes the original tweet, the gap schema, and the retrieved snippets, then generates a note that satisfies the platform’s style constraints (concise, neutral, citation‑rich). A post‑processing filter ensures no policy violations (e.g., hate speech, personal attacks).
All components run in a streaming fashion, keeping end‑to‑end latency low enough for real‑time moderation pipelines.
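The three stages could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Gap`/`GapSchema` structures, the query templates in `build_query`, and the toy relevance-plus-credibility scoring in `rerank` are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Gap:
    """One information deficit detected in a post."""
    kind: str   # e.g. "missing_context", "unsupported_claim", "ambiguous_source"
    claim: str  # the span of the tweet the gap refers to

@dataclass
class GapSchema:
    """Structured output of stage 1, consumed by the rest of the pipeline."""
    tweet_id: str
    gaps: list = field(default_factory=list)

def build_query(gap: Gap) -> str:
    """Stage 2a: turn a detected gap into a concise, intent-rich search query.
    The templates are illustrative placeholders."""
    templates = {
        "missing_context": "background on {claim}",
        "unsupported_claim": "evidence for {claim}",
        "ambiguous_source": "original source of {claim}",
    }
    return templates.get(gap.kind, "{claim} fact check").format(claim=gap.claim)

def rerank(snippets: list, gap: Gap, credibility: dict) -> list:
    """Stage 2b: re-rank top-k search snippets by a toy blend of
    token overlap with the claim (relevance) and a per-domain credibility prior."""
    claim_tokens = set(gap.claim.lower().split())
    def score(snippet):
        overlap = len(claim_tokens & set(snippet["text"].lower().split()))
        return overlap + 2.0 * credibility.get(snippet["domain"], 0.0)
    return sorted(snippets, key=score, reverse=True)
```

Stage 3 (note synthesis) would then prompt a fine-tuned language model with the original tweet, the `GapSchema`, and the top re-ranked snippets, followed by a policy-violation filter on the output.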
Results & Findings
- Coverage: GitSearch produces notes for 99 % of the 78,698 tweets in PolBench, compared to ~52 % for the previous state‑of‑the‑art model.
- Helpfulness: In a crowd‑sourced evaluation, the system’s notes received an average helpfulness rating of 3.87/5, surpassing the human baseline of 3.36/5.
- Win‑Rate Against Humans: In pairwise comparisons, GitSearch notes were preferred over human‑written notes 69 % of the time.
- Retrieval Effectiveness: The targeted search module achieved a precision@5 of 0.78 for gap‑relevant evidence, indicating that most retrieved snippets were directly usable for note composition.
- Latency: End‑to‑end processing averaged 1.8 seconds per tweet, fitting comfortably within typical moderation time windows.
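The precision@5 figure above is the standard top-k retrieval metric: the fraction of the top five retrieved snippets judged gap-relevant. A minimal sketch (the relevance labels here are made up for illustration):

```python
def precision_at_k(relevances: list, k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant.
    `relevances` is a ranked list of 0/1 relevance judgments."""
    top = relevances[:k]
    return sum(top) / len(top) if top else 0.0

# Example: 4 of the top 5 snippets relevant -> precision@5 = 0.8
print(precision_at_k([1, 1, 1, 0, 1, 0, 0], k=5))
```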
Practical Implications
- Scalable Fact‑Checking – Platforms can deploy GitSearch to automatically generate high‑quality Community Notes for the vast majority of posts, reducing reliance on volunteer moderators.
- Developer Integration – The pipeline is API‑first: gap detection, search, and synthesis are exposed as micro‑services, making it straightforward to plug into existing moderation stacks or content‑creation tools.
- Improved User Trust – By delivering timely, evidence‑backed notes, platforms can curb misinformation spread faster, potentially lowering the virality of false claims.
- Customizable Gap Schemas – Organizations can extend the gap taxonomy (e.g., “missing legal context”) to suit domain‑specific moderation policies without retraining the whole system.
- Open Benchmark – PolBench gives engineers a concrete dataset to benchmark their own retrieval‑augmented generation (RAG) models against a real‑world moderation task.
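Extending the gap taxonomy as described above could be handled as configuration rather than retraining. The taxonomy entries and `extend_taxonomy` helper below are hypothetical, added only to illustrate the idea:

```python
# A base taxonomy of gap kinds (entries are illustrative, not from the paper).
BASE_TAXONOMY = {
    "missing_context": "Post omits relevant background.",
    "unsupported_claim": "Assertion lacks cited evidence.",
    "ambiguous_source": "Origin of the claim is unclear.",
}

def extend_taxonomy(base: dict, custom: dict) -> dict:
    """Merge organization-specific gap kinds into the base taxonomy,
    rejecting collisions with existing kinds."""
    merged = dict(base)
    for kind, description in custom.items():
        if kind in merged:
            raise ValueError(f"gap kind already defined: {kind}")
        merged[kind] = description
    return merged

# A domain-specific extension, e.g. for legal-content moderation.
legal_taxonomy = extend_taxonomy(
    BASE_TAXONOMY,
    {"missing_legal_context": "Claim omits the governing statute or ruling."},
)
```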
Limitations & Future Work
- Source Reliability – The current system trusts top‑ranked search results based on generic credibility signals; adversarial manipulation of search rankings could inject low‑quality evidence.
- Domain Generalization – While evaluated on U.S. political tweets, performance on non‑political or non‑English content remains untested.
- Human Oversight – The authors note that a fallback human review step is still advisable for high‑impact claims, especially where legal liability is a concern.
- Future Directions – Planned extensions include incorporating multi‑modal evidence (images, videos), tighter integration with platform policy engines, and adversarial robustness testing against misinformation campaigns.
Authors
- Sahajpreet Singh
- Kokil Jaidka
- Min-Yen Kan
Paper Information
- arXiv ID: 2602.08945v1
- Categories: cs.CL, cs.CY
- Published: February 9, 2026