[Paper] When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks
Source: arXiv - 2512.07684v1
Overview
Incivility, ranging from toxic language to personal attacks, continues to erode the health of online communities. While large language models (LLMs) dominate most text‑classification pipelines, they often ignore the relational context that shapes how hostile remarks spread. This paper demonstrates that a graph‑neural‑network (GNN) approach, which treats each comment as a node linked to others by textual similarity, can outperform 12 leading LLMs at spotting toxicity, aggression, and personal attacks on English Wikipedia, and it does so at a fraction of the inference cost.
Key Contributions
- Graph‑centric representation: Models each comment as a graph node and connects nodes based on semantic similarity, capturing conversational structure that pure‑text models miss.
- Dynamic attention fusion: Introduces a learnable attention mechanism that automatically balances node‑level (textual) features with topological (graph) cues during message passing (one plausible formalization follows this list).
- Comprehensive benchmark: Evaluates the GNN against 12 state‑of‑the‑art LLMs (e.g., GPT‑4, PaLM, LLaMA) on three incivility categories, reporting consistent gains across precision, recall, and F1.
- Efficiency advantage: Shows up to 6× lower latency and 4× lower GPU memory at inference time compared with the best‑performing LLM baselines.
- Open resources: Releases the constructed comment similarity graph, training scripts, and full prediction logs to foster reproducibility and downstream tooling.
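The fusion rule above is described only qualitatively; one plausible formalization, assuming a single scalar gate per node (the paper may parameterize it differently), is:

```latex
% Assumed form of the dynamic attention fusion: a learned scalar gate
% alpha_v blends a node's own features with its aggregated neighborhood.
\alpha_v = \sigma\!\left(\mathbf{a}^{\top}\left[\,W_s h_v \,\Vert\, W_n h_{\mathcal{N}(v)}\right]\right),
\qquad
h_v' = \alpha_v\, W_s h_v + (1 - \alpha_v)\, W_n h_{\mathcal{N}(v)}
```

Here h_v is the node's current embedding, h_N(v) the aggregated neighbor embedding, ‖ concatenation, σ the logistic sigmoid, and W_s, W_n, a learned parameters; the layer sketch in the Methodology section implements this same form.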
Methodology
- Data preprocessing – English Wikipedia talk‑page comments are labeled for toxicity, aggression, and personal attacks.
- Graph construction – Each comment becomes a node. Pairwise cosine similarity of sentence embeddings (e.g., SBERT) determines edge weights; a threshold prunes weak links, yielding a sparse, scalable graph (a construction sketch follows this list).
- Node encoding – A lightweight transformer encoder (≈ 12 M parameters) converts raw text into a fixed‑size vector.
- Message passing – A multi‑layer GNN (GraphSAGE‑style) aggregates neighbor information. At each layer, a dynamic attention module computes two scores: one for the node’s own embedding and one for the aggregated neighbor embedding, then blends them (a PyTorch sketch of one such layer follows the pipeline note below).
- Classification head – The final node representation feeds into three sigmoid outputs (one per incivility type). The model is trained with a weighted binary cross‑entropy loss to handle class imbalance.
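A minimal construction sketch for the graph step, assuming SBERT embeddings via the sentence-transformers library and an illustrative threshold (the paper's exact encoder and threshold are not reproduced here):

```python
# Hypothetical graph-construction sketch: embed comments with SBERT and
# connect pairs whose cosine similarity clears a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_similarity_graph(comments, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    # Normalized embeddings make the dot product equal cosine similarity.
    emb = model.encode(comments, normalize_embeddings=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, 0.0)               # no self-loops
    src, dst = np.nonzero(sims >= threshold)  # prune weak links -> sparse graph
    return emb, np.stack([src, dst]), sims[src, dst]

emb, edge_index, edge_weight = build_similarity_graph(
    ["You are an idiot.", "That edit was vandalism.", "Thanks for the fix!"]
)
```

The dense n × n similarity matrix is fine at this scale; the Limitations section returns to what happens when it is not.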
The pipeline is end‑to‑end differentiable, yet it remains modular: developers can swap the text encoder or the graph aggregation scheme without redesigning the whole system.
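To make the message passing and classification head concrete, here is a minimal PyTorch sketch of a GraphSAGE‑style layer with the gated self/neighbor blend formalized earlier, feeding three sigmoid outputs. Layer sizes, the gating form, and the class weights are illustrative assumptions rather than the authors' released code; the 384‑dim default matches the MiniLM embeddings in the previous sketch.

```python
# Hypothetical GraphSAGE-style layer: a learned gate blends each node's
# own embedding with its weighted-mean-aggregated neighborhood.
import torch
import torch.nn as nn

class DynAttnSAGELayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_neigh = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, 1)  # scores self vs. neighbor view

    def forward(self, h, edge_index, edge_weight):
        src, dst = edge_index              # edges point src -> dst
        # Weighted mean aggregation of neighbor embeddings.
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, h[src] * edge_weight.unsqueeze(-1))
        deg = torch.zeros(h.size(0), device=h.device)
        deg.index_add_(0, dst, edge_weight)
        agg = agg / deg.clamp(min=1e-6).unsqueeze(-1)
        hs, hn = self.w_self(h), self.w_neigh(agg)
        alpha = torch.sigmoid(self.gate(torch.cat([hs, hn], dim=-1)))
        return torch.relu(alpha * hs + (1 - alpha) * hn)

class IncivilityGNN(nn.Module):
    def __init__(self, dim=384, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(DynAttnSAGELayer(dim) for _ in range(num_layers))
        self.head = nn.Linear(dim, 3)      # toxicity, aggression, personal attack

    def forward(self, h, edge_index, edge_weight):
        for layer in self.layers:
            h = layer(h, edge_index, edge_weight)
        return self.head(h)                # logits, one per incivility type

# Weighted BCE for class imbalance; pos_weight values are placeholders.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0, 6.0, 8.0]))
```

Swapping the text encoder or the aggregation scheme, as the modularity note suggests, only touches the embedding step or DynAttnSAGELayer, respectively.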
Results & Findings
| Model | Toxicity F1 | Aggression F1 | Personal‑Attack F1 | Avg. Inference Latency (ms) |
|---|---|---|---|---|
| GPT‑4 (zero‑shot) | 0.71 | 0.68 | 0.65 | 210 |
| LLaMA‑13B (fine‑tuned) | 0.74 | 0.70 | 0.68 | 180 |
| Proposed GNN | 0.81 | 0.78 | 0.76 | 35 |
- Averaged over the three tasks, the GNN improves F1 by ~7–8 points over the strongest LLM baseline (fine‑tuned LLaMA‑13B) and by ~10–11 points over zero‑shot GPT‑4.
- Ablation studies reveal that removing the graph edges drops performance by ~5 F1 points, confirming the value of relational context.
- The dynamic attention module contributes ~2 F1 points over a static averaging scheme, indicating that the model benefits from adaptively weighting text vs. structure per comment.
- Memory footprint during inference stays under 2 GB on a single RTX 3080, compared to >8 GB for the LLMs.
Practical Implications
- Moderation tooling: Platforms can embed the GNN as a lightweight microservice that flags potentially uncivil comments in real time, reducing reliance on costly LLM APIs.
- Scalable pipelines: Because the graph can be incrementally updated (new comments added as nodes, edges recomputed locally), the system scales to high‑traffic forums without rebuilding the entire model (a minimal update sketch follows this list).
- Explainability: Edge weights expose which past comments influenced a prediction, giving moderators a traceable “reason” that pure LLM scores lack.
- Cost savings: With up to 6× lower latency and GPU usage, organizations can cut cloud inference bills dramatically while achieving higher detection quality.
- Cross‑domain adaptability: The same graph‑centric recipe can be applied to other behavioral signals—spam, misinformation, or hate speech—by simply redefining edge criteria (e.g., user interaction graphs, temporal proximity).
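A minimal sketch of the incremental update mentioned above, reusing the numpy structures from the construction sketch (the local-recomputation policy is an assumption about one reasonable implementation, not the paper's released code):

```python
# Hypothetical incremental update: a new comment only adds its own edges,
# so the rest of the graph and the trained model stay untouched.
import numpy as np

def add_comment(emb, edge_index, edge_weight, new_emb, threshold=0.6):
    new_id = emb.shape[0]
    sims = emb @ new_emb                     # cosine sims vs. existing nodes
    nbrs = np.nonzero(sims >= threshold)[0]  # local edge recomputation only
    # Add symmetric edges between the new node and its neighbors.
    src = np.concatenate([nbrs, np.full(len(nbrs), new_id)])
    dst = np.concatenate([np.full(len(nbrs), new_id), nbrs])
    edge_index = np.concatenate([edge_index, np.stack([src, dst])], axis=1)
    edge_weight = np.concatenate([edge_weight, sims[nbrs], sims[nbrs]])
    return np.vstack([emb, new_emb]), edge_index, edge_weight
```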
Limitations & Future Work
- Graph construction overhead: Computing all‑pairs similarities over massive comment streams can become a bottleneck; the authors suggest approximate nearest‑neighbor indexing as a next step (a sketch follows this list).
- Language scope: Experiments are limited to English Wikipedia; multilingual extensions will require language‑agnostic similarity measures.
- Edge definition rigidity: Using only textual similarity may miss other relational cues (e.g., reply‑to structure, user reputation). Future work could fuse these signals as distinct edge types in a richer heterogeneous graph.
- Robustness to adversarial attacks: The paper notes that deliberately crafted comments that mimic benign language could evade similarity‑based edges; adversarial training is proposed as a mitigation strategy.
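For the graph‑construction bottleneck in the first bullet, an approximate nearest‑neighbor index is the natural drop‑in for the all‑pairs similarity scan; here is a sketch using FAISS's HNSW index (the authors only suggest ANN indexing in general, so the index type and parameters are assumptions):

```python
# Hypothetical ANN-based edge construction with FAISS: query each comment's
# k nearest neighbors instead of scanning all O(n^2) pairs.
import faiss
import numpy as np

def ann_edges(emb, k=10, threshold=0.6):
    # emb must be L2-normalized; FAISS returns squared L2 distances d,
    # and for unit vectors cosine similarity s = 1 - d / 2.
    emb = np.ascontiguousarray(emb, dtype=np.float32)
    index = faiss.IndexHNSWFlat(emb.shape[1], 32)  # 32 graph links per point
    index.add(emb)
    dists, nbrs = index.search(emb, k + 1)         # +1: each point finds itself
    sims = 1.0 - dists / 2.0
    src = np.repeat(np.arange(len(emb)), k + 1)
    flat_n, flat_s = nbrs.ravel(), sims.ravel()
    keep = (flat_n != src) & (flat_n >= 0) & (flat_s >= threshold)
    return np.stack([src[keep], flat_n[keep]]), flat_s[keep]
```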
Authors
- Zihan Chen
- Lanyu Yu
Paper Information
- arXiv ID: 2512.07684v1
- Categories: cs.CL, cs.AI, cs.SI
- Published: December 8, 2025
- PDF: https://arxiv.org/pdf/2512.07684v1