[Paper] When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks
Source: arXiv - 2512.07684v1
Overview
Incivility, ranging from toxic language to personal attacks, continues to erode the health of online communities. While large language models (LLMs) dominate most text‑classification pipelines, they often ignore the relational context that shapes how hostile remarks spread. This paper demonstrates that a graph‑neural‑network (GNN) approach, which treats each comment as a node linked to others by textual similarity, can outperform 12 leading LLMs at spotting toxicity, aggression, and personal attacks on English Wikipedia, and it does so at a fraction of the inference cost.
Key Contributions
- Graph‑centric representation: Models each comment as a graph node and connects nodes based on semantic similarity, capturing conversational structure that pure‑text models miss.
- Dynamic attention fusion: Introduces a learnable attention mechanism that automatically balances node‑level (textual) features with topological (graph) cues during message passing (one plausible formalization follows this list).
- Comprehensive benchmark: Evaluates the GNN against 12 state‑of‑the‑art LLMs (e.g., GPT‑4, PaLM, LLaMA) on three incivility categories, reporting consistent gains across precision, recall, and F1.
- Efficiency advantage: Shows up to 6× lower latency and 4× lower GPU memory at inference time compared with the best‑performing LLM baselines.
- Open resources: Releases the constructed comment similarity graph, training scripts, and full prediction logs to foster reproducibility and downstream tooling.
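The fusion rule above is described only qualitatively; one plausible formalization, assuming a single scalar gate per node (the paper may parameterize it differently), is:

```latex
% Assumed form of the dynamic attention fusion: a learned scalar gate
% alpha_v blends a node's own features with its aggregated neighborhood.
\alpha_v = \sigma\!\left(\mathbf{a}^{\top}\left[\,W_s h_v \,\Vert\, W_n h_{\mathcal{N}(v)}\right]\right),
\qquad
h_v' = \alpha_v\, W_s h_v + (1 - \alpha_v)\, W_n h_{\mathcal{N}(v)}
```

Here h_v is the node's current embedding, h_N(v) the aggregated neighbor embedding, ‖ concatenation, σ the logistic sigmoid, and W_s, W_n, a learned parameters; the layer sketch in the Methodology section implements this same form.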
Methodology
- Data preprocessing – English Wikipedia talk‑page comments are labeled for toxicity, aggression, and personal attacks.
- Graph construction – Each comment becomes a node. Pairwise cosine similarity of sentence embeddings (e.g., SBERT) determines edge weights; a threshold prunes weak links, yielding a sparse, scalable graph (a construction sketch follows this list).
- Node encoding – A lightweight transformer encoder (≈ 12 M parameters) converts raw text into a fixed‑size vector.
- Message passing – A multi‑layer GNN (GraphSAGE‑style) aggregates neighbor information. At each layer, a dynamic attention module computes two scores: one for the node’s own embedding and one for the aggregated neighbor embedding, then blends them (a PyTorch sketch of one such layer follows the pipeline note below).
- Classification head – The final node representation feeds into three sigmoid outputs (one per incivility type). The model is trained with a weighted binary cross‑entropy loss to handle class imbalance.
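A minimal construction sketch for the graph step, assuming SBERT embeddings via the sentence-transformers library and an illustrative threshold (the paper's exact encoder and threshold are not reproduced here):

```python
# Hypothetical graph-construction sketch: embed comments with SBERT and
# connect pairs whose cosine similarity clears a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_similarity_graph(comments, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    # Normalized embeddings make the dot product equal cosine similarity.
    emb = model.encode(comments, normalize_embeddings=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, 0.0)               # no self-loops
    src, dst = np.nonzero(sims >= threshold)  # prune weak links -> sparse graph
    return emb, np.stack([src, dst]), sims[src, dst]

emb, edge_index, edge_weight = build_similarity_graph(
    ["You are an idiot.", "That edit was vandalism.", "Thanks for the fix!"]
)
```

The dense n × n similarity matrix is fine at this scale; the Limitations section returns to what happens when it is not.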
The pipeline is end‑to‑end differentiable, yet it remains modular: developers can swap the text encoder or the graph aggregation scheme without redesigning the whole system.
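To make the message passing and classification head concrete, here is a minimal PyTorch sketch of a GraphSAGE‑style layer with the gated self/neighbor blend formalized earlier, feeding three sigmoid outputs. Layer sizes, the gating form, and the class weights are illustrative assumptions rather than the authors' released code; the 384‑dim default matches the MiniLM embeddings in the previous sketch.

```python
# Hypothetical GraphSAGE-style layer: a learned gate blends each node's
# own embedding with its weighted-mean-aggregated neighborhood.
import torch
import torch.nn as nn

class DynAttnSAGELayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_neigh = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, 1)  # scores self vs. neighbor view

    def forward(self, h, edge_index, edge_weight):
        src, dst = edge_index              # edges point src -> dst
        # Weighted mean aggregation of neighbor embeddings.
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, h[src] * edge_weight.unsqueeze(-1))
        deg = torch.zeros(h.size(0), device=h.device)
        deg.index_add_(0, dst, edge_weight)
        agg = agg / deg.clamp(min=1e-6).unsqueeze(-1)
        hs, hn = self.w_self(h), self.w_neigh(agg)
        alpha = torch.sigmoid(self.gate(torch.cat([hs, hn], dim=-1)))
        return torch.relu(alpha * hs + (1 - alpha) * hn)

class IncivilityGNN(nn.Module):
    def __init__(self, dim=384, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(DynAttnSAGELayer(dim) for _ in range(num_layers))
        self.head = nn.Linear(dim, 3)      # toxicity, aggression, personal attack

    def forward(self, h, edge_index, edge_weight):
        for layer in self.layers:
            h = layer(h, edge_index, edge_weight)
        return self.head(h)                # logits, one per incivility type

# Weighted BCE for class imbalance; pos_weight values are placeholders.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0, 6.0, 8.0]))
```

Swapping the text encoder or the aggregation scheme, as the modularity note suggests, only touches the embedding step or DynAttnSAGELayer, respectively.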
Results & Findings
| Model | Toxicity F1 | Aggression F1 | Personal‑Attack F1 | Avg. Inference Latency (ms) |
|---|---|---|---|---|
| GPT‑4 (zero‑shot) | 0.71 | 0.68 | 0.65 | 210 |
| LLaMA‑13B (fine‑tuned) | 0.74 | 0.70 | 0.68 | 180 |
| Proposed GNN | 0.81 | 0.78 | 0.76 | 35 |
- Averaged over the three tasks, the GNN improves F1 by ~7–8 points over the strongest LLM baseline (fine‑tuned LLaMA‑13B) and by ~10–11 points over zero‑shot GPT‑4.
- Ablation studies reveal that removing the graph edges drops performance by ~5 F1 points, confirming the value of relational context.
- The dynamic attention module contributes ~2 F1 points over a static averaging scheme, indicating that the model benefits from adaptively weighting text vs. structure per comment.
- Memory footprint during inference stays under 2 GB on a single RTX 3080, compared to >8 GB for the LLMs.
Practical Implications
- Moderation tooling: Platforms can embed the GNN as a lightweight microservice that flags potentially uncivil comments in real time, reducing reliance on costly LLM APIs.
- Scalable pipelines: Because the graph can be incrementally updated (new comments added as nodes, edges recomputed locally), the system scales to high‑traffic forums without rebuilding the entire model (a minimal update sketch follows this list).
- Explainability: Edge weights expose which past comments influenced a prediction, giving moderators a traceable “reason” that pure LLM scores lack.
- Cost savings: With up to 6× lower latency and GPU usage, organizations can cut cloud inference bills dramatically while achieving higher detection quality.
- Cross‑domain adaptability: The same graph‑centric recipe can be applied to other behavioral signals—spam, misinformation, or hate speech—by simply redefining edge criteria (e.g., user interaction graphs, temporal proximity).
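A minimal sketch of the incremental update mentioned above, reusing the numpy structures from the construction sketch (the local-recomputation policy is an assumption about one reasonable implementation, not the paper's released code):

```python
# Hypothetical incremental update: a new comment only adds its own edges,
# so the rest of the graph and the trained model stay untouched.
import numpy as np

def add_comment(emb, edge_index, edge_weight, new_emb, threshold=0.6):
    new_id = emb.shape[0]
    sims = emb @ new_emb                     # cosine sims vs. existing nodes
    nbrs = np.nonzero(sims >= threshold)[0]  # local edge recomputation only
    # Add symmetric edges between the new node and its neighbors.
    src = np.concatenate([nbrs, np.full(len(nbrs), new_id)])
    dst = np.concatenate([np.full(len(nbrs), new_id), nbrs])
    edge_index = np.concatenate([edge_index, np.stack([src, dst])], axis=1)
    edge_weight = np.concatenate([edge_weight, sims[nbrs], sims[nbrs]])
    return np.vstack([emb, new_emb]), edge_index, edge_weight
```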
Limitations & Future Work
- Graph construction overhead: Computing all‑pairs similarities over massive comment streams can become a bottleneck; the authors suggest approximate nearest‑neighbor indexing as a next step (a sketch follows this list).
- Language scope: Experiments are limited to English Wikipedia; multilingual extensions will require language‑agnostic similarity measures.
- Edge definition rigidity: Using only textual similarity may miss other relational cues (e.g., reply‑to structure, user reputation). Future work could fuse these signals as distinct edge types in a richer heterogeneous graph.
- Robustness to adversarial attacks: The paper notes that deliberately crafted comments that mimic benign language could evade similarity‑based edges; adversarial training is proposed as a mitigation strategy.
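For the graph‑construction bottleneck in the first bullet, an approximate nearest‑neighbor index is the natural drop‑in for the all‑pairs similarity scan; here is a sketch using FAISS's HNSW index (the authors only suggest ANN indexing in general, so the index type and parameters are assumptions):

```python
# Hypothetical ANN-based edge construction with FAISS: query each comment's
# k nearest neighbors instead of scanning all O(n^2) pairs.
import faiss
import numpy as np

def ann_edges(emb, k=10, threshold=0.6):
    # emb must be L2-normalized; FAISS returns squared L2 distances d,
    # and for unit vectors cosine similarity s = 1 - d / 2.
    emb = np.ascontiguousarray(emb, dtype=np.float32)
    index = faiss.IndexHNSWFlat(emb.shape[1], 32)  # 32 graph links per point
    index.add(emb)
    dists, nbrs = index.search(emb, k + 1)         # +1: each point finds itself
    sims = 1.0 - dists / 2.0
    src = np.repeat(np.arange(len(emb)), k + 1)
    flat_n, flat_s = nbrs.ravel(), sims.ravel()
    keep = (flat_n != src) & (flat_n >= 0) & (flat_s >= threshold)
    return np.stack([src[keep], flat_n[keep]]), flat_s[keep]
```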
Authors
- Zihan Chen
- Lanyu Yu
Paper Information
- arXiv ID: 2512.07684v1
- Categories: cs.CL, cs.AI, cs.SI
- Published: December 8, 2025
- PDF: https://arxiv.org/pdf/2512.07684v1