[Paper] GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Published: May 8, 2026 at 12:44 PM EDT

Source: arXiv - 2605.07982v1

Overview

The paper introduces GLiGuard, a lightweight (0.3 B parameters) encoder‑only model that treats safety‑related content moderation as a classification problem rather than a generative one. By embedding task definitions and label meanings directly into the input as a structured “schema,” GLiGuard can evaluate dozens of safety dimensions in a single forward pass, delivering guard‑rail performance comparable to much larger (7 B–27 B) decoder models while slashing latency and cost.

Key Contributions

Schema‑conditioned encoding: Packs prompt‑safety, response‑safety, refusal detection, 14 fine‑grained harm categories, and 11 jailbreak strategies into a single token schema fed to a bidirectional encoder.
Compact architecture: Adapts the GLiNER2 encoder (≈300 M parameters) to safety classification, achieving a 23–90× size reduction versus typical guard models.
Multi‑aspect evaluation in one pass: Simultaneously predicts all safety signals without autoregressive decoding, enabling up to 16× higher throughput and 17× lower latency.
Competitive accuracy: Achieves F1 scores competitive with 7 B–27 B decoder‑based guards across nine established safety benchmarks.
Open‑source release: Code, pretrained weights, and schema templates are publicly available, encouraging reproducibility and community extensions.

Methodology

Task framing – The authors recast safety moderation as a multi‑label classification problem. Each safety dimension (e.g., “sexual content,” “political persuasion,” “jailbreak attempt”) is treated as a binary label.
Schema construction – For a given user prompt and model response, a structured token schema is built that includes:
- A task definition block describing what the model should assess (e.g., “Is the response safe?”).
- Label blocks that enumerate possible categories with short textual descriptors.
- The input text (prompt + response) placed after the schema.
  This schema is tokenized and fed to the encoder as a single sequence.
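The schema layout described above can be sketched as a simple string builder. This is only an illustration under stated assumptions: the marker tokens (`[TASK]`, `[LABEL]`, `[TEXT]`), the `Task` class, and the `build_schema_input` helper are hypothetical names, and the paper's actual serialization format is not given in this summary.

```python
from dataclasses import dataclass, field

# Hypothetical marker tokens; the real schema tokens are not specified here.
TASK_TOKEN = "[TASK]"
LABEL_TOKEN = "[LABEL]"
TEXT_TOKEN = "[TEXT]"


@dataclass
class Task:
    """One task block: a definition plus its labels with short descriptors."""
    definition: str
    labels: dict[str, str] = field(default_factory=dict)


def build_schema_input(tasks: list[Task], prompt: str, response: str) -> str:
    """Serialize task blocks, label blocks, then the input text into one sequence."""
    parts = []
    for task in tasks:
        parts.append(f"{TASK_TOKEN} {task.definition}")
        for name, descriptor in task.labels.items():
            parts.append(f"{LABEL_TOKEN} {name}: {descriptor}")
    # The text to classify is placed after the schema.
    parts.append(f"{TEXT_TOKEN} Prompt: {prompt} Response: {response}")
    return "\n".join(parts)


tasks = [
    Task(
        "Is the response safe?",
        {"safe": "no harmful content", "unsafe": "contains harmful content"},
    ),
]
sequence = build_schema_input(tasks, "How do I reset my password?", "Open Settings.")
print(sequence)
```

In this sketch, adding a safety category amounts to adding one more entry to a task's `labels` dictionary before the sequence is built.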
Encoder adaptation – Starting from GLiNER2, the authors fine‑tune the bidirectional encoder on a curated safety dataset covering the 14 harm categories and 11 jailbreak tactics. The model outputs a vector of logits, one per label, which are thresholded to produce binary decisions.
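The thresholding step can be illustrated with a minimal sketch. It assumes each label's logit is passed through an independent sigmoid and compared against a shared 0.5 cutoff, which is a common multi‑label setup; this summary does not state the activation or threshold the authors use, and the label names and logit values below are made up.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decide(logits: dict[str, float], threshold: float = 0.5) -> dict[str, bool]:
    """Turn one logit per label into one binary decision per label."""
    return {label: sigmoid(z) >= threshold for label, z in logits.items()}


# Hypothetical labels and logits for a single prompt/response pair.
example_logits = {"violence": 2.0, "self_harm": -3.0, "jailbreak_roleplay": 0.0}
decisions = decide(example_logits)
print(decisions)
```

Keeping the sigmoid outputs around instead of only the booleans would give downstream systems a per‑label confidence score.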
Inference flexibility – Because the schema lives in the input, new safety categories can be added or existing ones modified without retraining the core model—just by editing the schema text.
Evaluation – The system is benchmarked on nine public safety datasets (e.g., SafeRLHF, Anthropic’s HH, jailbreak corpora). Metrics focus on macro‑averaged F1, latency (ms), and throughput (queries / second).
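For readers unfamiliar with the metric, here is a small sketch of macro‑averaged F1 under one common reading: compute F1 separately for each label, then take the unweighted mean. Whether the paper averages over labels, classes, or benchmarks is not stated in this summary, and the toy data is invented.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Unweighted mean of per-label F1 over a multi-label dataset."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / n_labels


# Toy data: four examples, two labels each.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [0, 1]]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Macro averaging weights every label equally, so a rare harm category counts as much as a frequent one.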

Results & Findings

Metric	GLiGuard (0.3 B)	7 B Decoder Guard	27 B Decoder Guard
Avg. F1 (across benchmarks)	0.84	0.85	0.86
Latency (ms per query)	≈30	≈500	≈800
Throughput (qps on A100)	≈1,200	≈75	≈45
Parameter count	300 M	7 B	27 B

Accuracy: GLiGuard’s average F1 is within 0.01–0.02 (1–2 points) of the much larger models, demonstrating that a well‑conditioned encoder can capture nuanced safety signals.
Speed: Non‑autoregressive inference yields up to 16× higher throughput and 17× lower latency, making real‑time moderation feasible even at massive request volumes.
Scalability: Adding new label blocks to the schema does not degrade performance, confirming the flexibility of the design.
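The ratios quoted above can be checked with simple division over the rounded table entries. The sketch below does only that arithmetic; the dictionary and function names are illustrative, and the inputs are the approximate values from the table, not the paper's raw measurements.

```python
# Rounded values copied from the results table above.
GLIGUARD = {"latency_ms": 30, "qps": 1200, "params": 300_000_000}
DECODER_7B = {"latency_ms": 500, "qps": 75, "params": 7_000_000_000}
DECODER_27B = {"latency_ms": 800, "qps": 45, "params": 27_000_000_000}


def ratios(baseline: dict, candidate: dict) -> dict:
    """How many times faster or smaller the candidate is than the baseline."""
    return {
        "latency_reduction": baseline["latency_ms"] / candidate["latency_ms"],
        "throughput_gain": candidate["qps"] / baseline["qps"],
        "size_reduction": baseline["params"] / candidate["params"],
    }


vs_7b = ratios(DECODER_7B, GLIGUARD)
vs_27b = ratios(DECODER_27B, GLIGUARD)
for name, r in [("vs 7 B", vs_7b), ("vs 27 B", vs_27b)]:
    print(name, {k: round(v, 1) for k, v in r.items()})
```

With these inputs, the 7 B comparison reproduces the quoted 16× throughput and roughly 17× latency figures, and the two size ratios reproduce the 23–90× range. The rounded 27 B column works out to about 27× on both speed metrics, so the headline speedups are best read against the paper's own measurements rather than this rounded table.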

Practical Implications

Cost‑effective moderation – Deploying a 300 M‑parameter guard reduces GPU memory footprints and cloud‑compute bills dramatically, enabling startups and edge services to embed safety checks without expensive hardware.
Real‑time user‑facing apps – Chatbots, code assistants, and generative search interfaces can enforce multi‑aspect safety policies without noticeable lag, improving user trust.
Rapid policy updates – Companies can roll out new safety categories (e.g., emerging disinformation tactics) by simply updating the schema template, sidestepping lengthy model retraining cycles.
Multi‑encoder pipelines – Because GLiGuard is encoder‑only, it can be stacked with other encoders (e.g., retrieval or embedding models) in a single inference graph, further streamlining end‑to‑end pipelines.
Open‑source ecosystem – The released codebase invites community contributions—custom schemas, domain‑specific fine‑tuning, or integration with existing LLM serving stacks (e.g., vLLM, TGI).
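To put the cost‑effective moderation point in rough numbers, the sketch below estimates the memory needed just to hold model weights at a given precision. It is a back‑of‑the‑envelope calculation, not a figure from the paper: it assumes dense weights, ignores activations, KV caches, and framework overhead, and the helper name is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


models = {
    "GLiGuard (0.3 B)": 300_000_000,
    "7 B decoder guard": 7_000_000_000,
    "27 B decoder guard": 27_000_000_000,
}
for name, n in models.items():
    print(f"{name}: {weight_memory_gb(n):.1f} GB in fp16")
```

At fp16 this gives about 0.6 GB of weights for the 0.3 B model versus about 14 GB and 54 GB for the 7 B and 27 B models.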

Limitations & Future Work

Domain coverage – The training data, while broad, may miss niche or rapidly evolving harmful content types; performance could degrade on out‑of‑distribution prompts.
Binary labeling granularity – The current schema outputs binary decisions per category; richer confidence scores or hierarchical labeling could improve downstream handling.
Encoder capacity ceiling – Although 0.3 B works well now, scaling to hundreds of safety dimensions may eventually require larger encoders or more sophisticated schema designs.
Adversarial robustness – The paper notes that sophisticated jailbreaks that deliberately obfuscate intent can still slip through; future work aims to incorporate adversarial training and dynamic schema adaptation.

Overall, GLiGuard demonstrates that a thoughtfully conditioned encoder can deliver industrial‑grade safety moderation at a fraction of the compute cost, opening the door for broader, real‑time deployment of trustworthy LLM services.

Authors

Urchade Zaratiana
Mary Newhauser
George Hurn-Maloney
Ash Lewis

Paper Information

arXiv ID: 2605.07982v1
Categories: cs.CL, cs.CR
Published: May 8, 2026