[Paper] GLiGuard: Schema-Conditioned Classification for LLM Safeguard
Source: arXiv - 2605.07982v1
Overview
The paper introduces GLiGuard, a lightweight (0.3 B parameters) encoder‑only model that treats safety‑related content moderation as a classification problem rather than a generative one. By embedding task definitions and label meanings directly into the input as a structured “schema,” GLiGuard can evaluate dozens of safety dimensions in a single forward pass, delivering guard‑rail performance comparable to much larger (7 B–27 B) decoder models while slashing latency and cost.
Key Contributions
- Schema‑conditioned encoding: Packs prompt‑safety, response‑safety, refusal detection, 14 fine‑grained harm categories, and 11 jailbreak strategies into a single token schema fed to a bidirectional encoder.
- Compact architecture: Adapts the GLiNER2 encoder (≈300 M parameters) to safety classification, achieving a 23–90× size reduction versus typical guard models.
- Multi‑aspect evaluation in one pass: Simultaneously predicts all safety signals without autoregressive decoding, enabling up to 16× higher throughput and 17× lower latency.
- Competitive accuracy: Matches or exceeds F1 scores of 7 B–27 B decoder‑based guards across nine established safety benchmarks.
- Open‑source release: Code, pretrained weights, and schema templates are publicly available, encouraging reproducibility and community extensions.
Methodology
- Task framing – The authors recast safety moderation as a multi‑label classification problem. Each safety dimension (e.g., “sexual content,” “political persuasion,” “jailbreak attempt”) is treated as a binary label.
- Schema construction – For a given user prompt and model response, a structured token schema is built that includes:
- A task definition block describing what the model should assess (e.g., “Is the response safe?”).
- Label blocks that enumerate possible categories with short textual descriptors.
- The input text (prompt + response) placed after the schema.
This schema is tokenized and fed to the encoder as a single sequence.
- Encoder adaptation – Starting from GLiNER2, the authors fine‑tune the bidirectional encoder on a curated safety dataset covering the 14 harm categories and 11 jailbreak tactics. The model outputs a vector of logits, one per label, which are thresholded to produce binary decisions.
- Inference flexibility – Because the schema lives in the input, new safety categories can be added or existing ones modified without retraining the core model—just by editing the schema text.
- Evaluation – The system is benchmarked on nine public safety datasets (e.g., SafeRLHF, Anthropic’s HH, jailbreak corpora). Metrics focus on macro‑averaged F1, latency (ms), and throughput (queries / second).
Results & Findings
| Metric | GLiGuard (0.3 B) | 7 B Decoder Guard | 27 B Decoder Guard |
|---|---|---|---|
| Avg. F1 (across benchmarks) | 0.84 | 0.85 | 0.86 |
| Latency (ms per query) | ≈30 ms | ≈500 ms | ≈800 ms |
| Throughput (qps on A100) | ≈1,200 | ≈75 | ≈45 |
| Parameter count | 300 M | 7 B | 27 B |
- Accuracy: GLiGuard’s F1 is within 1–2 % of the much larger models, demonstrating that a well‑conditioned encoder can capture nuanced safety signals.
- Speed: Non‑autoregressive inference yields up to 16× higher throughput and 17× lower latency, making real‑time moderation feasible even at massive request volumes.
- Scalability: Adding new label blocks to the schema does not degrade performance, confirming the flexibility of the design.
Practical Implications
- Cost‑effective moderation – Deploying a 300 M‑parameter guard reduces GPU memory footprints and cloud‑compute bills dramatically, enabling startups and edge services to embed safety checks without expensive hardware.
- Real‑time user‑facing apps – Chatbots, code assistants, and generative search interfaces can enforce multi‑aspect safety policies without noticeable lag, improving user trust.
- Rapid policy updates – Companies can roll out new safety categories (e.g., emerging disinformation tactics) by simply updating the schema template, sidestepping lengthy model retraining cycles.
- Multi‑modal pipelines – Because GLiGuard is encoder‑only, it can be stacked with other encoders (e.g., retrieval or embedding models) in a single inference graph, further streamlining end‑to‑end pipelines.
- Open‑source ecosystem – The released codebase invites community contributions—custom schemas, domain‑specific fine‑tuning, or integration with existing LLM serving stacks (e.g., vLLM, TGI).
Limitations & Future Work
- Domain coverage – The training data, while broad, may miss niche or rapidly evolving harmful content types; performance could degrade on out‑of‑distribution prompts.
- Binary labeling granularity – The current schema outputs binary decisions per category; richer confidence scores or hierarchical labeling could improve downstream handling.
- Encoder capacity ceiling – Although 0.3 B works well now, scaling to hundreds of safety dimensions may eventually require larger encoders or more sophisticated schema designs.
- Adversarial robustness – The paper notes that sophisticated jailbreaks that deliberately obfuscate intent can still slip through; future work aims to incorporate adversarial training and dynamic schema adaptation.
Overall, GLiGuard demonstrates that a thoughtfully conditioned encoder can deliver industrial‑grade safety moderation at a fraction of the compute cost, opening the door for broader, real‑time deployment of trustworthy LLM services.
Authors
- Urchade Zaratiana
- Mary Newhauser
- George Hurn-Maloney
- Ash Lewis
Paper Information
- arXiv ID: 2605.07982v1
- Categories: cs.CL, cs.CR
- Published: May 8, 2026
- PDF: Download PDF