[Paper] GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Published: May 8, 2026 at 12:44 PM EDT
Source: arXiv - 2605.07982v1

Overview

The paper introduces GLiGuard, a lightweight (0.3 B parameters) encoder‑only model that treats safety‑related content moderation as a classification problem rather than a generative one. By embedding task definitions and label meanings directly into the input as a structured “schema,” GLiGuard can evaluate dozens of safety dimensions in a single forward pass, delivering guard‑rail performance comparable to much larger (7 B–27 B) decoder models while slashing latency and cost.

Key Contributions

  • Schema‑conditioned encoding: Packs prompt‑safety, response‑safety, refusal detection, 14 fine‑grained harm categories, and 11 jailbreak strategies into a single token schema fed to a bidirectional encoder.
  • Compact architecture: Adapts the GLiNER2 encoder (≈300 M parameters) to safety classification, achieving a 23–90× size reduction versus typical guard models.
  • Multi‑aspect evaluation in one pass: Simultaneously predicts all safety signals without autoregressive decoding, enabling up to 16× higher throughput and 17× lower latency.
  • Competitive accuracy: Matches or exceeds F1 scores of 7 B–27 B decoder‑based guards across nine established safety benchmarks.
  • Open‑source release: Code, pretrained weights, and schema templates are publicly available, encouraging reproducibility and community extensions.

Methodology

  1. Task framing – The authors recast safety moderation as a multi‑label classification problem. Each safety dimension (e.g., “sexual content,” “political persuasion,” “jailbreak attempt”) is treated as a binary label.
  2. Schema construction – For a given user prompt and model response, a structured token schema is built that includes:
    • A task definition block describing what the model should assess (e.g., “Is the response safe?”).
    • Label blocks that enumerate possible categories with short textual descriptors.
    • The input text (prompt + response) placed after the schema.
     The resulting sequence (schema followed by the input text) is tokenized and fed to the encoder as a single input.
  3. Encoder adaptation – Starting from GLiNER2, the authors fine‑tune the bidirectional encoder on a curated safety dataset covering the 14 harm categories and 11 jailbreak tactics. The model outputs a vector of logits, one per label, which are thresholded to produce binary decisions.
  4. Inference flexibility – Because the schema lives in the input, new safety categories can be added or existing ones modified without retraining the core model—just by editing the schema text.
  5. Evaluation – The system is benchmarked on nine public safety datasets (e.g., SafeRLHF, Anthropic’s HH, jailbreak corpora). Metrics focus on macro‑averaged F1, latency (ms), and throughput (queries / second).
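
Steps 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the `[TASK]`, `[LABEL]`, and `[TEXT]` markers, the function names, the label descriptions, and the 0.5 threshold are all assumptions made for illustration, and the logits are placeholders standing in for real encoder output.

```python
import math


def build_schema(task, labels, text):
    """Render task definition, label blocks, and input text as one sequence.

    The [TASK]/[LABEL]/[TEXT] markers are hypothetical, not the paper's format.
    """
    lines = [f"[TASK] {task}"]
    for name, description in labels.items():
        lines.append(f"[LABEL] {name}: {description}")
    lines.append(f"[TEXT] {text}")
    return "\n".join(lines)


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decide(logits, labels, threshold=0.5):
    """Turn one logit per label into one binary decision per label."""
    if len(logits) != len(labels):
        raise ValueError("expected exactly one logit per label")
    return {name: sigmoid(z) >= threshold for name, z in zip(labels, logits)}


# Two illustrative labels; the descriptions are invented for this example.
labels = {
    "sexual_content": "explicit sexual material",
    "jailbreak_attempt": "tries to bypass safety rules",
}
sequence = build_schema(
    "Is the response safe?", labels, "Prompt: hello Response: hi there"
)

# Placeholder logits standing in for the encoder output (not real scores).
decisions = decide([-2.0, 1.5], labels)
print(sequence)
print(decisions)
```

Adding a category in this sketch means adding one entry to `labels`, which changes only the rendered sequence. That mirrors the inference flexibility described in step 4.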

Results & Findings

Metric                        GLiGuard (0.3 B)   7 B Decoder Guard   27 B Decoder Guard
Avg. F1 (across benchmarks)   0.84               0.85                0.86
Latency (ms per query)        ≈30                ≈500                ≈800
Throughput (qps on A100)      ≈1,200             ≈75                 ≈45
Parameter count               300 M              7 B                 27 B

  • Accuracy: GLiGuard’s average F1 (0.84) is within 0.01–0.02 of the much larger models (0.85 and 0.86), demonstrating that a well‑conditioned encoder can capture nuanced safety signals.
  • Speed: Non‑autoregressive inference yields up to 16× higher throughput and 17× lower latency, making real‑time moderation feasible even at massive request volumes.
  • Scalability: Adding new label blocks to the schema does not degrade performance, confirming the flexibility of the design.
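
The ratios quoted in this summary can be cross-checked against the approximate figures reported above. The sketch below copies those figures and divides them; the dictionary keys and the `ratios` helper are illustrative names, not part of the released code.

```python
# Approximate figures copied from the results reported in this section.
results = {
    "GLiGuard (0.3 B)": {"params_b": 0.3, "latency_ms": 30, "qps": 1200},
    "7 B decoder guard": {"params_b": 7.0, "latency_ms": 500, "qps": 75},
    "27 B decoder guard": {"params_b": 27.0, "latency_ms": 800, "qps": 45},
}


def ratios(baseline, candidate):
    """Return how much smaller and faster `candidate` is than `baseline`."""
    return {
        "size_reduction": baseline["params_b"] / candidate["params_b"],
        "latency_reduction": baseline["latency_ms"] / candidate["latency_ms"],
        "throughput_gain": candidate["qps"] / baseline["qps"],
    }


guard = results["GLiGuard (0.3 B)"]
for name, row in results.items():
    if row is guard:
        continue  # skip comparing GLiGuard with itself
    r = ratios(row, guard)
    print(
        f"vs {name}: {r['size_reduction']:.1f}x smaller, "
        f"{r['latency_reduction']:.1f}x lower latency, "
        f"{r['throughput_gain']:.1f}x higher throughput"
    )
```

With these inputs, the 7 B comparison gives a 16× throughput gain and a latency reduction of about 16.7×, in line with the quoted 16× and 17×. The size ratios come out at roughly 23× and 90×. The 27 B figures imply larger speed ratios than the quoted maxima, which suggests the reported values are rounded or measured under different conditions.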

Practical Implications

  • Cost‑effective moderation – Deploying a 300 M‑parameter guard reduces GPU memory footprints and cloud‑compute bills dramatically, enabling startups and edge services to embed safety checks without expensive hardware.
  • Real‑time user‑facing apps – Chatbots, code assistants, and generative search interfaces can enforce multi‑aspect safety policies without noticeable lag, improving user trust.
  • Rapid policy updates – Companies can roll out new safety categories (e.g., emerging disinformation tactics) by simply updating the schema template, sidestepping lengthy model retraining cycles.
  • Multi‑model pipelines – Because GLiGuard is encoder‑only, it can be stacked with other encoders (e.g., retrieval or embedding models) in a single inference graph, further streamlining end‑to‑end pipelines.
  • Open‑source ecosystem – The released codebase invites community contributions—custom schemas, domain‑specific fine‑tuning, or integration with existing LLM serving stacks (e.g., vLLM, TGI).
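
As one way to picture how a serving stack might consume the guard's outputs, the sketch below combines three of the signals named in this summary (prompt safety, response safety, refusal detection) into a single action. The `GuardSignals` class, the `gate` function, and the allow/review/block policy are hypothetical. The paper does not prescribe a downstream policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardSignals:
    """Binary decisions assumed to come from one guard forward pass."""

    prompt_unsafe: bool
    response_unsafe: bool
    is_refusal: bool


def gate(signals: GuardSignals) -> str:
    """Map guard decisions to an action under a hypothetical policy."""
    if signals.response_unsafe:
        # Never show a response the guard flags, whatever the prompt was.
        return "block"
    if signals.prompt_unsafe and not signals.is_refusal:
        # The model answered a risky prompt instead of refusing: escalate.
        return "review"
    return "allow"


# A harmful prompt that the model refused passes through unchanged.
action = gate(
    GuardSignals(prompt_unsafe=True, response_unsafe=False, is_refusal=True)
)
print(action)
```

Because all three signals are available after a single pass, a gate like this adds only a few comparisons on top of the guard's own latency.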

Limitations & Future Work

  • Domain coverage – The training data, while broad, may miss niche or rapidly evolving harmful content types; performance could degrade on out‑of‑distribution prompts.
  • Binary labeling granularity – The current schema outputs binary decisions per category; richer confidence scores or hierarchical labeling could improve downstream handling.
  • Encoder capacity ceiling – Although 0.3 B works well now, scaling to hundreds of safety dimensions may eventually require larger encoders or more sophisticated schema designs.
  • Adversarial robustness – The paper notes that sophisticated jailbreaks that deliberately obfuscate intent can still slip through; future work aims to incorporate adversarial training and dynamic schema adaptation.

Overall, GLiGuard demonstrates that a thoughtfully conditioned encoder can deliver industrial‑grade safety moderation at a fraction of the compute cost, opening the door for broader, real‑time deployment of trustworthy LLM services.

Authors

  • Urchade Zaratiana
  • Mary Newhauser
  • George Hurn-Maloney
  • Ash Lewis

Paper Information

  • arXiv ID: 2605.07982v1
  • Categories: cs.CL, cs.CR
  • Published: May 8, 2026