[Paper] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Published: March 5, 2026, 01:25 AM EST
Source: arXiv

Overview

The paper introduces FireBench, a new benchmark that measures how well large language models (LLMs) follow precise instructions in real‑world enterprise and API‑driven contexts. Unlike existing chat‑assistant benchmarks that focus on free‑form conversation, FireBench zeroes in on the strict formatting, content, and procedural constraints that businesses need for reliable automation.

Key Contributions

  • Enterprise‑focused benchmark: 2,400+ curated samples drawn from actual information‑extraction pipelines, customer‑support ticket handling, and code‑generation agents.
  • Six capability dimensions: evaluates format compliance, content fidelity, safety constraints, multi‑step reasoning, API call correctness, and error‑handling behavior.
  • Broad model coverage: systematic evaluation of 11 publicly available LLMs (including open‑source and commercial offerings).
  • Open‑source release: benchmark data, evaluation scripts, and a public leaderboard are available at fire‑bench.com.
  • Diagnostic insights: detailed analysis of failure modes that are specific to enterprise workflows (e.g., mismatched JSON schemas, omitted required fields).
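The diagnostic failure modes above (mismatched JSON schemas, omitted required fields) can be illustrated with a minimal contract check. The ticket fields below are hypothetical, not taken from the benchmark:

```python
import json

# Hypothetical ticket-extraction contract (illustrative only): the model must
# return exactly these keys, with no extras and none omitted.
REQUIRED_KEYS = {"ticket_id", "priority", "summary"}

def check_contract(model_output: str) -> list[str]:
    """Return a list of contract violations for a model's JSON response."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    keys = set(data)
    errors = [f"missing required field: {k}" for k in sorted(REQUIRED_KEYS - keys)]
    errors += [f"unexpected extra field: {k}" for k in sorted(keys - REQUIRED_KEYS)]
    return errors

# A response that omits a required field and hallucinates an extra one:
bad = '{"ticket_id": "T-42", "priority": "high", "customer_mood": "angry"}'
print(check_contract(bad))
```

Either kind of violation would break a downstream pipeline that consumes the JSON blindly, which is why FireBench treats them as hard failures rather than stylistic quirks.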

Methodology

  1. Use‑case selection – The authors surveyed internal tooling and public APIs used by enterprises (e.g., ticketing systems, data‑extraction services, code‑review bots) and distilled six representative tasks.
  2. Prompt design – For each task, a clear instruction template was written, specifying the exact output schema (often JSON or YAML) and any business rules (e.g., “only return fields that are present in the source document”).
  3. Sample generation – Realistic input instances were collected from open datasets or anonymized logs, then manually verified to ensure they reflect production‑level complexity.
  4. Scoring pipeline – Automatic parsers check structural compliance (schema validation), content correctness (exact match or fuzzy metrics), and procedural adherence (presence of required steps). Human raters audit a subset for safety and nuanced reasoning errors.
  5. Model roster – The benchmark was run on 11 LLMs ranging from 7B‑parameter open models to proprietary 175B‑parameter services, using the same temperature‑0 (deterministic) setting to isolate instruction‑following ability.
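Step 4's automatic checks can be sketched as a small scoring function. This is a simplified illustration of the pipeline's structure, with made-up field names; the paper's actual scorer also covers procedural adherence and fuzzy content metrics:

```python
import json

def score_sample(response: str, expected: dict) -> dict:
    """Score one benchmark sample on two automatic dimensions:
    structural compliance (does the output parse, with exactly the
    expected keys?) and content correctness (exact match of values).
    Field names and the all-or-nothing format score are illustrative,
    not the authors' exact rubric."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return {"format": 0.0, "content": 0.0}
    if set(data) != set(expected):
        return {"format": 0.0, "content": 0.0}
    matches = sum(data[k] == v for k, v in expected.items())
    return {"format": 1.0, "content": matches / len(expected)}

expected = {"invoice_id": "INV-7", "total": "199.00"}
print(score_sample('{"invoice_id": "INV-7", "total": "200.00"}', expected))
```

Gating content scoring on structural compliance mirrors production reality: a response with the wrong schema fails the pipeline regardless of how accurate its values are.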

Results & Findings

| Model (size) | Avg. compliance score* | Best dimension | Weakest dimension |
| --- | --- | --- | --- |
| Open‑source 7B | 48% | Format compliance | Multi‑step reasoning |
| Open‑source 13B | 55% | Content fidelity | API call correctness |
| Proprietary 70B (e.g., Claude) | 78% | Safety constraints | Error handling |
| Proprietary 175B (e.g., GPT‑4) | 84% | Overall balance | Strict JSON nesting (slight drop) |

*Composite of the six dimensions, each weighted equally.
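The equal-weight composite is just the unweighted mean over the six dimensions. A quick sketch, using illustrative per-dimension scores (the key names below are shorthand, not the paper's identifiers):

```python
# The six capability dimensions, equally weighted in the composite score.
DIMENSIONS = ["format", "content", "safety", "reasoning", "api_calls", "error_handling"]

def composite(scores: dict[str, float]) -> float:
    """Equal-weight average over the six capability dimensions."""
    assert set(scores) == set(DIMENSIONS), "one score per dimension required"
    return sum(scores.values()) / len(DIMENSIONS)

# Illustrative per-dimension scores for a single model:
example = {"format": 0.90, "content": 0.85, "safety": 0.95,
           "reasoning": 0.70, "api_calls": 0.80, "error_handling": 0.60}
print(round(composite(example), 3))
```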

  • Higher‑capacity models consistently outperform smaller ones, but the gap narrows when the task emphasizes strict schema validation.
  • Open‑source models excel at simple extraction but often hallucinate extra fields or omit required keys, leading to downstream pipeline failures.
  • Safety and policy compliance are strong across the board for the commercial APIs, reflecting their built‑in guardrails.
  • Error‑handling behavior (e.g., returning a helpful error message when input is malformed) is the most under‑addressed capability, even for the best‑performing models.
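The error-handling gap noted above can be probed with a simple heuristic: given malformed input, does the model return a structured error report or a fabricated answer? The `{"error": ...}` shape below is a hypothetical convention for illustration, not the paper's exact protocol:

```python
import json

def is_graceful_error(response: str) -> bool:
    """Heuristic check for error-handling behavior: a well-behaved model
    fed malformed input should emit a structured, non-empty error report
    (here, a JSON object with an "error" key) rather than guessing."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("error"))

# Helpful error report vs. a hallucinated answer to unparseable input:
print(is_graceful_error('{"error": "input is not valid CSV: unterminated quote"}'))
print(is_graceful_error('{"rows": [["guessed", "values"]]}'))
```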

Practical Implications

  • Tooling developers can use FireBench as a sanity check before integrating an LLM into a ticket‑routing bot or a data pipeline, ensuring the model respects the exact JSON contract required by downstream services.
  • Enterprises can benchmark multiple providers side‑by‑side, making cost‑vs‑reliability decisions based on concrete compliance numbers rather than anecdotal performance.
  • Model vendors gain a diagnostic lens to prioritize improvements—e.g., tightening schema‑adherence logic or adding explicit “error‑report” tokens to prompts.
  • The open‑source community can target the identified weak spots (multi‑step reasoning, error handling) by fine‑tuning on FireBench‑style data, potentially closing the gap with commercial offerings.
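A cost‑vs‑reliability decision of the kind described above might look like the sketch below. The compliance scores echo the results table; the per-call prices and provider names are made-up assumptions for illustration:

```python
# Hypothetical provider roster: compliance scores echo the results table,
# while names and prices are illustrative assumptions.
PROVIDERS = [
    {"name": "open-7b",   "compliance": 0.48, "usd_per_1k_calls": 0.05},
    {"name": "open-13b",  "compliance": 0.55, "usd_per_1k_calls": 0.09},
    {"name": "prop-70b",  "compliance": 0.78, "usd_per_1k_calls": 1.20},
    {"name": "prop-175b", "compliance": 0.84, "usd_per_1k_calls": 3.00},
]

def cheapest_meeting(threshold: float):
    """Cheapest provider whose composite compliance clears the threshold,
    or None if no provider qualifies."""
    ok = [p for p in PROVIDERS if p["compliance"] >= threshold]
    return min(ok, key=lambda p: p["usd_per_1k_calls"])["name"] if ok else None

print(cheapest_meeting(0.75))
```

Framing provider choice as "cheapest model that clears a compliance bar" turns the benchmark's numbers directly into a procurement decision.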

Limitations & Future Work

  • Domain coverage: While the benchmark spans several common enterprise scenarios, it does not yet include highly regulated domains such as finance or healthcare, where compliance constraints are even stricter.
  • Prompt diversity: All prompts were hand‑crafted; exploring automatically generated variations could reveal additional failure modes.
  • Dynamic APIs: FireBench evaluates static input‑output pairs; future extensions could incorporate live API calls to test real‑time latency and stateful interactions.
  • Model diversity: The study focuses on 11 models; expanding to newer open‑source releases and emerging multimodal LLMs would broaden the relevance.

FireBench opens the door for systematic, production‑oriented evaluation of LLM instruction following—an essential step toward trustworthy AI‑augmented enterprise workflows.

Authors

  • Yunfan Zhang
  • Yijie Bei
  • Jetashree Ravi
  • Pawel Garbacki

Paper Information

  • arXiv ID: 2603.04857v1
  • Categories: cs.CL, cs.SE
  • Published: March 5, 2026