[Paper] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Published: March 5, 2026, 01:25 AM EST
Source: arXiv

Overview

The paper introduces FireBench, a new benchmark that measures how well large language models (LLMs) follow precise instructions in real‑world enterprise and API‑driven contexts. Unlike existing chat‑assistant benchmarks that focus on free‑form conversation, FireBench zeroes in on the strict formatting, content, and procedural constraints that businesses need for reliable automation.

Key Contributions

  • Enterprise‑focused benchmark: 2,400+ curated samples drawn from actual information‑extraction pipelines, customer‑support ticket handling, and code‑generation agents.
  • Six capability dimensions: evaluates format compliance, content fidelity, safety constraints, multi‑step reasoning, API call correctness, and error‑handling behavior.
  • Broad model coverage: systematic evaluation of 11 publicly available LLMs (including open‑source and commercial offerings).
  • Open‑source release: benchmark data, evaluation scripts, and a public leaderboard are available at fire‑bench.com.
  • Diagnostic insights: detailed analysis of failure modes that are specific to enterprise workflows (e.g., mismatched JSON schemas, omitted required fields).
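The diagnostic failure modes above (mismatched JSON schemas, omitted required fields) can be illustrated with a minimal contract check. The ticket fields below are hypothetical, not taken from the benchmark:

```python
import json

# Hypothetical ticket-extraction contract (illustrative only): the model must
# return exactly these keys, with no extras and none omitted.
REQUIRED_KEYS = {"ticket_id", "priority", "summary"}

def check_contract(model_output: str) -> list[str]:
    """Return a list of contract violations for a model's JSON response."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    keys = set(data)
    errors = [f"missing required field: {k}" for k in sorted(REQUIRED_KEYS - keys)]
    errors += [f"unexpected extra field: {k}" for k in sorted(keys - REQUIRED_KEYS)]
    return errors

# A response that omits a required field and hallucinates an extra one:
bad = '{"ticket_id": "T-42", "priority": "high", "customer_mood": "angry"}'
print(check_contract(bad))
```

Either kind of violation would break a downstream pipeline that consumes the JSON blindly, which is why FireBench treats them as hard failures rather than stylistic quirks.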

Methodology

  1. Use‑case selection – The authors surveyed internal tooling and public APIs used by enterprises (e.g., ticketing systems, data‑extraction services, code‑review bots) and distilled six representative tasks.
  2. Prompt design – For each task, a clear instruction template was written, specifying the exact output schema (often JSON or YAML) and any business rules (e.g., “only return fields that are present in the source document”).
  3. Sample generation – Realistic input instances were collected from open datasets or anonymized logs, then manually verified to ensure they reflect production‑level complexity.
  4. Scoring pipeline – Automatic parsers check structural compliance (schema validation), content correctness (exact match or fuzzy metrics), and procedural adherence (presence of required steps). Human raters audit a subset for safety and nuanced reasoning errors.
  5. Model roster – The benchmark was run on 11 LLMs ranging from 7B‑parameter open models to proprietary 175B‑parameter services, using the same temperature‑0 (deterministic) setting to isolate instruction‑following ability.
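Step 4's automatic checks can be sketched as a small scoring function. This is a simplified illustration of the pipeline's structure, with made-up field names; the paper's actual scorer also covers procedural adherence and fuzzy content metrics:

```python
import json

def score_sample(response: str, expected: dict) -> dict:
    """Score one benchmark sample on two automatic dimensions:
    structural compliance (does the output parse, with exactly the
    expected keys?) and content correctness (exact match of values).
    Field names and the all-or-nothing format score are illustrative,
    not the authors' exact rubric."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return {"format": 0.0, "content": 0.0}
    if set(data) != set(expected):
        return {"format": 0.0, "content": 0.0}
    matches = sum(data[k] == v for k, v in expected.items())
    return {"format": 1.0, "content": matches / len(expected)}

expected = {"invoice_id": "INV-7", "total": "199.00"}
print(score_sample('{"invoice_id": "INV-7", "total": "200.00"}', expected))
```

Gating content scoring on structural compliance mirrors production reality: a response with the wrong schema fails the pipeline regardless of how accurate its values are.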

Results & Findings

| Model (size) | Avg. compliance score* | Best dimension | Weakest dimension |
| --- | --- | --- | --- |
| Open‑source 7B | 48% | Format compliance | Multi‑step reasoning |
| Open‑source 13B | 55% | Content fidelity | API call correctness |
| Proprietary 70B (e.g., Claude) | 78% | Safety constraints | Error handling |
| Proprietary 175B (e.g., GPT‑4) | 84% | Overall balance | Strict JSON nesting (slight drop) |

*Composite of the six dimensions, each weighted equally.
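The equal-weight composite is just the unweighted mean over the six dimensions. A quick sketch, using illustrative per-dimension scores (the key names below are shorthand, not the paper's identifiers):

```python
# The six capability dimensions, equally weighted in the composite score.
DIMENSIONS = ["format", "content", "safety", "reasoning", "api_calls", "error_handling"]

def composite(scores: dict[str, float]) -> float:
    """Equal-weight average over the six capability dimensions."""
    assert set(scores) == set(DIMENSIONS), "one score per dimension required"
    return sum(scores.values()) / len(DIMENSIONS)

# Illustrative per-dimension scores for a single model:
example = {"format": 0.90, "content": 0.85, "safety": 0.95,
           "reasoning": 0.70, "api_calls": 0.80, "error_handling": 0.60}
print(round(composite(example), 3))
```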

  • Higher‑capacity models consistently outperform smaller ones, but the gap narrows when the task emphasizes strict schema validation.
  • Open‑source models excel at simple extraction but often hallucinate extra fields or omit required keys, leading to downstream pipeline failures.
  • Safety and policy compliance are strong across the board for the commercial APIs, reflecting their built‑in guardrails.
  • Error‑handling behavior (e.g., returning a helpful error message when input is malformed) is the most under‑addressed capability, even for the best‑performing models.
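The error-handling gap noted above can be probed with a simple heuristic: given malformed input, does the model return a structured error report or a fabricated answer? The `{"error": ...}` shape below is a hypothetical convention for illustration, not the paper's exact protocol:

```python
import json

def is_graceful_error(response: str) -> bool:
    """Heuristic check for error-handling behavior: a well-behaved model
    fed malformed input should emit a structured, non-empty error report
    (here, a JSON object with an "error" key) rather than guessing."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("error"))

# Helpful error report vs. a hallucinated answer to unparseable input:
print(is_graceful_error('{"error": "input is not valid CSV: unterminated quote"}'))
print(is_graceful_error('{"rows": [["guessed", "values"]]}'))
```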

Practical Implications

  • Tooling developers can use FireBench as a sanity check before integrating an LLM into a ticket‑routing bot or a data pipeline, ensuring the model respects the exact JSON contract required by downstream services.
  • Enterprises can benchmark multiple providers side‑by‑side, making cost‑vs‑reliability decisions based on concrete compliance numbers rather than anecdotal performance.
  • Model vendors gain a diagnostic lens to prioritize improvements—e.g., tightening schema‑adherence logic or adding explicit “error‑report” tokens to prompts.
  • The open‑source community can target the identified weak spots (multi‑step reasoning, error handling) by fine‑tuning on FireBench‑style data, potentially closing the gap with commercial offerings.
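A cost‑vs‑reliability decision of the kind described above might look like the sketch below. The compliance scores echo the results table; the per-call prices and provider names are made-up assumptions for illustration:

```python
# Hypothetical provider roster: compliance scores echo the results table,
# while names and prices are illustrative assumptions.
PROVIDERS = [
    {"name": "open-7b",   "compliance": 0.48, "usd_per_1k_calls": 0.05},
    {"name": "open-13b",  "compliance": 0.55, "usd_per_1k_calls": 0.09},
    {"name": "prop-70b",  "compliance": 0.78, "usd_per_1k_calls": 1.20},
    {"name": "prop-175b", "compliance": 0.84, "usd_per_1k_calls": 3.00},
]

def cheapest_meeting(threshold: float):
    """Cheapest provider whose composite compliance clears the threshold,
    or None if no provider qualifies."""
    ok = [p for p in PROVIDERS if p["compliance"] >= threshold]
    return min(ok, key=lambda p: p["usd_per_1k_calls"])["name"] if ok else None

print(cheapest_meeting(0.75))
```

Framing provider choice as "cheapest model that clears a compliance bar" turns the benchmark's numbers directly into a procurement decision.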

Limitations & Future Work

  • Domain coverage: While the benchmark spans several common enterprise scenarios, it does not yet include highly regulated domains such as finance or healthcare, where compliance constraints are even stricter.
  • Prompt diversity: All prompts were hand‑crafted; exploring automatically generated variations could reveal additional failure modes.
  • Dynamic APIs: FireBench evaluates static input‑output pairs; future extensions could incorporate live API calls to test real‑time latency and stateful interactions.
  • Model diversity: The study focuses on 11 models; expanding to newer open‑source releases and emerging multimodal LLMs would broaden the relevance.

FireBench opens the door for systematic, production‑oriented evaluation of LLM instruction following—an essential step toward trustworthy AI‑augmented enterprise workflows.

Authors

  • Yunfan Zhang
  • Yijie Bei
  • Jetashree Ravi
  • Pawel Garbacki

Paper Information

  • arXiv ID: 2603.04857v1
  • Categories: cs.CL, cs.SE
  • Published: March 5, 2026