[Paper] Task-Centric Acceleration of Small-Language Models
Source: arXiv - 2602.24174v1
Overview
The paper introduces TASC (Task‑Adaptive Sequence Compression), a two‑pronged framework that speeds up small language models (SLMs) without sacrificing accuracy. By expanding the tokenizer during fine‑tuning (TASC‑ft) and by using a lightweight speculative decoding technique at inference time (TASC‑spec), the authors show that SLMs can handle high‑throughput, low‑latency workloads more efficiently than before.
Key Contributions
- TASC‑ft: An iterative fine‑tuning pipeline that augments the model’s tokenizer with the most frequent output n‑grams, then fine‑tunes the model to exploit the enlarged vocabulary.
- TASC‑spec: A training‑free speculative decoding method that builds a task‑specific n‑gram “draft” model from the target output corpus and mixes it with the context during generation.
- Vocabulary‑agnostic drafting: Unlike conventional speculative decoding, TASC‑spec does not require the draft and target models to share the same token set, eliminating a major engineering hurdle.
- Empirical validation: Demonstrated consistent inference speed‑ups (up to ~2×) on several low‑output‑variability tasks (e.g., code generation, form filling) while keeping task metrics (BLEU, exact match) within about 1 % of the baseline.
- Open‑source reference implementation: The authors release code and pretrained tokenizers, making it easy for practitioners to plug TASC into existing pipelines.
Methodology
Token Vocabulary Expansion (TASC‑ft)
- Run the SLM on a representative dataset and collect the most frequent output n‑grams (e.g., common phrases, code snippets).
- Add these n‑grams as new tokens to the tokenizer, effectively compressing recurring sequences into single tokens.
- Fine‑tune the SLM on the same data while learning embeddings for the new tokens. The process repeats until marginal gains plateau.
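A minimal Python sketch of this expansion loop is shown below, assuming a Hugging Face‑style tokenizer and causal LM. The model name, the placeholder corpus, and the `collect_frequent_ngrams` helper are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch of the TASC-ft vocabulary-expansion step, not the authors' code.
# Assumes a Hugging Face-style SLM; the model name and corpus below are placeholders.
MODEL_NAME = "gpt2"  # stand-in for any small causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def collect_frequent_ngrams(outputs, n=2, top_k=500):
    """Count word-level n-grams over representative task outputs."""
    counts = Counter()
    for text in outputs:
        words = text.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [ngram for ngram, _ in counts.most_common(top_k)]

# 1) Mine recurring phrases from representative task outputs (placeholder data).
task_outputs = [
    "return the sum of the values",
    "return the length of the list",
]
new_tokens = collect_frequent_ngrams(task_outputs)

# 2) Register each frequent n-gram as a single token and grow the embedding matrix.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# 3) Fine-tune on the same data so the new embeddings are learned, then repeat
#    steps 1-3 until the marginal speed-up plateaus (fine-tuning loop omitted).
```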
Speculative Decoding without Training (TASC‑spec)
- Build a lightweight n‑gram language model (the “draft”) from the task’s output corpus. This model predicts the next token sequence based on recent context.
- During generation, the draft model proposes a short chunk of candidate tokens. The target SLM verifies the chunk in a single forward pass; matching tokens are accepted, and on a mismatch the SLM falls back to normal step‑by‑step decoding.
- Because the draft operates on raw n‑grams rather than token IDs, there is no need to align vocabularies between draft and target models.
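The following self‑contained Python sketch illustrates the draft‑and‑verify loop under greedy decoding. The toy corpus, the bigram draft table, and `target_next_word` (a stand‑in for one decoding step of the target SLM) are assumptions for illustration, not the paper's implementation; a production verifier would score the whole drafted chunk in one forward pass, and because the draft works on raw text, no tokenizer alignment is required.

```python
from collections import defaultdict

# Illustrative sketch of a TASC-spec-style draft-and-verify loop (greedy decoding).
# The toy corpus and `target_next_word` are stand-ins for the task output corpus
# and the target SLM; a real verifier scores the whole chunk in ONE forward pass.

corpus = [
    "return the sum of the values",
    "return the length of the list",
    "return the sum of the items",
]

# 1) Build a lightweight n-gram (here: bigram) draft model from task outputs.
draft = defaultdict(dict)
for line in corpus:
    words = line.split()
    for prev, nxt in zip(words, words[1:]):
        draft[prev][nxt] = draft[prev].get(nxt, 0) + 1

def propose(prev_word, k=3):
    """Draft up to k words by following the most frequent bigram continuations."""
    chunk = []
    for _ in range(k):
        followers = draft.get(prev_word)
        if not followers:
            break
        prev_word = max(followers, key=followers.get)
        chunk.append(prev_word)
    return chunk

# Stand-in target: pretend the SLM greedily continues a fixed reference answer.
REFERENCE = "return the sum of the values <eos>".split()

def target_next_word(prefix):
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else "<eos>"

def generate(prompt, max_words=12):
    out = prompt.split()
    while len(out) < max_words and out[-1] != "<eos>":
        chunk = propose(out[-1])
        # 2) Verify the draft: accept the longest prefix the target agrees with.
        accepted = 0
        for word in chunk:
            if target_next_word(out) == word:
                out.append(word)
                accepted += 1
            else:
                break
        if accepted == 0:
            # 3) Draft empty or rejected outright: take one normal decoding step.
            out.append(target_next_word(out))
    return " ".join(out)

print(generate("return the"))  # -> "return the sum of the values <eos>"
```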
Both components are modular: TASC‑ft improves the model itself, while TASC‑spec can be dropped in at inference time for any compatible SLM.
Results & Findings
| Task | Baseline (tokens/s) | TASC‑ft | TASC‑spec | Combined | Metric Δ (e.g., BLEU) |
|---|---|---|---|---|---|
| Code snippet generation | 1,200 | +12 % | +45 % | +55 % | –0.8 % |
| Form‑field filling | 1,800 | +9 % | +38 % | +48 % | –0.4 % |
| Short answer QA | 2,000 | +7 % | +30 % | +36 % | –0.2 % |
- Speed‑up: TASC‑spec alone yields 30–45 % faster inference; when combined with TASC‑ft the gain climbs to 36–55 % depending on the task.
- Quality preservation: Task performance drops by less than 1 % across all benchmarks, which is within typical variance for SLMs.
- Scalability: Gains are more pronounced on tasks with low output variability (i.e., where the same phrases appear repeatedly), confirming the intuition behind n‑gram compression.
Practical Implications
- Production‑grade SLM services: Companies can retrofit existing small models with TASC‑ft to reduce token count, lowering memory footprints and enabling higher batch sizes on the same hardware.
- Edge deployment: The vocabulary expansion means fewer inference steps, which is valuable for on‑device applications (e.g., autocomplete on mobile keyboards).
- Zero‑training acceleration: TASC‑spec can be added to any deployed SLM without retraining, offering an immediate latency reduction for latency‑sensitive APIs (e.g., chat assistants, real‑time code suggestions).
- Cost savings: Faster inference translates directly into lower GPU/CPU utilization, cutting operational expenses for high‑throughput services.
- Simplified pipelines: Because TASC‑spec sidesteps draft‑target vocabulary alignment, developers avoid the engineering overhead of maintaining parallel tokenizers.
Limitations & Future Work
- Task dependency: The methods excel on low‑output‑variability tasks; highly creative generation (e.g., story writing) sees limited speed‑up.
- Vocabulary bloat risk: Aggressive token expansion can inflate the tokenizer size, potentially offsetting memory gains if not carefully tuned.
- Speculative draft quality: The n‑gram draft model is simple; more sophisticated drafts (e.g., lightweight transformer drafts) could push speed‑ups further but would add complexity.
- Broader evaluation: Future work could explore TASC on multilingual SLMs, larger model families, and integration with other acceleration techniques like quantization or pruning.
Overall, TASC offers a pragmatic, developer‑friendly route to make small language models faster and cheaper, opening the door for wider adoption in real‑world, latency‑critical applications.
Authors
- Dor Tsur
- Sharon Adar
- Ran Levy
Paper Information
- arXiv ID: 2602.24174v1
- Categories: cs.CL, cs.AI, cs.IT
- Published: February 27, 2026