[Paper] Task-Centric Acceleration of Small-Language Models

Published: February 27, 2026 at 11:55 AM EST
4 min read
Source: arXiv - 2602.24174v1

Overview

The paper introduces TASC (Task‑Adaptive Sequence Compression), a two‑pronged framework that speeds up small language models (SLMs) without sacrificing accuracy. By expanding the tokenizer during fine‑tuning (TASC‑ft) and by using a lightweight speculative decoding technique at inference time (TASC‑spec), the authors show that SLMs can handle high‑throughput, low‑latency workloads more efficiently than before.

Key Contributions

  • TASC‑ft: An iterative fine‑tuning pipeline that augments the model’s tokenizer with the most frequent output n‑grams, then fine‑tunes the model to exploit the enlarged vocabulary.
  • TASC‑spec: A training‑free speculative decoding method that builds a task‑specific n‑gram “draft” model from the target output corpus and mixes it with the context during generation.
  • Vocabulary‑agnostic drafting: Unlike conventional speculative decoding, TASC‑spec does not require the draft and target models to share the same token set, eliminating a major engineering hurdle.
  • Empirical validation: Demonstrated consistent inference speed‑ups (up to roughly 1.5×) on several low‑output‑variability tasks (e.g., code generation, form filling) while keeping task metrics (BLEU, exact match) within 1–2 % of the baseline.
  • Open‑source reference implementation: The authors release code and pretrained tokenizers, making it easy for practitioners to plug TASC into existing pipelines.

Methodology

  1. Token Vocabulary Expansion (TASC‑ft)

    • Run the SLM on a representative dataset and collect the most frequent output n‑grams (e.g., common phrases, code snippets).
    • Add these n‑grams as new tokens to the tokenizer, effectively compressing recurring sequences into single tokens.
    • Fine‑tune the SLM on the same data while learning embeddings for the new tokens. The process repeats until marginal gains plateau.
  2. Speculative Decoding without Training (TASC‑spec)

    • Build a lightweight n‑gram language model (the “draft”) from the task’s output corpus. This model predicts the next token sequence based on recent context.
    • During generation, the draft model proposes a short chunk of tokens. The target SLM then verifies the chunk in a single forward pass; accepted tokens are kept, and at the first mismatch the SLM falls back to normal decoding.
    • Because the draft operates on raw n‑grams rather than token IDs, there is no need to align vocabularies between draft and target models.
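The vocabulary‑expansion step of TASC‑ft can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy corpus, the fixed n‑gram length, and the greedy left‑to‑right merge are all simplifying assumptions.

```python
from collections import Counter

def most_frequent_ngrams(corpus, n, top_k):
    """Count every n-token sequence in a tokenized corpus and
    return the top_k most frequent ones."""
    counts = Counter()
    for tokens in corpus:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [ngram for ngram, _ in counts.most_common(top_k)]

def compress_with_ngrams(tokens, ngram_to_id, n):
    """Greedily merge known n-grams into single (new) tokens,
    left to right; everything else passes through unchanged."""
    out, i = [], 0
    while i < len(tokens):
        chunk = tuple(tokens[i:i + n])
        if chunk in ngram_to_id:
            out.append(ngram_to_id[chunk])
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical toy corpus of already-tokenized model outputs.
corpus = [["def", "main", "(", ")", ":"],
          ["def", "main", "(", ")", ":"]]
top = most_frequent_ngrams(corpus, n=2, top_k=2)
new_vocab = {ng: f"<merged:{'+'.join(ng)}>" for ng in top}
print(compress_with_ngrams(corpus[0], new_vocab, n=2))
# → ['<merged:def+main>', '(', ')', ':']
```

In the real pipeline, each merged n‑gram becomes a new tokenizer entry with a trainable embedding, and the fine‑tune/expand cycle repeats until the compression gains plateau.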

Both components are modular: TASC‑ft improves the model itself, while TASC‑spec can be dropped in at inference time for any compatible SLM.
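The draft‑and‑verify loop of TASC‑spec can likewise be sketched as follows. This is a hedged toy version: `build_draft_model`, `speculative_step`, the chunk length, and the token‑by‑token verification are stand‑ins for the paper's method, which verifies a whole chunk in a single batched forward pass.

```python
from collections import defaultdict

def build_draft_model(corpus, context_len=2):
    """Task-specific n-gram 'draft': maps a short context to the
    token most often observed next in the output corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in corpus:
        for i in range(context_len, len(tokens)):
            ctx = tuple(tokens[i - context_len:i])
            counts[ctx][tokens[i]] += 1
    return {ctx: max(nxt, key=nxt.get) for ctx, nxt in counts.items()}

def speculative_step(context, draft, target_next, chunk=3, context_len=2):
    """Propose up to `chunk` tokens with the draft, then keep the
    longest prefix the target model agrees with."""
    proposed, ctx = [], list(context)
    for _ in range(chunk):
        key = tuple(ctx[-context_len:])
        if key not in draft:
            break  # draft has no prediction for this context
        proposed.append(draft[key])
        ctx.append(draft[key])
    accepted = []
    for tok in proposed:
        if target_next(context + accepted) != tok:
            break  # first mismatch: fall back to normal decoding
        accepted.append(tok)
    return accepted

# Toy corpus and a toy 'target model' that continues a->b->c->d.
corpus = [["a", "b", "c", "d"], ["a", "b", "c", "d"]]
draft = build_draft_model(corpus)
seq = ["a", "b", "c", "d"]
def target_next(prefix):
    return seq[len(prefix)] if len(prefix) < len(seq) else None

print(speculative_step(["a", "b"], draft, target_next))
# → ['c', 'd']
```

Because the draft operates on surface token strings rather than the target model's token IDs, the same draft can in principle serve targets with different tokenizers, which is the vocabulary‑agnostic property noted above.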

Results & Findings

| Task | Baseline (tokens/s) | TASC‑ft | TASC‑spec | Combined | Metric Δ (e.g., BLEU) |
|---|---|---|---|---|---|
| Code snippet generation | 1,200 | +12 % | +45 % | +55 % | −0.8 % |
| Form‑field filling | 1,800 | +9 % | +38 % | +48 % | −0.4 % |
| Short answer QA | 2,000 | +7 % | +30 % | +36 % | −0.2 % |

  • Speed‑up: TASC‑spec alone yields 30–45 % faster inference; when combined with TASC‑ft the gain climbs to roughly 50–55 %.
  • Quality preservation: Task performance drops by less than 1 % across all benchmarks, which is within typical variance for SLMs.
  • Scalability: Gains are more pronounced on tasks with low output variability (i.e., where the same phrases appear repeatedly), confirming the intuition behind n‑gram compression.

Practical Implications

  • Production‑grade SLM services: Companies can retrofit existing small models with TASC‑ft to reduce token count, lowering memory footprints and enabling higher batch sizes on the same hardware.
  • Edge deployment: The vocabulary expansion means fewer inference steps, which is valuable for on‑device applications (e.g., autocomplete on mobile keyboards).
  • Zero‑training acceleration: TASC‑spec can be added to any deployed SLM without retraining, offering an immediate latency reduction for latency‑sensitive APIs (e.g., chat assistants, real‑time code suggestions).
  • Cost savings: Faster inference translates directly into lower GPU/CPU utilization, cutting operational expenses for high‑throughput services.
  • Simplified pipelines: Because TASC‑spec sidesteps draft‑target vocabulary alignment, developers avoid the engineering overhead of maintaining parallel tokenizers.

Limitations & Future Work

  • Task dependency: The methods excel on low‑output‑variability tasks; highly creative generation (e.g., story writing) sees limited speed‑up.
  • Vocabulary bloat risk: Aggressive token expansion can inflate the tokenizer size, potentially offsetting memory gains if not carefully tuned.
  • Speculative draft quality: The n‑gram draft model is simple; more sophisticated drafts (e.g., lightweight transformer drafts) could push speed‑ups further but would add complexity.
  • Broader evaluation: Future work could explore TASC on multilingual SLMs, larger model families, and integration with other acceleration techniques like quantization or pruning.

Overall, TASC offers a pragmatic, developer‑friendly route to make small language models faster and cheaper, opening the door for wider adoption in real‑world, latency‑critical applications.

Authors

  • Dor Tsur
  • Sharon Adar
  • Ran Levy

Paper Information

  • arXiv ID: 2602.24174v1
  • Categories: cs.CL, cs.AI, cs.IT
  • Published: February 27, 2026