[Paper] Task-Centric Acceleration of Small-Language Models
Source: arXiv - 2602.24174v1
Overview
The paper introduces TASC (Task‑Adaptive Sequence Compression), a two‑pronged framework that speeds up small language models (SLMs) without sacrificing accuracy. By expanding the tokenizer during fine‑tuning (TASC‑ft) and by using a lightweight speculative decoding technique at inference time (TASC‑spec), the authors show that SLMs can handle high‑throughput, low‑latency workloads more efficiently than before.
Key Contributions
- TASC‑ft: An iterative fine‑tuning pipeline that augments the model’s tokenizer with the most frequent output n‑grams, then fine‑tunes the model to exploit the enlarged vocabulary.
- TASC‑spec: A training‑free speculative decoding method that builds a task‑specific n‑gram “draft” model from the target output corpus and mixes it with the context during generation.
- Vocabulary‑agnostic drafting: Unlike conventional speculative decoding, TASC‑spec does not require the draft and target models to share the same token set, eliminating a major engineering hurdle.
- Empirical validation: Demonstrated consistent inference speed‑ups (up to ~2×) on several low‑output‑variability tasks (e.g., code generation, form filling) while keeping task metrics (BLEU, exact match) within about 1 % of the baseline.
- Open‑source reference implementation: The authors release code and pretrained tokenizers, making it easy for practitioners to plug TASC into existing pipelines.
Methodology
Token Vocabulary Expansion (TASC‑ft)
- Run the SLM on a representative dataset and collect the most frequent output n‑grams (e.g., common phrases, code snippets).
- Add these n‑grams as new tokens to the tokenizer, effectively compressing recurring sequences into single tokens.
- Fine‑tune the SLM on the same data while learning embeddings for the new tokens. The process repeats until marginal gains plateau.
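A minimal Python sketch of this expansion loop is shown below, assuming a Hugging Face‑style tokenizer and causal LM. The model name, the placeholder corpus, and the `collect_frequent_ngrams` helper are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch of the TASC-ft vocabulary-expansion step, not the authors' code.
# Assumes a Hugging Face-style SLM; the model name and corpus below are placeholders.
MODEL_NAME = "gpt2"  # stand-in for any small causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def collect_frequent_ngrams(outputs, n=2, top_k=500):
    """Count word-level n-grams over representative task outputs."""
    counts = Counter()
    for text in outputs:
        words = text.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [ngram for ngram, _ in counts.most_common(top_k)]

# 1) Mine recurring phrases from representative task outputs (placeholder data).
task_outputs = [
    "return the sum of the values",
    "return the length of the list",
]
new_tokens = collect_frequent_ngrams(task_outputs)

# 2) Register each frequent n-gram as a single token and grow the embedding matrix.
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# 3) Fine-tune on the same data so the new embeddings are learned, then repeat
#    steps 1-3 until the marginal speed-up plateaus (fine-tuning loop omitted).
```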
Speculative Decoding without Training (TASC‑spec)
- Build a lightweight n‑gram language model (the “draft”) from the task’s output corpus. This model predicts the next token sequence based on recent context.
- During generation, the draft model proposes a short chunk of candidate tokens. The target SLM verifies the chunk in a single forward pass; matching tokens are accepted, and on a mismatch the SLM falls back to normal step‑by‑step decoding.
- Because the draft operates on raw n‑grams rather than token IDs, there is no need to align vocabularies between draft and target models.
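The following self‑contained Python sketch illustrates the draft‑and‑verify loop under greedy decoding. The toy corpus, the bigram draft table, and `target_next_word` (a stand‑in for one decoding step of the target SLM) are assumptions for illustration, not the paper's implementation; a production verifier would score the whole drafted chunk in one forward pass, and because the draft works on raw text, no tokenizer alignment is required.

```python
from collections import defaultdict

# Illustrative sketch of a TASC-spec-style draft-and-verify loop (greedy decoding).
# The toy corpus and `target_next_word` are stand-ins for the task output corpus
# and the target SLM; a real verifier scores the whole chunk in ONE forward pass.

corpus = [
    "return the sum of the values",
    "return the length of the list",
    "return the sum of the items",
]

# 1) Build a lightweight n-gram (here: bigram) draft model from task outputs.
draft = defaultdict(dict)
for line in corpus:
    words = line.split()
    for prev, nxt in zip(words, words[1:]):
        draft[prev][nxt] = draft[prev].get(nxt, 0) + 1

def propose(prev_word, k=3):
    """Draft up to k words by following the most frequent bigram continuations."""
    chunk = []
    for _ in range(k):
        followers = draft.get(prev_word)
        if not followers:
            break
        prev_word = max(followers, key=followers.get)
        chunk.append(prev_word)
    return chunk

# Stand-in target: pretend the SLM greedily continues a fixed reference answer.
REFERENCE = "return the sum of the values <eos>".split()

def target_next_word(prefix):
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else "<eos>"

def generate(prompt, max_words=12):
    out = prompt.split()
    while len(out) < max_words and out[-1] != "<eos>":
        chunk = propose(out[-1])
        # 2) Verify the draft: accept the longest prefix the target agrees with.
        accepted = 0
        for word in chunk:
            if target_next_word(out) == word:
                out.append(word)
                accepted += 1
            else:
                break
        if accepted == 0:
            # 3) Draft empty or rejected outright: take one normal decoding step.
            out.append(target_next_word(out))
    return " ".join(out)

print(generate("return the"))  # -> "return the sum of the values <eos>"
```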
Both components are modular: TASC‑ft improves the model itself, while TASC‑spec can be dropped in at inference time for any compatible SLM.
Results & Findings
| Task | Baseline (tokens/s) | TASC‑ft | TASC‑spec | Combined | Metric Δ (e.g., BLEU) |
|---|---|---|---|---|---|
| Code snippet generation | 1,200 | +12 % | +45 % | +55 % | –0.8 % |
| Form‑field filling | 1,800 | +9 % | +38 % | +48 % | –0.4 % |
| Short answer QA | 2,000 | +7 % | +30 % | +36 % | –0.2 % |
- Speed‑up: TASC‑spec alone yields 30–45 % faster inference; when combined with TASC‑ft the gain climbs to 36–55 % depending on the task.
- Quality preservation: Task performance drops by less than 1 % across all benchmarks, which is within typical variance for SLMs.
- Scalability: Gains are more pronounced on tasks with low output variability (i.e., where the same phrases appear repeatedly), confirming the intuition behind n‑gram compression.
Practical Implications
- Production‑grade SLM services: Companies can retrofit existing small models with TASC‑ft to reduce token count, lowering memory footprints and enabling higher batch sizes on the same hardware.
- Edge deployment: The vocabulary expansion means fewer inference steps, which is valuable for on‑device applications (e.g., autocomplete on mobile keyboards).
- Zero‑training acceleration: TASC‑spec can be added to any deployed SLM without retraining, offering an immediate latency reduction for latency‑sensitive APIs (e.g., chat assistants, real‑time code suggestions).
- Cost savings: Faster inference translates directly into lower GPU/CPU utilization, cutting operational expenses for high‑throughput services.
- Simplified pipelines: Because TASC‑spec sidesteps draft‑target vocabulary alignment, developers avoid the engineering overhead of maintaining parallel tokenizers.
Limitations & Future Work
- Task dependency: The methods excel on low‑output‑variability tasks; highly creative generation (e.g., story writing) sees limited speed‑up.
- Vocabulary bloat risk: Aggressive token expansion can inflate the tokenizer size, potentially offsetting memory gains if not carefully tuned.
- Speculative draft quality: The n‑gram draft model is simple; more sophisticated drafts (e.g., lightweight transformer drafts) could push speed‑ups further but would add complexity.
- Broader evaluation: Future work could explore TASC on multilingual SLMs, larger model families, and integration with other acceleration techniques like quantization or pruning.
Overall, TASC offers a pragmatic, developer‑friendly route to make small language models faster and cheaper, opening the door for wider adoption in real‑world, latency‑critical applications.
Authors
- Dor Tsur
- Sharon Adar
- Ran Levy
Paper Information
- arXiv ID: 2602.24174v1
- Categories: cs.CL, cs.AI, cs.IT
- Published: February 27, 2026