[Paper] SkillFactory: Self-Distillation For Learning Cognitive Behaviors

Published: December 3, 2025 at 01:54 PM EST
4 min read
Source: arXiv - 2512.04072v1

Overview

The paper introduces SkillFactory, a lightweight self‑distillation technique that teaches large language models (LLMs) to use higher‑order reasoning “cognitive skills” (e.g., verification, backtracking, retrying) before they undergo reinforcement learning from human feedback (RLHF). By reshuffling the model’s own generated traces into skill‑specific training examples, the authors show that even a modestly performing base model can acquire useful inductive biases that pay off during later RL fine‑tuning.

Key Contributions

  • Self‑distillation without a stronger teacher: Generates “silver‑quality” SFT data by re‑formatting the model’s own outputs into skill‑oriented demonstrations.
  • Skill‑aware SFT stage: Introduces a dedicated supervised fine‑tuning (SFT) phase that explicitly primes the model to recognize and apply reasoning skills.
  • Empirical gains after RL: Models initialized with SkillFactory SFT outperform standard RL‑fine‑tuned baselines on harder variants of the same task, despite being weaker before RL.
  • Robustness to out‑of‑domain regression: RL‑trained SkillFactory models retain performance better on unseen domains compared with RL‑trained models that lacked the skill‑aware pre‑training.
  • Evidence of skill usage: Diagnostic probes confirm that the final models actually invoke verification, backtracking, and retry strategies during inference.

Methodology

  1. Generate raw traces: Run a base LLM on a set of training prompts, collecting its step‑by‑step reasoning chains (the “raw” outputs).
  2. Skill extraction & re‑ordering: Automatically detect segments that correspond to known cognitive skills (e.g., a line that checks an answer). The pipeline then rearranges these segments into a clean, skill‑labeled format:
    • Prompt → Reasoning → Verification → Revised answer
  3. Silver SFT dataset: The re‑ordered traces become the supervision signal for a short supervised fine‑tuning run. Because the data come from the model itself, they are “silver” (noisy) rather than gold‑standard human annotations, but they still embed the desired skill patterns (a minimal code sketch of this re‑formatting step follows the list).
  4. RL fine‑tuning: After the SkillFactory SFT, the model is further optimized with standard RLHF (or any RL objective). The earlier skill‑aware initialization gives the RL stage a useful inductive bias, making it easier for the policy to discover and amplify the skills.
  5. Evaluation: The authors compare three pipelines: (a) vanilla SFT → RL, (b) SkillFactory SFT → RL, and (c) no RL. They test on both the original task and harder, out‑of‑distribution variants.
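
To make steps 1–3 concrete, here is a minimal Python sketch of the trace re‑formatting idea. It is not the authors' released code: the regex skill markers, section headers, and helper names are illustrative assumptions, and the real pipeline's skill detection is presumably more sophisticated.

```python
# Minimal sketch of steps 1-3 (not the authors' released code): bucket each
# line of a raw trace by the skill it appears to exhibit, then re-order the
# segments into a skill-labeled target. The regex markers, section headers,
# and helper names are illustrative assumptions.
import re
from dataclasses import dataclass

VERIFY_PAT = re.compile(r"(let me check|verify|double-check)", re.IGNORECASE)
RETRY_PAT = re.compile(r"(that was wrong|try again|actually,)", re.IGNORECASE)

@dataclass
class SilverExample:
    prompt: str
    target: str  # Reasoning -> Verification -> Revised answer, in that order

def split_into_skill_segments(trace: str) -> dict[str, list[str]]:
    """Heuristically assign each line of a raw trace to a skill bucket."""
    segments = {"reasoning": [], "verification": [], "revision": []}
    for line in trace.splitlines():
        if VERIFY_PAT.search(line):
            segments["verification"].append(line)
        elif RETRY_PAT.search(line):
            segments["revision"].append(line)
        else:
            segments["reasoning"].append(line)
    return segments

def to_silver_example(prompt: str, trace: str) -> SilverExample | None:
    """Re-order a raw trace into Prompt -> Reasoning -> Verification -> Revised answer."""
    seg = split_into_skill_segments(trace)
    if not seg["verification"]:  # keep only traces that actually used the skill
        return None
    target = "\n".join(
        ["## Reasoning", *seg["reasoning"],
         "## Verification", *seg["verification"],
         "## Revised answer", *(seg["revision"] or seg["reasoning"][-1:])]
    )
    return SilverExample(prompt=prompt, target=target)

# Usage: `raw_traces` would come from sampling the base model on training prompts.
raw_traces = {
    "What is 17 * 24?": "17 * 24 = 408\nLet me check: 17*20 + 17*4 = 340 + 68 = 408\nFinal answer: 408",
}
silver_sft_data = [ex for p, t in raw_traces.items() if (ex := to_silver_example(p, t)) is not None]
```

The filtering choice (dropping traces with no detected verification) reflects the "silver" nature of the data: it trades coverage for examples that actually demonstrate the target skill.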

Results & Findings

Model                 | Pre‑RL Accuracy | Post‑RL Accuracy (hard variant) | Out‑of‑Domain Drop
----------------------|-----------------|---------------------------------|-------------------
Vanilla SFT → RL      | 78 %            | 84 %                            | −12 %
SkillFactory SFT → RL | 71 %            | 89 %                            | −5 %
No RL                 | 73 %            | 73 % (no improvement)           | N/A
  • SkillFactory SFT alone is slightly weaker than vanilla SFT, confirming that the silver data are noisy.
  • After RL, the SkillFactory‑initialized model outperforms the vanilla baseline on the harder test set (+5 % absolute).
  • Robustness: The SkillFactory model suffers far less regression when evaluated on a shifted domain (e.g., different problem style), indicating that the learned skills generalize.
  • Skill usage probes (e.g., prompting the model to output its verification step) show a higher rate of explicit verification in the SkillFactory models (≈ 68 % vs. 32 % for vanilla); a toy version of such a probe is sketched below.
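
As a rough illustration of how such a probe could be implemented, the snippet below counts how often sampled responses contain an explicit verification marker. The marker list and the `sample` helper are assumptions, not the paper's exact diagnostic.

```python
# Sketch of a simple skill-usage probe (an assumption, not the paper's exact
# diagnostic): count how often sampled responses contain an explicit
# verification marker. `sample(model, prompt)` is a hypothetical helper that
# returns one generated response string.
import re

VERIFY_MARKERS = re.compile(r"(verify|let me check|double-check)", re.IGNORECASE)

def verification_rate(sample, model, prompts, n_samples: int = 4) -> float:
    """Fraction of sampled responses that include an explicit verification step."""
    hits = total = 0
    for prompt in prompts:
        for _ in range(n_samples):
            total += 1
            if VERIFY_MARKERS.search(sample(model, prompt)):
                hits += 1
    return hits / max(total, 1)

# Compare e.g. verification_rate(sample, skillfactory_model, eval_prompts)
# against the same call on the vanilla baseline.
```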

Practical Implications

  • Cheaper skill acquisition: Developers can endow existing LLMs with reasoning tricks without training a massive teacher model or collecting expensive human‑annotated chain‑of‑thought data.
  • Plug‑and‑play pre‑training: The SkillFactory SFT stage can be inserted into any RLHF pipeline, making it a low‑overhead upgrade for products that already use RL fine‑tuning (e.g., code assistants, chatbots); a pipeline sketch follows this list.
  • Improved safety & reliability: Explicit verification steps reduce hallucinations and make model outputs more self‑correcting—valuable for high‑stakes applications like medical QA or financial advice.
  • Domain adaptability: Because the skills are generic (verify, backtrack, retry), the same SkillFactory data can be generated for new domains with just a few thousand in‑domain prompts, accelerating rapid prototyping.
  • Debugging aid: The skill‑annotated traces give engineers a clearer view of the model’s reasoning process, facilitating error analysis and targeted prompt engineering.
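
As a rough sketch of the plug‑and‑play point above, the snippet below shows where a skill‑aware SFT stage would slot into an existing SFT → RL pipeline. The stage callables (`build_silver_dataset`, `sft_finetune`, `rl_finetune`) are hypothetical placeholders for whatever trainers a given stack already uses, not APIs from the paper.

```python
# Sketch of where the skill-aware SFT stage slots in (assumptions, not the
# paper's code): the stage callables are placeholders for whatever SFT / RL
# trainers a given stack already uses.
from typing import Callable, Sequence

def skillfactory_pipeline(
    base_model,
    prompts: Sequence[str],
    build_silver_dataset: Callable,  # re-formats the model's own traces (see Methodology)
    sft_finetune: Callable,          # existing SFT trainer
    rl_finetune: Callable,           # existing RL (e.g., RLHF) trainer
):
    silver_data = build_silver_dataset(base_model, prompts)  # self-distilled, skill-labeled
    model = sft_finetune(base_model, silver_data)            # skill-aware initialization
    return rl_finetune(model)                                # RL stage is unchanged

# The vanilla baseline is the same flow with the silver-SFT step replaced by
# ordinary SFT data; nothing downstream of the RL trainer needs to change.
```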

Limitations & Future Work

  • Noise in silver data: The automatic skill extraction can mislabel or miss steps, which explains the modest pre‑RL performance dip. More sophisticated parsing or human‑in‑the‑loop cleaning could improve quality.
  • Skill taxonomy bound: The current implementation focuses on a handful of hand‑crafted skills; extending to richer cognitive behaviors (e.g., analogical reasoning) remains open.
  • Scalability to very large models: Experiments were run on 6B‑parameter models; it is unclear how the approach scales to 70B‑plus LLMs, where RLHF already consumes massive compute.
  • Evaluation breadth: The paper evaluates on a single benchmark family; broader testing across code generation, math, and commonsense reasoning would strengthen claims of generality.

Bottom line: SkillFactory shows that a modest, self‑distillation pre‑training step can seed LLMs with useful reasoning habits, leading to stronger and more robust performance after RL fine‑tuning—an attractive, low‑cost tool for developers building next‑generation AI assistants.

Authors

  • Zayne Sprague
  • Jack Lu
  • Manya Wadhwa
  • Sedrick Keh
  • Mengye Ren
  • Greg Durrett

Paper Information

  • arXiv ID: 2512.04072v1
  • Categories: cs.CL, cs.AI
  • Published: December 3, 2025