[Paper] Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

Published: February 16, 2026 at 01:45 PM EST
5 min read
Source: arXiv - 2602.15005v1

Overview

The paper proposes a novel way to understand what readers truly care about by turning diverse user signals—such as clicks, likes, and even activity on other platforms—into interest‑driven search queries. By training large language models (LLMs) with reinforcement learning, the authors generate high‑quality query lists that feed directly into a cross‑domain news recommender, achieving better personalization while keeping the system scalable for production use.

Key Contributions

  • Reinforcement‑learning‑driven query generation: Formulates the creation of interest‑focused query lists as a policy‑optimization problem and solves it with Group Relative Policy Optimization (GRPO).
  • Multi‑reward design: Combines relevance, diversity, and user‑engagement signals into a single reward function that guides the LLM toward useful queries.
  • Compute scaling study: Shows that both inference‑time sampling (more generated candidates) and larger model capacity consistently improve performance, exhibiting a predictable scaling law.
  • On‑policy distillation pipeline: Transfers the policy from a heavyweight teacher LLM to a lightweight student model, preserving most of the gains while meeting latency and resource constraints of real‑time recommendation.
  • Extensive validation: Provides offline experiments, ablation analyses, and a large‑scale online A/B test on a production news platform, demonstrating measurable lifts in both interest modeling metrics and downstream click‑through rates.
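The group‑relative idea at the heart of GRPO can be illustrated with a small sketch (the rewards below are made‑up toy numbers, not the paper's data): several query lists are sampled for the same user, each is scored, and each candidate's advantage is computed relative to its own sampling group rather than a learned value baseline.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled candidate's
    reward by the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0:
        # All candidates tied: no preference signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four query lists sampled for one user, each scored by the reward function.
group_rewards = [0.8, 0.5, 0.9, 0.4]
advantages = grpo_advantages(group_rewards)
```

Candidates scoring above the group mean receive positive advantages and are reinforced; the rest are suppressed, which is what lets GRPO skip training a separate critic.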

Methodology

  1. Signal aggregation – The system collects heterogeneous user actions from the news site and other domains (e.g., search, social media).
  2. Prompt‑based LLM generation – A large language model receives a prompt describing the user’s recent activity and is asked to output a short list of search‑style queries that capture the user’s latent interests.
  3. Reinforcement learning loop – The model’s policy is optimized with GRPO. The reward function blends:
    • Relevance: how well the generated queries match known user interests (via click logs).
    • Diversity: encouraging a breadth of topics to avoid echo chambers.
    • Engagement: predicted uplift in downstream recommendation metrics.
  4. Scaling experiments – The authors vary two axes: (a) the number of sampled queries per inference step, and (b) the size of the underlying LLM (from 350 M to 6 B parameters).
  5. Distillation – After training the large teacher, an on‑policy distillation step trains a compact student model to imitate the teacher’s query distribution, using KL‑divergence loss plus the same reward signals.
  6. Integration – The distilled query list is fed into the existing news ranking pipeline as an additional feature set, influencing which articles are shown to the user.
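The multi‑reward blend in step 3 can be sketched as a weighted sum of the three signals. The weights below are illustrative assumptions for the sketch; the paper tunes its own balance, and each component would in practice come from click logs, a diversity measure, and an engagement predictor.

```python
def blended_reward(relevance, diversity, engagement,
                   w_rel=0.5, w_div=0.2, w_eng=0.3):
    """Combine relevance, diversity, and engagement scores (each in [0, 1])
    into one scalar reward. Weights are illustrative, not from the paper."""
    return w_rel * relevance + w_div * diversity + w_eng * engagement

# A query list that is highly relevant, moderately diverse, and engaging.
reward = blended_reward(relevance=0.9, diversity=0.6, engagement=0.7)
```

A single scalar like this is what the GRPO loop optimizes; shifting weight toward diversity is one lever against the echo‑chamber effect the paper mentions.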

Results & Findings

| Metric | Large Teacher (6B) | Distilled Student (350M) | Baseline (no query generation) |
| --- | --- | --- | --- |
| Query relevance (nDCG@10) | 0.642 | 0.618 | 0.511 |
| Diversity (ILD) | 0.73 | 0.71 | 0.58 |
| Downstream CTR lift | +12.4% | +10.1% | — |
| Latency (ms) | 78 | 23 | 19 |
  • Scaling behavior: Each doubling of model size or sample count yields ~3–4 % incremental gain, following a smooth power‑law trend.
  • Distillation efficiency: The student recovers ~85 % of the teacher’s performance while cutting inference latency by ~70 %, making it viable for real‑time serving.
  • Online impact: In a live A/B test with millions of daily active users, the distilled model increased overall click‑through rate by 10.1 % and average session length by 5.3 %, with no degradation in system latency.
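The distillation step can be sketched as minimizing a KL divergence between teacher and student output distributions. The toy logits below stand in for full LLM policies over candidate queries; this is a minimal sketch of the loss, not the paper's training code.

```python
import math

def softmax(logits):
    """Convert raw logits into a categorical probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher and student logits over a tiny toy vocabulary of candidate queries.
teacher = softmax([2.0, 1.0, 0.5])
student = softmax([1.5, 1.2, 0.4])
distill_loss = kl_divergence(teacher, student)  # the student minimizes this
```

In the on‑policy variant, the student's own sampled queries are scored by the teacher (and by the reward function), so the loss is evaluated on states the student actually visits at serving time.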

Practical Implications

  • Richer user profiling: Developers can augment existing recommendation pipelines with a lightweight query‑generation module that captures interests beyond explicit clicks, improving cold‑start handling.
  • Scalable personalization: The distillation recipe lets teams deploy near‑state‑of‑the‑art LLM reasoning without sacrificing latency, fitting into micro‑service architectures.
  • Cross‑domain leverage: By ingesting signals from search, social, or e‑commerce platforms, news apps can surface articles that align with a user’s broader information needs, potentially increasing user stickiness.
  • Modular integration: The generated query list can be treated as an additional feature vector for any downstream ranking model (e.g., gradient‑boosted trees, deep CTR models), making adoption straightforward.
  • Open‑source potential: The authors’ code for GRPO‑based policy training and on‑policy distillation could be repurposed for other recommendation domains such as video or product suggestions.
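One simple way to expose a generated query list as a fixed‑size feature vector for a downstream ranker is the hashing trick. This featurization is an illustrative choice for the sketch, not the paper's integration method; in practice, query embeddings from the LLM itself would likely be used.

```python
def queries_to_features(queries, dim=16):
    """Hash query tokens into a fixed-size, L1-normalized count vector
    so any downstream ranker (GBDT, deep CTR model) can consume it."""
    vec = [0.0] * dim
    for query in queries:
        for token in query.lower().split():
            vec[hash(token) % dim] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

# Hypothetical generated queries for one user.
features = queries_to_features(["climate policy news", "ev battery startups"])
```

Because the output dimension is fixed regardless of how many queries were generated, the vector can be concatenated with existing user and item features without changing the ranker's input schema.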

Limitations & Future Work

  • Reward design complexity: Balancing relevance, diversity, and engagement requires careful tuning; suboptimal weights can lead to over‑personalization or topic drift.
  • Data privacy: Aggregating cross‑domain signals raises privacy considerations; the paper assumes compliant data pipelines but does not explore privacy‑preserving alternatives.
  • Model freshness: The LLM is trained offline; rapid shifts in trending topics may require frequent re‑training or online fine‑tuning, which the current pipeline does not address.
  • Generalization to other languages: Experiments are limited to English news; extending the approach to multilingual settings may need larger multilingual LLMs and language‑specific reward calibrations.

Future research directions include exploring privacy‑preserving federated learning for cross‑domain signals, continual learning mechanisms to keep the query generator up‑to‑date, and multilingual extensions to serve global news audiences.

Authors

  • Mengdan Zhu
  • Yufan Zhao
  • Tao Di
  • Yulan Yan
  • Liang Zhao

Paper Information

  • arXiv ID: 2602.15005v1
  • Categories: cs.CL, cs.IR
  • Published: February 16, 2026