[Paper] CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
Source: arXiv - 2602.24142v1
Overview
The paper introduces CoME (Channel‑of‑Mobile‑Experts), a new architecture for mobile AI agents that follows user instructions by chaining together several reasoning capabilities: summarizing the screen, planning subtasks, deciding actions, and executing functions. By structuring these capabilities as separate "experts" that are activated only when needed, CoME gains modular upgradability without sacrificing tight integration, a trade-off that has been a long-standing bottleneck for current mobile agents.
Key Contributions
- Channel‑of‑Mobile‑Experts (CoME) architecture: four dedicated experts (Screen‑Summarizer, Planner, Action‑Decider, Function‑Executor) that are invoked via an output‑oriented activation mechanism.
- Progressive training pipeline:
  - Expert‑FT – fine‑tunes each expert independently, allowing targeted capability upgrades.
  - Router‑FT – learns to route the conversation to the right expert at each reasoning stage.
  - CoT‑FT – fine‑tunes the whole chain as a "chain‑of‑thought", encouraging smooth hand‑offs between experts.
- InfoGain‑Driven DPO (Info‑DPO): a reinforcement‑style fine‑tuning step that scores intermediate steps by their information gain, reducing error propagation and nudging the agent toward more informative reasoning traces.
- Empirical validation: CoME outperforms dense (single‑model) mobile agents and existing mixture‑of‑experts (MoE) baselines on two mobile‑agent benchmark suites, AITZ (Android in the Zoo) and AMEX (Android Multi‑annotation EXpo).
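The output‑oriented activation mechanism named above can be sketched as follows. This is a toy illustration, not the paper's implementation: the stage markers, expert names, and canned expert outputs are all assumptions standing in for fine‑tuned language‑model experts.

```python
# Hypothetical sketch of CoME-style output-oriented activation.
# Each "expert" here is a stub returning a fixed token list that ends
# in a stage marker; the router watches the token stream and hands
# control to the next expert when a marker appears.

END_MARKERS = {
    "<END_SUMMARY>": "planner",
    "<END_PLAN>": "decider",
    "<END_ACTION>": "executor",
    "<END_EXEC>": None,  # channel finished
}

# Toy experts (illustrative stand-ins for the four fine-tuned models).
EXPERTS = {
    "summarizer": lambda ctx: ["screen", "shows", "login", "<END_SUMMARY>"],
    "planner":    lambda ctx: ["step1:", "tap", "login", "<END_PLAN>"],
    "decider":    lambda ctx: ["TAP(login_button)", "<END_ACTION>"],
    "executor":   lambda ctx: ["tap(120,340)", "<END_EXEC>"],
}

def run_channel(instruction):
    """Run the expert channel: the router inspects emitted tokens and
    activates the next expert when an end-of-stage marker is produced."""
    active, trace = "summarizer", []
    context = [instruction]
    while active is not None:
        for token in EXPERTS[active](context):
            context.append(token)
            if token in END_MARKERS:       # output-oriented switch
                trace.append(active)
                active = END_MARKERS[token]
                break
    return trace, context
```

Because switching is driven by the experts' own outputs rather than a fixed schedule, the same loop would let a future expert emit a different marker (e.g., to re‑plan) without changing the router.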
Methodology
- Modular Expert Design – The agent is split into four lightweight language models, each specialized for a distinct sub‑task of the overall instruction.
- Output‑Oriented Activation – Instead of a static pipeline, CoME watches the token stream; when a token signals the end of a stage (e.g., a summary is complete), the router switches to the next expert.
- Progressive Fine‑Tuning:
  - Expert‑FT: Each expert is fine‑tuned on a curated dataset that isolates its function (e.g., screen‑summaries from UI screenshots).
  - Router‑FT: A small classifier learns to predict the correct expert given the current dialogue context.
  - CoT‑FT: The whole system is then fine‑tuned end‑to‑end using chain‑of‑thought prompts, encouraging the experts to produce compatible intermediate outputs.
- Info‑DPO: During reinforcement‑style fine‑tuning, the system computes the information gain of each intermediate step (how much it reduces uncertainty about the final answer). Steps with higher gain receive larger rewards, steering the model away from redundant or misleading reasoning.
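The information‑gain idea behind Info‑DPO can be made concrete with a small sketch. The paper's exact formulation is not reproduced here; this assumes gain is measured as the entropy reduction over a distribution of candidate final answers before and after a reasoning step.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(dist_before, dist_after):
    """Information gain of one intermediate step: how much it reduces
    uncertainty about the final answer (assumed formulation)."""
    return entropy(dist_before) - entropy(dist_after)

# Toy example with four candidate answers: a useful step sharpens the
# distribution, a redundant step leaves uncertainty unchanged.
uniform   = [0.25, 0.25, 0.25, 0.25]   # before: 2 bits of uncertainty
sharpened = [0.85, 0.05, 0.05, 0.05]   # after a high-information step
redundant = [0.25, 0.25, 0.25, 0.25]   # after a low-information step

good_step_reward = info_gain(uniform, sharpened)   # positive
bad_step_reward  = info_gain(uniform, redundant)   # zero

def trace_score(step_gains):
    """Hypothetical trace-level score: preference pairs in Info-DPO-style
    training would favor traces accumulating more information gain."""
    return sum(step_gains)
```

Under this reading, redundant or misleading steps earn no reward, so preference optimization pushes the model toward traces whose intermediate outputs actually narrow down the final answer.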
Results & Findings
| Metric | Dense Mobile Agent | MoE Baseline | CoME (full) |
|---|---|---|---|
| Success Rate (AITZ) | 68.2 % | 71.5 % | 78.9 % |
| Task Completion (AMEX) | 62.4 % | 66.1 % | 74.3 % |
| Avg. Reasoning Steps | 12.4 | 11.8 | 9.6 |
| Info‑Gain Score (higher is better) | 0.41 | 0.45 | 0.58 |
- Higher success rates: CoME solves more tasks across both benchmarks, especially in complex multi‑step scenarios.
- Fewer steps: The router’s stage‑aware activation cuts down unnecessary back‑and‑forth, making the reasoning trace shorter and more interpretable.
- Improved robustness: Info‑DPO reduces error cascades; because training penalizes low‑information steps, later experts learn to recover when an early stage makes a mistake.
Practical Implications
- Developer‑friendly modular upgrades – Teams can improve a single capability (e.g., UI summarization) without retraining the whole agent, accelerating iteration cycles.
- Lower on‑device compute – Because each expert is lightweight and only one runs at a time, the memory footprint is smaller than that of a monolithic large language model, making CoME suitable for smartphones, wearables, and edge devices.
- Better debugging & observability – The explicit stage boundaries give developers clear logs (“summary generated”, “plan chosen”), simplifying troubleshooting of mobile assistants.
- Potential for plug‑and‑play ecosystems – Third‑party developers could ship specialized experts (e.g., for a new app’s API) that the router can invoke, fostering an extensible marketplace of mobile‑agent capabilities.
Limitations & Future Work
- Dataset bias – The training data for each expert comes from curated UI logs; performance may degrade on novel app designs or heavily customized interfaces.
- Router mis‑routing – Although Router‑FT improves stage prediction, occasional mis‑classifications still force the wrong expert to act, leading to failure cascades.
- Scalability of expert count – Adding more specialized experts (e.g., for voice input, AR overlays) could increase routing complexity; the paper leaves optimal scaling strategies for future research.
- User‑privacy considerations – On‑device fine‑tuning with personal UI data is promising but requires robust privacy‑preserving mechanisms, which are not explored in depth.
Overall, CoME demonstrates that a thoughtfully modular, stage‑aware architecture can deliver more capable and efficient mobile agents, opening a path toward truly assistant‑level AI on everyday devices.
Authors
- Yuxuan Liu
- Weikai Xu
- Kun Huang
- Changyu Chen
- Jiankun Zhao
- Pengzhi Gao
- Wei Liu
- Jian Luan
- Shuo Shang
- Bo Du
- Ji-Rong Wen
- Rui Yan
Paper Information
- arXiv ID: 2602.24142v1
- Categories: cs.CL, cs.AI
- Published: February 27, 2026