[Paper] Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially ...