Kavli Affiliate: Jing Wang | First 5 Authors: Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang | Summary: The Mixture-of-Experts (MoE) architecture has demonstrated significant advantages in the era of Large Language Models (LLMs), offering enhanced capabilities with reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to […]
Continue reading: HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
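To make the idea named in the title concrete, below is a minimal, hedged sketch of what "mixed-precision expert offloading" can look like for an MoE layer: hot experts stay resident in device memory in full precision, while cold experts live in host memory in a quantized low-precision form and are fetched (and dequantized) only when the router selects them. This is an illustration under my own assumptions, not HOBBIT's actual system; all names (`Expert`, `MixedPrecisionExpertCache`, the int8 scheme, the LRU policy) are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's implementation.
# Idea: resident experts are kept in fp32 ("device"); non-resident experts are
# stored quantized to int8 ("host") and transferred + dequantized on demand.
import numpy as np
from collections import OrderedDict

RNG = np.random.default_rng(0)


class Expert:
    """A toy FFN expert parameterized by two weight matrices."""
    def __init__(self, d_model: int, d_ff: int):
        self.w1 = RNG.standard_normal((d_model, d_ff)).astype(np.float32)
        self.w2 = RNG.standard_normal((d_ff, d_model)).astype(np.float32)


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization to shrink host->device transfers."""
    scale = float(np.abs(w).max()) / 127.0 + 1e-12
    return (w / scale).round().astype(np.int8), scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


class MixedPrecisionExpertCache:
    """Keeps up to `capacity` experts resident in fp32; the rest are stored as
    int8 and dequantized on a cache miss (LRU eviction)."""
    def __init__(self, experts, capacity: int):
        self.host = {i: (quantize_int8(e.w1), quantize_int8(e.w2))
                     for i, e in enumerate(experts)}
        self.device = OrderedDict()          # expert_id -> (w1_fp32, w2_fp32)
        self.capacity = capacity

    def fetch(self, idx: int):
        if idx in self.device:               # hit: expert already resident
            self.device.move_to_end(idx)
            return self.device[idx]
        (q1, s1), (q2, s2) = self.host[idx]  # miss: transfer + dequantize
        weights = (dequantize(q1, s1), dequantize(q2, s2))
        self.device[idx] = weights
        if len(self.device) > self.capacity:
            self.device.popitem(last=False)  # evict least recently used expert
        return weights


def moe_forward(x, cache, router_logits, top_k=2):
    """Route a hidden state through its top-k experts and average the outputs."""
    top = np.argsort(router_logits)[-top_k:]
    y = np.zeros_like(x)
    for idx in top:
        w1, w2 = cache.fetch(int(idx))
        y += np.maximum(x @ w1, 0.0) @ w2    # relu(x W1) W2
    return y / top_k


if __name__ == "__main__":
    d_model, d_ff, n_experts = 16, 32, 8
    experts = [Expert(d_model, d_ff) for _ in range(n_experts)]
    cache = MixedPrecisionExpertCache(experts, capacity=3)
    x = RNG.standard_normal(d_model).astype(np.float32)
    logits = RNG.standard_normal(n_experts)
    print(moe_forward(x, cache, logits).shape)   # -> (16,)
```

The sketch only captures the general trade-off the abstract alludes to (limited device memory vs. the cost of moving experts in from host memory); the paper's actual precision selection, prefetching, and scheduling policies should be taken from the full text.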