Kavli Affiliate: Li Xin Li
| First 5 Authors: MiniCPM Team
| Summary:
This paper introduces MiniCPM4, a highly efficient large language model (LLM)
designed explicitly for end-side devices. We achieve this efficiency through
systematic innovation in four key dimensions: model architecture, training
data, training algorithms, and inference systems. Specifically, in terms of
model architecture, we propose InfLLM v2, a trainable sparse attention
mechanism that accelerates both prefilling and decoding phases for
long-context processing (a block-sparse attention sketch follows this
summary). Regarding training data, we propose UltraClean, an efficient and
accurate pre-training data filtering and generation strategy, and UltraChat v2,
a comprehensive supervised fine-tuning dataset. These datasets enable
satisfactory model performance to be achieved using just 8 trillion training
tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient
pre-training strategy search, and improve existing post-training methods by
introducing chunk-wise rollout for load-balanced reinforcement learning and
BitCPM, a data-efficient ternary LLM (a quantization sketch also follows this
summary). Regarding inference systems, we propose CPM.cu, which integrates
sparse attention, model quantization, and speculative sampling (sketched
below) to achieve efficient prefilling and decoding. To meet diverse
on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B
parameters, respectively. Furthermore, we construct a hybrid reasoning model,
MiniCPM4.1, which can be used in both deep reasoning mode and non-reasoning
mode. Evaluation results demonstrate that MiniCPM4 and MiniCPM4.1 outperform
similar-sized open-source models across benchmarks, with the 8B variants
showing significant speed improvements on long sequence understanding and
generation.
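
The abstract names InfLLM v2, a trainable sparse attention mechanism, but gives
no algorithmic detail. Below is a minimal NumPy sketch of the general
block-sparse attention idea it belongs to: score key/value blocks cheaply, then
attend densely only within the top-k blocks. The block size, k, the mean-key
scoring rule, and all function names are illustrative assumptions, not the
InfLLM v2 design.

```python
# Minimal sketch of block-sparse attention: each query attends only to the
# top-k most relevant key/value blocks instead of the full sequence.
# Block size, k, and the block-scoring rule are illustrative assumptions,
# not the actual InfLLM v2 algorithm.
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, top_k=4):
    """q: (d,) single query; K, V: (n, d) keys/values; returns (d,) output."""
    n, d = K.shape
    n_blocks = (n + block_size - 1) // block_size
    # Score each block by the query's similarity to the block's mean key.
    block_scores = np.array([
        q @ K[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])
    # Keep only the top-k blocks; all other keys/values are skipped entirely.
    keep = np.argsort(block_scores)[-top_k:]
    idx = np.concatenate([
        np.arange(i * block_size, min((i + 1) * block_size, n)) for i in keep
    ])
    # Dense softmax attention restricted to the selected positions.
    logits = (K[idx] @ q) / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx]

out = block_sparse_attention(np.random.randn(128),
                             np.random.randn(4096, 128),
                             np.random.randn(4096, 128))
```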
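
BitCPM is described only as a data-efficient ternary LLM. The sketch below
shows one common ternary weight-quantization recipe (per-tensor absmean
scaling, as popularized by BitNet-style models) to make "ternary" concrete;
the abstract does not specify BitCPM's actual scheme.

```python
# Sketch of ternary weight quantization: each weight is mapped to {-1, 0, +1}
# with one per-tensor scale. This is the common "absmean" recipe used in
# BitNet-style models; BitCPM's exact quantization scheme may differ.
import numpy as np

def ternary_quantize(w, eps=1e-8):
    scale = np.abs(w).mean() + eps           # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes in {-1, 0, +1}
    return q.astype(np.int8), scale

def ternary_dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = ternary_quantize(w)
print(q)                        # int8 matrix of -1/0/+1 codes
print(ternary_dequantize(q, s)) # coarse reconstruction of w
```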
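
Finally, CPM.cu is said to integrate speculative sampling. The toy below shows
the standard single-token accept/reject rule from the speculative-decoding
literature: accept a draft token x with probability min(1, p_target(x) /
p_draft(x)), otherwise resample from the normalized residual distribution.
CPM.cu's actual drafting and verification pipeline is not described in the
abstract, so this is only a generic illustration.

```python
# Toy sketch of the standard speculative-sampling accept/reject rule: a cheap
# draft model proposes a token, and the target model accepts it with
# probability min(1, p_target/p_draft), otherwise resamples from the residual.
# This preserves the target distribution exactly while letting the draft model
# do most of the work.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_draft, p_target):
    """One-token accept/reject; p_* are probability vectors over the vocab."""
    x = rng.choice(len(p_draft), p=p_draft)           # draft proposal
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x                                      # accepted: target-distributed
    resid = np.maximum(p_target - p_draft, 0.0)       # rejected: resample residual
    resid /= resid.sum()
    return rng.choice(len(resid), p=resid)

p_draft = np.array([0.7, 0.2, 0.1])
p_target = np.array([0.4, 0.4, 0.2])
print([speculative_step(p_draft, p_target) for _ in range(5)])
```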
| Search Query: ArXiv Query: search_query=au:"Li Xin Li"&id_list=&start=0&max_results=3