MiniCPM-o Team, OpenBMB

➤ Project Website

🔗 https://github.com/OpenBMB/MiniCPM-o

➤ Model Weights

🔗 https://huggingface.co/openbmb/MiniCPM-o-2_6

🔗 https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6

➤ Demo

🔗 https://minicpm-omni-webdemo-us.modelbest.cn/
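For orientation, the sketch below loads the released checkpoint listed above through the Hugging Face `transformers` remote-code interface and asks a single image question. This is a minimal sketch, not official usage: the `chat()` call follows the convention of earlier MiniCPM-V releases and is an assumption here, and the image path is hypothetical; the model card linked above is the authoritative reference.

```python
# Minimal sketch (assumption: the checkpoint exposes a MiniCPM-V-style
# `chat()` method via trust_remote_code; see the model card for exact usage).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,     # loads the custom MiniCPM-o modeling code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]

# Assumed MiniCPM-V-style chat interface; audio and speech I/O need extra setup.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```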


⛰️ Climbing towards the deeper mountain: the development trend of multimodal large language models (MLLMs). MLLMs are not only growing stronger but also gaining more versatile modality capabilities in a real-world streaming fashion. Unlike the traditional linear progression of vision-language models, this trend is more multidimensional, resembling an ascent into deeper mountains that may hold more transformative treasures, as illustrated in the figure.

Introduction

The exciting bloom of multimodal large language models (MLLMs) started from vision and language, with increasingly strong image understanding capabilities emerging in the open-source community. However, our physical world is essentially a parallel, continuous stream of broad multimodal information, which remains far beyond the reach of most current MLLMs. Recent breakthroughs like GPT-4o and Gemini 2.0 have taken the first steps towards this goal, setting an ambitious and promising trajectory for future developments.

To facilitate exploration in the open-source community, we present MiniCPM-o 2.6, our latest and most capable on-device MLLM, upgraded from the MiniCPM-V series. The model takes image, video, text, and audio as inputs and produces high-quality text and speech outputs in an end-to-end fashion. With a total of 8B parameters, MiniCPM-o 2.6 achieves performance comparable to GPT-4o-202405 in vision, speech, and multimodal live streaming, making it one of the most versatile and performant models in the open-source community. Notable features of MiniCPM-o 2.6 include: