MiniCPM-o Team, OpenBMB

➤ Project Website

🔗 https://github.com/OpenBMB/MiniCPM-o

➤ Model Weights

🔗 https://huggingface.co/openbmb/MiniCPM-o-2_6

🔗 https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6

➤ Demo

🔗 https://minicpm-omni-webdemo-us.modelbest.cn/
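For orientation, the sketch below loads the released checkpoint listed above through the Hugging Face `transformers` remote-code interface and asks a single image question. This is a minimal sketch, not official usage: the `chat()` call follows the convention of earlier MiniCPM-V releases and is an assumption here, and the image path is hypothetical; the model card linked above is the authoritative reference.

```python
# Minimal sketch (assumption: the checkpoint exposes a MiniCPM-V-style
# `chat()` method via trust_remote_code; see the model card for exact usage).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,     # loads the custom MiniCPM-o modeling code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]

# Assumed MiniCPM-V-style chat interface; audio and speech I/O need extra setup.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```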


⛰️ Climbing towards the deeper mountain: the development trend of multimodal large language models (MLLMs). MLLMs are not only growing stronger but also gaining more versatile modality capabilities in a real-world streaming fashion. Unlike the traditional linear progression of vision-language models, this trend is more multidimensional, resembling an ascent into deeper mountains that may hold more transformative treasures, as illustrated in the figure.

Introduction

The exciting bloom of multimodal large language models (MLLMs) started from vision and language, with increasingly strong image understanding capabilities emerging in the open-source community. However, our physical world is essentially a parallel, continuous stream of broad multimodal information, which remains far beyond the reach of most current MLLMs. Recent breakthroughs like GPT-4o and Gemini 2.0 have taken the first steps towards this goal, setting an ambitious and promising trajectory for future developments.

To facilitate exploration in the open-source community, we present MiniCPM-o 2.6, our latest and most capable on-device MLLM, upgraded from the MiniCPM-V series. The model takes image, video, text, and audio as inputs and produces high-quality text and speech outputs in an end-to-end fashion. With a total of 8B parameters, MiniCPM-o 2.6 achieves performance comparable to GPT-4o-202405 in vision, speech, and multimodal live streaming, making it one of the most versatile and performant models in the open-source community. Notable features of MiniCPM-o 2.6 include: