M4U-BENCHMARK

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang,
Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen
Institute of Computing Technology, Chinese Academy of Sciences,

Abstract

Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, suffer performance degradation when prompted with cross-lingual multimodal questions, such as images with key textual information in Chinese while the question is in German. We believe that M4U can serve as a crucial tool for systematically evaluating LMMs based on their multilingual multimodal reasoning capabilities and monitoring their development.


An illustration of multi-discipline multilingual multimodal understanding. Both the textual questions and the images contain multilingual content. We highlight the Chinese content in yellow. English translations are provided for better readability.

Our code and data are available at Hugging Face Dataset.

M4U Dataset

Overview


Detailed statistics of the M4U dataset.

To boost the development of multilingual multimodal models, in this work, we introduce M4U, a novel and challenging benchmark for evaluating foundation models on expert-level multilingual multimodal understanding and reasoning. Specifically, we recruit a team of over 10 college and graduate students to collect high-quality data and assess its difficulty and correctness. M4U consists of 8,931 multiple-choice questions covering 64 disciplines of 16 subfields from Science, Engineering, and Healthcare in Chinese, English, and German. To minimize the risk of data contamination, the samples are collected from college exams and quizzes from online video lectures. Furthermore, a large portion (35%) of M4U is written by our team based on textbooks.
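For readers who want to browse the data, a minimal loading sketch with the Hugging Face datasets library is shown below. The repository id, split, and field names are assumptions for illustration and may differ from the released configuration, so please refer to the dataset card.

```python
# Minimal sketch for browsing M4U with the Hugging Face `datasets` library.
# The dataset path and field names below are assumptions for illustration;
# check the dataset card on Hugging Face for the actual configuration.
from datasets import load_dataset

# Hypothetical repository id and split name.
dataset = load_dataset("M4U-Benchmark/M4U", split="test")

sample = dataset[0]
# Each sample is expected to be a multiple-choice question with an image,
# a question string, a list of options, and the gold answer.
print(sample.get("question"))
print(sample.get("options"))
print(sample.get("answer"))
```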


With M4U, we conduct extensive evaluations of 21 leading LMMs and LLMs with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. In addition, we observe that the leading LMMs have significant language preferences.
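For reference, the Average column in the leaderboard below is consistent with the unweighted mean of the three per-language accuracies; the small sketch below reproduces the 47.6% figure for GPT-4o from its Chinese, English, and German scores.

```python
# Unweighted mean of per-language accuracies, using the GPT-4o scores
# reported in the leaderboard below.
per_language = {"Chinese": 47.8, "English": 49.4, "German": 45.6}

average = sum(per_language.values()) / len(per_language)
print(f"Average accuracy: {average:.1f}%")  # -> 47.6%
```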


Our in-depth analysis shows that leading LMMs, including GPT-4o, suffer from performance degradation when prompted with cross-lingual multimodal questions, e.g., images with key textual information in Chinese while the question is in English or German.
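As a rough illustration of this cross-lingual setting, the sketch below pairs a single image, whose key textual information is in Chinese, with the question text in each language and collects the answers; the sample fields and the query_lmm helper are hypothetical placeholders, not the actual evaluation script.

```python
# Illustrative probe of the cross-lingual setting: the same image is paired
# with the question text in each of the three languages, and per-language
# accuracy is compared over many such items to measure degradation.
# `sample` fields and `query_lmm` are hypothetical placeholders.
def query_lmm(image, prompt: str) -> str:
    raise NotImplementedError("plug in the LMM under evaluation here")

def evaluate_cross_lingual(sample) -> dict:
    results = {}
    for lang in ("Chinese", "English", "German"):
        prompt = sample["question"][lang] + "\nAnswer with the option letter."
        results[lang] = query_lmm(sample["image"], prompt)
    return results
```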


We further analyze the impact of different types of visual content and image positions. The experimental results show that GPT-4o significantly outperforms the other models on questions with medical images, while LLaVA-NeXT struggles with questions whose images appear in the options.

Experimental Results

Leaderboard

| # | Model | Method | Source | Size | Chinese | English | German | Average |
|---|-------|--------|--------|------|---------|---------|--------|---------|
| 1 | GPT-4o | LMM | Link | - | 47.8 | 49.4 | 45.6 | 47.6 |
| 2 | GPT-4V + CoT | LMM | Link | - | 43.9 | 43.6 | 40.3 | 42.6 |
| 3 | GPT-4V | LMM | Link | - | 39.7 | 39.4 | 37.3 | 38.8 |
| 4 | LLaVA-NeXT-34B | LMM | Link | 34B | 38.5 | 36.2 | 35.2 | 36.6 |
| 5 | Gemini 1.0 Pro + CoT | LMM | Link | - | 34.4 | 34.2 | 33.9 | 34.2 |
| 6 | Gemini 1.0 Pro | LMM | Link | - | 34.9 | 32.7 | 30.8 | 32.8 |
| 6 | Qwen-1.5-14B-Chat + Caption | Tool | Link | 14B | 32.7 | 32.0 | 33.8 | 32.8 |
| 8 | Yi-VL-34B | LMM | Link | 34B | 33.5 | 33.3 | 30.5 | 32.4 |
| 9 | Yi-VL-6B | LMM | Link | 6B | 33.4 | 31.4 | 29.7 | 31.5 |
| 10 | DeepSeek-VL | LMM | Link | 7B | 30.4 | 32.8 | 30.8 | 31.3 |
| 11 | Qwen-1.5-7B-Chat + Caption | Tool | Link | 7B | 34.2 | 27.7 | 31.7 | 31.2 |
| 11 | Gemini 1.0 Pro + Caption | Tool | Link | - | 31.6 | 31.1 | 30.9 | 31.2 |
| 13 | InternLM-XComposer | LMM | Link | 7B | 31.8 | 31.6 | 29.1 | 30.8 |
| 14 | LLaVA-NeXT-Mistral-7B | LMM | Link | 7B | 28.2 | 30.6 | 29.4 | 29.4 |
| 15 | CogVLM-Chat | LMM | Link | 7B | 28.9 | 30.2 | 28.5 | 29.2 |
| 16 | Qwen-VL-Chat | LMM | Link | 7B | 29.7 | 29.9 | 27.1 | 28.9 |
| 17 | LLaVA-NeXT-Vicuna-13B | LMM | Link | 13B | 21.9 | 30.9 | 29.3 | 27.4 |
| 18 | Mistral-Instruct-v0.2-7B + Caption | Tool | Link | 7B | 24.9 | 24.9 | 26.9 | 25.6 |
| 19 | LLaVA-NeXT-Vicuna-7B | LMM | Link | 7B | 11.8 | 29.8 | 28.2 | 23.3 |
| 20 | InstructBLIP-Vicuna-7B | LMM | Link | 7B | 13.7 | 28.1 | 19.7 | 20.5 |
| 21 | InstructBLIP-Vicuna-13B | LMM | Link | 13B | 10.5 | 23.4 | 18.6 | 17.5 |
| 22 | Ying-VLM | LMM | Link | 13B | 22.3 | 11.2 | 15.6 | 16.4 |
| 23 | VisualGLM | LMM | Link | 6B | 8.7 | 22.4 | 13.5 | 14.9 |

Method types: LMM 🖼️: Large Multimodal Model, Tool 🛠️: Tool-augmented Large Language Model. The captions are generated by Gemini 1.0 Pro.
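For the Tool rows, the general recipe is two-stage: a vision model first captions the image, and a text-only LLM then answers the multiple-choice question from the caption alone. Below is a minimal sketch of this pipeline; the helper functions are placeholders for the respective model APIs, not the exact evaluation code.

```python
# Sketch of the tool-augmented ("Caption") pipeline used by the Tool rows:
# 1) a vision model (Gemini 1.0 Pro in the leaderboard) captions the image,
# 2) a text-only LLM answers the multiple-choice question given the caption.
# Both helper functions are placeholders for the respective model APIs.
def caption_image(image) -> str:
    raise NotImplementedError("call the captioning model, e.g. Gemini 1.0 Pro")

def answer_with_llm(prompt: str) -> str:
    raise NotImplementedError("call the text-only LLM, e.g. Qwen-1.5-Chat")

def tool_augmented_answer(image, question: str, options: list[str]) -> str:
    caption = caption_image(image)
    formatted_options = "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)
    )
    prompt = (
        f"Image description: {caption}\n\n"
        f"Question: {question}\n{formatted_options}\n"
        "Answer with the option letter only."
    )
    return answer_with_llm(prompt)
```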

🚨 The leaderboard is continuously being updated. To submit your results to the leaderboard, please send your result JSON files to Hongyu Wang and Ruiping Wang.

🔮 The evaluation instructions are available at Evaluations on M4U.

Qualitative Analysis

We conduct a qualitative analysis of the results of GPT-4V with chain-of-thought prompting. We randomly sample 75 questions (2.5%) from different disciplines of each language. In these instances, GPT-4V makes errors in its responses or analysis in at least one language. We analyze the causes of these failure cases and divide them into six categories: perceptual error, lack of knowledge, reasoning error, textual understanding error, annotation error, and answer extraction error. Perceptual error, lack of knowledge, and reasoning error account for the majority of failure cases (96% in Chinese, 95% in English, and 92% in German). GPT-4V tends to exhibit a lack of knowledge on the Chinese part of M4U, while reasoning errors are more likely to occur in German and English. These findings demonstrate that LMMs still have significant room for improvement, particularly in multilingual multimodal reasoning.
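Since answer extraction error is one of the categories above, the sketch below shows a simple heuristic for pulling an option letter out of a free-form response; it is illustrative only and not the exact extraction rule used in our evaluation.

```python
import re

# Illustrative heuristic for extracting a multiple-choice option letter (A-D)
# from a model's free-form response. Responses without a recognizable letter
# are the kind of cases counted as answer extraction errors.
def extract_choice(response: str, letters: str = "ABCD") -> str | None:
    # Match a standalone option letter, optionally wrapped in parentheses
    # or followed by punctuation, e.g. "(B)", "B." or "Answer: B".
    match = re.search(rf"\(?\b([{letters}])\b\)?[.):]?", response.strip())
    return match.group(1) if match else None

print(extract_choice("The correct answer is (C) because ..."))        # -> "C"
print(extract_choice("I think the equivalent resistance is 5 ohms."))  # -> None
```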


Visualization Examples