M4U-BENCHMARK

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang,
Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen
Institute of Computing Technology, Chinese Academy of Sciences,

Abstract

Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, suffer performance degradation when prompted with cross-lingual multimodal questions, such as images with key textual information in Chinese while the question is in German. We believe that M4U can serve as a crucial tool for systematically evaluating LMMs based on their multilingual multimodal reasoning capabilities and monitoring their development.


An illustration of multi-discipline multilingual multimodal understanding. Both the textual questions and the images contain multilingual content. We highlight the Chinese content in yellow. English translations are provided for better readability.

Our code and data are available at Hugging Face Dataset.

M4U Dataset

Overview


Detailed statistics of the M4U dataset.

To boost the development of multilingual multimodal models, in this work, we introduce M4U, a novel and challenging benchmark for evaluating foundation models on expert-level multilingual multimodal understanding and reasoning. Specifically, we recruit a team of over 10 college and graduate students to collect high-quality data and assess its difficulty and correctness. M4U consists of 8,931 multiple-choice questions covering 64 disciplines of 16 subfields from Science, Engineering, and Healthcare in Chinese, English, and German. To minimize the risk of data contamination, the samples are collected from college exams and quizzes from online video lectures. Furthermore, a large portion (35%) of M4U is written by our team based on textbooks.
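For readers who want to browse the data, a minimal loading sketch with the Hugging Face datasets library is shown below. The repository id, split, and field names are assumptions for illustration and may differ from the released configuration, so please refer to the dataset card.

```python
# Minimal sketch for browsing M4U with the Hugging Face `datasets` library.
# The dataset path and field names below are assumptions for illustration;
# check the dataset card on Hugging Face for the actual configuration.
from datasets import load_dataset

# Hypothetical repository id and split name.
dataset = load_dataset("M4U-Benchmark/M4U", split="test")

sample = dataset[0]
# Each sample is expected to be a multiple-choice question with an image,
# a question string, a list of options, and the gold answer.
print(sample.get("question"))
print(sample.get("options"))
print(sample.get("answer"))
```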


With M4U, we conduct extensive evaluations of 21 leading LMMs and LLMs with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. In addition, we observe that the leading LMMs have significant language preferences.
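For reference, the Average column in the leaderboard below is consistent with the unweighted mean of the three per-language accuracies; the small sketch below reproduces the 47.6% figure for GPT-4o from its Chinese, English, and German scores.

```python
# Unweighted mean of per-language accuracies, using the GPT-4o scores
# reported in the leaderboard below.
per_language = {"Chinese": 47.8, "English": 49.4, "German": 45.6}

average = sum(per_language.values()) / len(per_language)
print(f"Average accuracy: {average:.1f}%")  # -> 47.6%
```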


Our in-depth analysis shows that leading LMMs, including GPT-4o, suffer from performance degradation when prompted with cross-lingual multimodal questions, e.g., images with key textual information in Chinese while the question is in English or German.
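As a rough illustration of this cross-lingual setting, the sketch below pairs a single image, whose key textual information is in Chinese, with the question text in each language and collects the answers; the sample fields and the query_lmm helper are hypothetical placeholders, not the actual evaluation script.

```python
# Illustrative probe of the cross-lingual setting: the same image is paired
# with the question text in each of the three languages, and per-language
# accuracy is compared over many such items to measure degradation.
# `sample` fields and `query_lmm` are hypothetical placeholders.
def query_lmm(image, prompt: str) -> str:
    raise NotImplementedError("plug in the LMM under evaluation here")

def evaluate_cross_lingual(sample) -> dict:
    results = {}
    for lang in ("Chinese", "English", "German"):
        prompt = sample["question"][lang] + "\nAnswer with the option letter."
        results[lang] = query_lmm(sample["image"], prompt)
    return results
```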


We further analyze the impact of different types of visual content and image positions. The experimental results show that GPT-4o significantly outperforms the other models on questions with medical images, while LLaVA-NeXT struggles with questions whose images appear in the options.

Experimental Results

Leaderboard

| # | Model | Method | Source | Size | Chinese | English | German | Average |
|---|-------|--------|--------|------|---------|---------|--------|---------|
| 1 | GPT-4o | LMM | Link | - | 47.8 | 49.4 | 45.6 | 47.6 |
| 2 | GPT-4V + CoT | LMM | Link | - | 43.9 | 43.6 | 40.3 | 42.6 |
| 3 | GPT-4V | LMM | Link | - | 39.7 | 39.4 | 37.3 | 38.8 |
| 4 | LLaVA-NeXT-34B | LMM | Link | 34B | 38.5 | 36.2 | 35.2 | 36.6 |
| 5 | Gemini 1.0 Pro + CoT | LMM | Link | - | 34.4 | 34.2 | 33.9 | 34.2 |
| 6 | Gemini 1.0 Pro | LMM | Link | - | 34.9 | 32.7 | 30.8 | 32.8 |
| 6 | Qwen-1.5-14B-Chat + Caption | Tool | Link | 14B | 32.7 | 32.0 | 33.8 | 32.8 |
| 8 | Yi-VL-34B | LMM | Link | 34B | 33.5 | 33.3 | 30.5 | 32.4 |
| 9 | Yi-VL-6B | LMM | Link | 6B | 33.4 | 31.4 | 29.7 | 31.5 |
| 10 | DeepSeek-VL | LMM | Link | 7B | 30.4 | 32.8 | 30.8 | 31.3 |
| 11 | Qwen-1.5-7B-Chat + Caption | Tool | Link | 7B | 34.2 | 27.7 | 31.7 | 31.2 |
| 11 | Gemini 1.0 Pro + Caption | Tool | Link | - | 31.6 | 31.1 | 30.9 | 31.2 |
| 13 | InternLM-XComposer | LMM | Link | 7B | 31.8 | 31.6 | 29.1 | 30.8 |
| 14 | LLaVA-NeXT-Mistral-7B | LMM | Link | 7B | 28.2 | 30.6 | 29.4 | 29.4 |
| 15 | CogVLM-Chat | LMM | Link | 7B | 28.9 | 30.2 | 28.5 | 29.2 |
| 16 | Qwen-VL-Chat | LMM | Link | 7B | 29.7 | 29.9 | 27.1 | 28.9 |
| 17 | LLaVA-NeXT-Vicuna-13B | LMM | Link | 13B | 21.9 | 30.9 | 29.3 | 27.4 |
| 18 | Mistral-Instruct-v0.2-7B + Caption | Tool | Link | 7B | 24.9 | 24.9 | 26.9 | 25.6 |
| 19 | LLaVA-NeXT-Vicuna-7B | LMM | Link | 7B | 11.8 | 29.8 | 28.2 | 23.3 |
| 20 | InstructBLIP-Vicuna-7B | LMM | Link | 7B | 13.7 | 28.1 | 19.7 | 20.5 |
| 21 | InstructBLIP-Vicuna-13B | LMM | Link | 13B | 10.5 | 23.4 | 18.6 | 17.5 |
| 22 | Ying-VLM | LMM | Link | 13B | 22.3 | 11.2 | 15.6 | 16.4 |
| 23 | VisualGLM | LMM | Link | 6B | 8.7 | 22.4 | 13.5 | 14.9 |

Method types: LMM 🖼️: Large Multimodal Model, Tool 🛠️: Tool-augmented Large Language Model. The captions are generated by Gemini 1.0 Pro.
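For the Tool rows, the general recipe is two-stage: a vision model first captions the image, and a text-only LLM then answers the multiple-choice question from the caption alone. Below is a minimal sketch of this pipeline; the helper functions are placeholders for the respective model APIs, not the exact evaluation code.

```python
# Sketch of the tool-augmented ("Caption") pipeline used by the Tool rows:
# 1) a vision model (Gemini 1.0 Pro in the leaderboard) captions the image,
# 2) a text-only LLM answers the multiple-choice question given the caption.
# Both helper functions are placeholders for the respective model APIs.
def caption_image(image) -> str:
    raise NotImplementedError("call the captioning model, e.g. Gemini 1.0 Pro")

def answer_with_llm(prompt: str) -> str:
    raise NotImplementedError("call the text-only LLM, e.g. Qwen-1.5-Chat")

def tool_augmented_answer(image, question: str, options: list[str]) -> str:
    caption = caption_image(image)
    formatted_options = "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)
    )
    prompt = (
        f"Image description: {caption}\n\n"
        f"Question: {question}\n{formatted_options}\n"
        "Answer with the option letter only."
    )
    return answer_with_llm(prompt)
```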

🚨 The leaderboard is continuously being updated. To submit your results to the leaderboard, please send your result JSON files to Hongyu Wang and Ruiping Wang.

🔮 The evaluation instructions are available at Evaluations on M4U.

Qualitative Analysis

We conduct a qualitative analysis of the results of GPT-4V with chain-of-thought prompting. We randomly sample 75 questions (2.5%) from different disciplines of each language. In these instances, GPT-4V makes errors in its responses or analysis in at least one language. We analyze the causes of these failure cases and divide them into six categories: perceptual error, lack of knowledge, reasoning error, textual understanding error, annotation error, and answer extraction error. Perceptual error, lack of knowledge, and reasoning error account for the majority of failure cases (96% in Chinese, 95% in English, and 92% in German). GPT-4V tends to exhibit a lack of knowledge on the Chinese part of M4U, while reasoning errors are more likely to occur in German and English. These findings demonstrate that LMMs still have significant room for improvement, particularly in multilingual multimodal reasoning.
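Since answer extraction error is one of the categories above, the sketch below shows a simple heuristic for pulling an option letter out of a free-form response; it is illustrative only and not the exact extraction rule used in our evaluation.

```python
import re

# Illustrative heuristic for extracting a multiple-choice option letter (A-D)
# from a model's free-form response. Responses without a recognizable letter
# are the kind of cases counted as answer extraction errors.
def extract_choice(response: str, letters: str = "ABCD") -> str | None:
    # Match a standalone option letter, optionally wrapped in parentheses
    # or followed by punctuation, e.g. "(B)", "B." or "Answer: B".
    match = re.search(rf"\(?\b([{letters}])\b\)?[.):]?", response.strip())
    return match.group(1) if match else None

print(extract_choice("The correct answer is (C) because ..."))        # -> "C"
print(extract_choice("I think the equivalent resistance is 5 ohms."))  # -> None
```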


Visualization Examples