| # | Model | Method | Source | Size | Chinese | English | German | Average |
|---|-------|--------|--------|------|---------|---------|--------|---------|
| 1 | GPT-4o | LMM | Link | - | 47.8 | 49.4 | 45.6 | 47.6 |
| 2 | GPT-4V + CoT | LMM | Link | - | 43.9 | 43.6 | 40.3 | 42.6 |
| 3 | GPT-4V | LMM | Link | - | 39.7 | 39.4 | 37.3 | 38.8 |
| 4 | LLaVA-NeXT-34B | LMM | Link | 34B | 38.5 | 36.2 | 35.2 | 36.6 |
| 5 | Gemini 1.0 Pro + CoT | LMM | Link | - | 34.4 | 34.2 | 33.9 | 34.2 |
| 6 | Gemini 1.0 Pro | LMM | Link | - | 34.9 | 32.7 | 30.8 | 32.8 |
| 6 | Qwen-1.5-14B-Chat + Caption | Tool | Link | 14B | 32.7 | 32.0 | 33.8 | 32.8 |
| 8 | Yi-VL-34B | LMM | Link | 34B | 33.5 | 33.3 | 30.5 | 32.4 |
| 9 | Yi-VL-6B | LMM | Link | 6B | 33.4 | 31.4 | 29.7 | 31.5 |
| 10 | DeepSeek-VL | LMM | Link | 7B | 30.4 | 32.8 | 30.8 | 31.3 |
| 11 | Qwen-1.5-7B-Chat + Caption | Tool | Link | 7B | 34.2 | 27.7 | 31.7 | 31.2 |
| 11 | Gemini 1.0 Pro + Caption | Tool | Link | - | 31.6 | 31.1 | 30.9 | 31.2 |
| 13 | InternLM-XComposer | LMM | Link | 7B | 31.8 | 31.6 | 29.1 | 30.8 |
| 14 | LLaVA-NeXT-Mistral-7B | LMM | Link | 7B | 28.2 | 30.6 | 29.4 | 29.4 |
| 15 | CogVLM-Chat | LMM | Link | 7B | 28.9 | 30.2 | 28.5 | 29.2 |
| 16 | Qwen-VL-Chat | LMM | Link | 7B | 29.7 | 29.9 | 27.1 | 28.9 |
| 17 | LLaVA-NeXT-Vicuna-13B | LMM | Link | 13B | 21.9 | 30.9 | 29.3 | 27.4 |
| 18 | Mistral-Instruct-v0.2-7B + Caption | Tool | Link | 7B | 24.9 | 24.9 | 26.9 | 25.6 |
| 19 | LLaVA-NeXT-Vicuna-7B | LMM | Link | 7B | 11.8 | 29.8 | 28.2 | 23.3 |
| 20 | InstructBLIP-Vicuna-7B | LMM | Link | 7B | 13.7 | 28.1 | 19.7 | 20.5 |
| 21 | InstructBLIP-Vicuna-13B | LMM | Link | 13B | 10.5 | 23.4 | 18.6 | 17.5 |
| 22 | Ying-VLM | LMM | Link | 13B | 22.3 | 11.2 | 15.6 | 16.4 |
| 23 | VisualGLM | LMM | Link | 6B | 8.7 | 22.4 | 13.5 | 14.9 |
## Leaderboard on M4U-mini
| # | Model | Method | Source | Size | Chinese | English | German | Japanese | Arabic | Thai | Average |
|---|-------|--------|--------|------|---------|---------|--------|----------|--------|------|---------|
| 1 | GPT-4o | LMM | Link | - | 53.7 | 44.9 | 42.4 | 49.1 | 45.2 | 48.8 | 47.3 |
| 2 | InternVL2.5-26B | LMM | Link | 26B | 51.3 | 44.2 | 48.1 | 46.4 | 37.6 | 47.3 | 44.2 |
| 3 | Qwen2-VL-7B-Instruct | LMM | Link | 7B | 46.6 | 43.5 | 44.1 | 47.6 | 41.5 | 41.4 | 44.1 |
| 4 | Gemini-1.5-Flash | LMM | Link | - | 46.3 | 35.4 | 42.8 | 39.0 | 38.4 | 40.1 | 40.3 |
| 5 | InternVL2.5-8B | LMM | Link | 8B | 38.5 | 41.7 | 38.3 | 36.1 | 31.4 | 31.7 | 36.3 |
| 6 | LLaVA-NeXT-34B | LMM | Link | 34B | 44.2 | 44.1 | 39.0 | 36.0 | 11.4 | 34.0 | 34.8 |
| 7 | Phi-3.5-Vision-Instruct | LMM | Link | 4.2B | 27.2 | 34.3 | 33.4 | 30.4 | 31.7 | 30.9 | 31.3 |
| 8 | DeepSeek-VL-Chat | LMM | Link | 7B | 33.6 | 35.4 | 35.0 | 32.1 | 24.8 | 25.4 | 31.0 |
| 9 | Qwen2.5-14B-Instruct | Tool | Link | 14B | 25.7 | 35.0 | 25.7 | 13.6 | 35.7 | 13.8 | 24.9 |
| 10 | Qwen1.5-14B-Chat | Tool | Link | 14B | 17.7 | 28.9 | 29.5 | 19.3 | 26.9 | 12.0 | 22.4 |
Method types: LMM 🖼️: Large Multimodal Model; Tool 🛠️: Tool-augmented Large Language Model (a sketch of this setting follows these notes). The captions are generated by Gemini 1.0 Pro.
🚨 The leaderboard is continuously being updated. To submit your results to the leaderboard, please send your result JSON files to Hongyu Wang and Ruiping Wang.
🔮 The evaluation instructions are available at Evaluations on M4U.
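For reference, the Tool setting listed above pairs a text-only LLM with an external captioner: the image is first described by Gemini 1.0 Pro, and the LLM answers from the caption and the question text alone. Below is a minimal sketch of that pipeline; the helper functions and prompt layout are placeholders, not the released evaluation code.

```python
# Hypothetical sketch of the caption-augmented ("Tool") evaluation pipeline.
# `caption_image` and `query_llm` stand in for API calls to the captioning
# model (Gemini 1.0 Pro) and the text-only LLM, respectively.

def caption_image(image_path: str) -> str:
    """Return a textual description of the image (placeholder)."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Return the LLM's answer to a text prompt (placeholder)."""
    raise NotImplementedError

def answer_with_caption(image_path: str, question: str, options: list[str]) -> str:
    """Answer a multiple-choice question using only the caption of its image."""
    caption = caption_image(image_path)
    choices = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Options:\n{choices}\n"
        "Answer with the letter of the correct option."
    )
    return query_llm(prompt)
```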
## Qualitative Analysis
We conduct a qualitative analysis of the results of GPT-4V with chain-of-thought prompting. We randomly sample 75 questions (2.5%) from the different disciplines of each language, in which GPT-4V makes errors in its responses or analyses in at least one language. We analyze the causes of these failures and divide them into six categories: perceptual error, lack of knowledge, reasoning error, textual understanding error, annotation error, and answer extraction error. Perceptual errors, lack of knowledge, and reasoning errors account for the major causes of failure (96% in Chinese, 95% in English, and 92% in German). GPT-4V tends to exhibit a lack of knowledge on the Chinese part of M4U, while reasoning errors are more likely to occur in German and English. These findings demonstrate that LMMs still have significant room for improvement, particularly in multilingual multimodal reasoning.
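The per-language breakdowns above come from manually labeling each sampled failure with one of the six categories and tallying category frequencies per language. A minimal sketch of such a tally is shown below; the `annotations.json` layout is a hypothetical example, not a released artifact.

```python
import json
from collections import Counter

# Hypothetical annotation layout: a list of records such as
# {"language": "Chinese", "category": "reasoning error"}, one per failure case.
with open("annotations.json", encoding="utf-8") as f:
    records = json.load(f)

MAJOR = {"perceptual error", "lack of knowledge", "reasoning error"}

for language in ("Chinese", "English", "German"):
    counts = Counter(r["category"] for r in records if r["language"] == language)
    total = sum(counts.values())
    if total == 0:
        continue
    major = sum(n for cat, n in counts.items() if cat in MAJOR)
    print(f"{language}: {100 * major / total:.0f}% of {total} failures "
          f"come from perception, knowledge, or reasoning errors")
```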

## Visualization Examples