| # | Model | Method | Source | Size | Chinese | English | German | Average |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | LMM | Link | - | 47.8 | 49.4 | 45.6 | 47.6 |
| 2 | GPT-4V + CoT | LMM | Link | - | 43.9 | 43.6 | 40.3 | 42.6 |
| 3 | GPT-4V | LMM | Link | - | 39.7 | 39.4 | 37.3 | 38.8 |
| 4 | LLaVA-NeXT-34B | LMM | Link | 34B | 38.5 | 36.2 | 35.2 | 36.6 |
| 5 | Gemini 1.0 Pro + CoT | LMM | Link | - | 34.4 | 34.2 | 33.9 | 34.2 |
| 6 | Gemini 1.0 Pro | LMM | Link | - | 34.9 | 32.7 | 30.8 | 32.8 |
| 6 | Qwen-1.5-14B-Chat + Caption | Tool | Link | 14B | 32.7 | 32.0 | 33.8 | 32.8 |
| 8 | Yi-VL-34B | LMM | Link | 34B | 33.5 | 33.3 | 30.5 | 32.4 |
| 9 | Yi-VL-6B | LMM | Link | 6B | 33.4 | 31.4 | 29.7 | 31.5 |
| 10 | DeepSeek-VL | LMM | Link | 7B | 30.4 | 32.8 | 30.8 | 31.3 |
| 11 | Qwen-1.5-7B-Chat + Caption | Tool | Link | 7B | 34.2 | 27.7 | 31.7 | 31.2 |
| 11 | Gemini 1.0 Pro + Caption | Tool | Link | - | 31.6 | 31.1 | 30.9 | 31.2 |
| 13 | InternLM-XComposer | LMM | Link | 7B | 31.8 | 31.6 | 29.1 | 30.8 |
| 14 | LLaVA-NeXT-Mistral-7B | LMM | Link | 7B | 28.2 | 30.6 | 29.4 | 29.4 |
| 15 | CogVLM-Chat | LMM | Link | 7B | 28.9 | 30.2 | 28.5 | 29.2 |
| 16 | Qwen-VL-Chat | LMM | Link | 7B | 29.7 | 29.9 | 27.1 | 28.9 |
| 17 | LLaVA-NeXT-Vicuna-13B | LMM | Link | 13B | 21.9 | 30.9 | 29.3 | 27.4 |
| 18 | Mistral-Instruct-v0.2-7B + Caption | Tool | Link | 7B | 24.9 | 24.9 | 26.9 | 25.6 |
| 19 | LLaVA-NeXT-Vicuna-7B | LMM | Link | 7B | 11.8 | 29.8 | 28.2 | 23.3 |
| 20 | InstructBLIP-Vicuna-7B | LMM | Link | 7B | 13.7 | 28.1 | 19.7 | 20.5 |
| 21 | InstructBLIP-Vicuna-13B | LMM | Link | 13B | 10.5 | 23.4 | 18.6 | 17.5 |
| 22 | Ying-VLM | LMM | Link | 13B | 22.3 | 11.2 | 15.6 | 16.4 |
| 23 | VisualGLM | LMM | Link | 6B | 8.7 | 22.4 | 13.5 | 14.9 |
Method types: LMM 🖼️: Large Multimodal Model, Tool 🛠️: Tool-augmented Large Language Model. The captions are generated by Gemini 1.0 Pro.
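
The Average column appears to be the unweighted mean of the Chinese, English, and German scores rounded to one decimal, and entries with equal averages seem to share a rank, with the next rank skipped (6, 6, 8). The sketch below reproduces that arithmetic under those assumptions; `Entry` and `rank_entries` are hypothetical names for illustration and are not taken from the M4U codebase.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    chinese: float
    english: float
    german: float

    @property
    def average(self) -> float:
        # Assumption: unweighted mean of the three languages, one decimal place.
        return round((self.chinese + self.english + self.german) / 3, 1)


def rank_entries(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Sort by average (descending); tied averages share a rank."""
    ordered = sorted(entries, key=lambda e: e.average, reverse=True)
    ranked: List[Tuple[int, Entry]] = []
    for position, entry in enumerate(ordered, start=1):
        if ranked and ranked[-1][1].average == entry.average:
            place = ranked[-1][0]  # tie: reuse the previous rank
        else:
            place = position       # otherwise the rank is the 1-based position
        ranked.append((place, entry))
    return ranked


# Four rows copied from the table above.
rows = [
    Entry("Gemini 1.0 Pro + CoT", 34.4, 34.2, 33.9),
    Entry("Gemini 1.0 Pro", 34.9, 32.7, 30.8),
    Entry("Qwen-1.5-14B-Chat + Caption", 32.7, 32.0, 33.8),
    Entry("Yi-VL-34B", 33.5, 33.3, 30.5),
]

for place, entry in rank_entries(rows):
    print(place, entry.model, entry.average)
```

Ranks here are relative to the four sample rows only, so the two tied entries come out as 2 and 2 rather than 6 and 6 as in the full table.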
🚨 The leaderboard is updated continuously. To submit your results, please send your result JSON files to Hongyu Wang and Ruiping Wang.
🔮 The evaluation instructions are available at Evaluations on M4U.