| Rank | Model | LLM Params | Frames | Date | Overall (%) | Real-Time Visual Understanding (%) | Omni-Source Understanding (%) | Contextual Understanding (%) |
|---|---|---|---|---|---|---|---|---|
| | Gemini 1.5 Pro | - | Video | 2024-06-15 | 70.26 | 77.39 | 67.80 | 51.06 |
| | MiniCPM-o 2.6 (OpenBMB) | 8B | 60 | 2025-01-14 | 66.01 | 79.88 | 53.40 | 38.45 |
| | GPT-4o (OpenAI) | - | 60 | 2024-06-15 | 64.10 | 74.54 | 50.95 | 47.94 |
| | InternLM-XC2.5-OL (Shanghai AI Lab) | 7B | 64 | 2024-12-12 | 60.80 | 75.36 | 46.20 | 33.58 |
| | Claude 3.5 Sonnet (Anthropic) | - | 20 | 2024-07-30 | 59.71 | 74.04 | 41.40 | 37.83 |
| | LLaVA-OneVision (Bytedance & NTU S-Lab) | 7B | 32 | 2024-08-08 | 58.43 | 74.27 | 40.83 | 30.96 |
| | Qwen2-VL (Alibaba) | 7B | 768 | 2024-08-19 | 56.99 | 71.15 | 40.73 | 33.08 |
| | MiniCPM-V 2.6 (OpenBMB) | 8B | 64 | 2024-08-12 | 57.67 | 72.43 | 40.23 | 33.39 |
| | VITA-1.5 (NJU) | 7B | 16 | 2025-01-03 | 57.36 | 70.88 | 40.80 | 35.83 |
| | InternVL2 (Shanghai AI Lab) | 8B | 32 | 2024-07-18 | 57.04 | 70.11 | 42.73 | 34.10 |
| | LLaVA-NeXT-Video (Bytedance & NTU S-Lab) | 32B | 32 | 2024-05-10 | 56.68 | 69.83 | 41.73 | 34.29 |
| | Kangaroo (Meituan & UCAS) | 8B | 64 | 2024-07-23 | 53.34 | 65.76 | 40.04 | 31.18 |
| | LongVA (NTU S-Lab) | 7B | 128 | 2024-06-25 | 50.66 | 63.11 | 35.93 | 30.21 |
| | VILA-1.5 (NVIDIA & MIT) | 8B | 14 | 2024-07-21 | 49.46 | 61.54 | 37.53 | 26.66 |
| | Video-CCAM (QQMM) | 14B | 32 | 2024-07-16 | 42.53 | 53.42 | 32.22 | 24.06 |
| | VideoLLaMA 2 (Alibaba) | 7B | 32 | 2024-08-29 | 43.33 | 52.58 | 35.92 | 23.69 |