MBZUAI benchmark exposes AI vision agent flaws
Top-performing system scores just a 45% success rate on visual reasoning tasks
#UAE #evaluation - Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) researchers have developed VBench, a comprehensive vision-centric agentic reasoning benchmark that reveals critical flaws in advanced AI agents. According to the new research, the top-performing system achieved just a 45% success rate across 828 complex visual reasoning tasks. The Agent-X Competition prize-winning evaluation framework, developed by MBZUAI in collaboration with researchers from the University of Oxford and the University of Central Florida, tracks every decision step rather than final answers alone, exposing fundamental limitations. The benchmark assesses AI systems’ ability to reason visually, act step-by-step and use external tools reliably in real-world scenarios spanning autonomous driving, surveillance, sports analysis, web browsing and mathematical reasoning.
SO WHAT? - Whilst AI startups, big tech and research labs continue to advance AI models and agentic systems, the race is on to develop evaluation frameworks that can assess performance in real-world scenarios. As AI development shifts its focus from natural language capabilities towards action execution, new benchmarks are needed. Designed to assess agentic systems, the VBench benchmark measures reasoning quality across multiple steps, giving developers deeper insight into how to build more reliable systems.
Here are some key points about the new evaluation benchmark:
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) researchers, in collaboration with University of Oxford and University of Central Florida researchers, have developed VBench, a comprehensive vision-centric agentic reasoning benchmark that reveals critical flaws in advanced AI agents.
The VBench research team were awarded second place in the prestigious Agent-X Competition at the Agentic-AI Summit 2025, hosted earlier this month by Berkeley RDI at the University of California, Berkeley.
The new benchmark reveals flaws in advanced AI systems, with the top-performing AI system achieving only a 45% success rate on complex visual reasoning tasks.
The VBench evaluation framework tests 828 tasks across six environments (autonomous driving, surveillance, sports analysis, web browsing, mathematics and generic visual reasoning), using 14 executable tools with authentic visual data.
Lead researcher Tajamul Ashraf identified the core problem as action reliability rather than cognitive limitations, with agents frequently hallucinating tools, ignoring formats and skipping critical reasoning steps.
Unlike previous evaluations that focus on final answers, VBench tracks complete reasoning traces, measuring visual input interpretation, tool selection, argument formatting and logical consistency at every decision step (see the illustrative sketch after this list).
Leading models including OpenAI’s GPT-4o and Google DeepMind’s Gemini-2.5-Pro demonstrated superior internal reasoning and consistency but failed to complete half of the benchmark tasks due to formatting errors and invalid tool arguments.
One-third of system errors involved malformed tool arguments or hallucinated function calls, whilst video-based tasks proved especially challenging with models struggling to track entities across frames.
The benchmark introduces multi-level metrics rewarding coherent reasoning chains rather than correct conclusions alone, using multiple judges including open-source models and human evaluators to reduce bias.
Future development plans include expanding VBench to multilingual tasks, longer time horizons and evaluation-as-a-service platforms that help developers track agent performance in production environments.
VBench team members include Tajamul Ashraf, Amal Saqib, Hanan Gani, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr (FREng, FRS), Fahad Khan, Rao Muhammad Anwer and Salman Khan.
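To make the step-level idea concrete, here is a minimal Python sketch of how a trace of tool calls could be checked for hallucinated tools and malformed arguments, then scored as a chain rather than by its final answer alone. The tool names, schemas and scoring rule below are illustrative assumptions, not the actual VBench implementation, which also judges visual grounding and logical consistency using model and human evaluators.

```python
# Illustrative sketch only: hypothetical tools and a simplified scoring rule,
# not the actual VBench evaluation code.
from dataclasses import dataclass

# Hypothetical registry of executable tools and their expected argument names.
TOOL_SCHEMAS = {
    "detect_objects": {"image_id"},
    "track_entity": {"video_id", "entity"},
    "calculator": {"expression"},
}

@dataclass
class Step:
    tool: str          # tool the agent chose to call
    arguments: dict    # arguments the agent supplied
    rationale: str = ""  # the agent's stated reasoning for this step

def check_step(step: Step) -> dict:
    """Check one step: does the tool exist, and are its arguments well formed?"""
    tool_exists = step.tool in TOOL_SCHEMAS
    args_valid = tool_exists and set(step.arguments) == TOOL_SCHEMAS[step.tool]
    return {
        "hallucinated_tool": not tool_exists,
        "malformed_arguments": tool_exists and not args_valid,
        "step_ok": bool(tool_exists and args_valid),
    }

def score_trace(steps: list) -> float:
    """Fraction of valid steps, rewarding a coherent chain rather than only the final answer."""
    if not steps:
        return 0.0
    return sum(check_step(s)["step_ok"] for s in steps) / len(steps)

if __name__ == "__main__":
    trace = [
        Step("detect_objects", {"image_id": "frame_001"}, "locate vehicles in the scene"),
        Step("track_entity", {"video": "clip_07", "entity": "car_2"}, "follow the same car"),  # wrong argument name
        Step("segment_everything_v9", {"image_id": "frame_001"}, "segment the image"),         # tool does not exist
    ]
    print(f"step-level score: {score_trace(trace):.2f}")  # 0.33 for this example
```

In this toy trace only one of three steps passes, mirroring the kinds of failures the researchers report: a malformed argument and a hallucinated tool call drag the score down even if a final answer happens to be correct.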
ZOOM OUT - VBench is the latest evaluation to address growing concerns about AI agent reliability as demand for agentic systems in critical applications soars. The benchmark reflects broader industry concerns that more comprehensive evaluation methods are required to assess real-world performance. AI evaluation frameworks have become a key focus for MBZUAI's research, as it seeks to shape practices that ensure AI systems are ready for the real world. The university has previously open-sourced a number of benchmarks.
[Written and edited with the assistance of AI]
LINKS
VBench research paper (arXiv)
VBench code (GitHub)
VBench data (Hugging Face)
Agent-X competition entry (YouTube)
Read more about MBZUAI AI evaluation benchmarks:
New benchmark challenges inclusivity of global LLMs (Middle East AI News)
HF, Inception & MBZUAI launch Arabic LLM leaderboard (Middle East AI News)
MBZUAI launches multimodal Arabic AI benchmark (Middle East AI News)