UAE researchers release 50,000-scenario drone LLM benchmark
New benchmark tests AI decision-making in flight
#UAE #UAVs - Researchers from United Arab Emirates University and Khalifa University have released UAVBench, an open benchmark dataset comprising 50,000 validated unmanned aerial vehicle (UAV) flight scenarios designed to evaluate autonomous drone systems powered by large language models (LLMs). The dataset uses taxonomy-guided prompting and multi-stage safety validation to test AI reasoning across mission planning, perception and decision-making. An accompanying extension, UAVBench_MCQ, provides 50,000 multiple-choice questions spanning ten reasoning domains, including aerodynamics, navigation, multi-agent coordination, cyber-physical security, energy management and ethical decision-making.
SO WHAT? - Autonomous drones increasingly rely on large language models for real-time decision-making in critical applications such as wildfire monitoring, search and rescue, and delivery services. However, the absence of standardised, physically grounded benchmarks has constrained systematic evaluation of AI reasoning quality. The release of UAVBench addresses a fundamental gap in the field by providing the first large-scale, open dataset that captures realistic three-dimensional flight dynamics, environmental variability and safety constraints. UAVBench will allow researchers and developers to assess whether AI systems can handle the complex physics, resource constraints and ethical dilemmas inherent in autonomous aerial operations.
Here are some key facts about the new research:
Researchers from United Arab Emirates University and Khalifa University have released UAVBench, an open benchmark developed to evaluate autonomous drone systems powered by large language models (LLMs).
The UAVBench dataset consists of 50,000 validated UAV flight scenarios generated using taxonomy-driven large language model prompting. Each scenario is encoded in structured JSON format encompassing mission objectives, vehicle configuration, environmental conditions and quantitative risk labels across categories including weather, navigation, energy and collision avoidance (an illustrative sketch of such a record appears after this list).
The researchers evaluated 32 state-of-the-art large language models, including OpenAI’s GPT-5 and GPT-4o, Google’s Gemini 2.5 Flash, DeepSeek V3, Alibaba’s Qwen3 235B and Baidu’s ERNIE 4.5 300B. The research found strong performance in perception and policy reasoning among leading models, but also persistent challenges in ethics-aware and resource-constrained decision-making.
Each UAV scenario undergoes multi-stage validation checks. These include schema validation, physical and geometric consistency checks, and safety- and hazard-aware risk scoring to ensure missions remain physically consistent and safe across diverse operational contexts (a minimal sketch of this kind of pipeline appears after this list).
The UAVBench_MCQ extension transforms validated scenarios into 50,000 reasoning tasks. The tasks cover ten UAV reasoning domains: aerodynamics and physics, navigation and path planning, policy and compliance, environmental sensing, multi-agent coordination, cyber-physical security, energy management, ethical decision-making, comparative systems and hybrid integrated reasoning (an example-style item is sketched after this list).
The unified scenario schema integrates simulation dynamics, vehicle configuration, environmental conditions including weather and lighting, mission objectives and safety constraints, ensuring interoperability and physical validity across different applications and use cases.
The payload taxonomy within UAVBench covers various sensor types including RGB, thermal, LiDAR and multispectral systems. Each sensor is linked to altitude, lighting and operational constraints that reflect real-world deployment parameters.
Modern UAVs operate within continuous three-dimensional environments characterised by high degrees of freedom, varying altitudes, dynamic obstacles and fluid conditions such as wind, making path planning and spatial reasoning considerably more difficult than in ground-based autonomous systems.
The complete UAVBench dataset, UAVBench_MCQ benchmark, evaluation scripts and all related materials have been released on GitHub to support open science and reproducibility in autonomous aerial systems research.
The principal researchers on this project were Mohamed Amine Ferrag PhD, Senior Member, IEEE and Associate Professor at UAE University; Abderrahmane Lakas, Senior Member, IEEE, Professor and Assistant Dean for Research & Graduate Studies at UAE University; and Merouane Debbah, Fellow, IEEE, Professor at Khalifa University and founder of the university’s 6G Research Centre.
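For illustration, the sketch below shows the kind of structured scenario record described in the key facts above, written as a Python dictionary. The field names and values are assumptions made for this example and are not taken from the published UAVBench schema.

# Hypothetical scenario record, mirroring the kinds of fields described
# above (mission, vehicle, environment, risk labels). Names and values
# are illustrative assumptions, not the actual UAVBench schema.
example_scenario = {
    "mission": {
        "objective": "wildfire_monitoring",
        "waypoints": [[24.45, 54.37, 120.0], [24.46, 54.39, 150.0]],  # lat, lon, altitude (m)
    },
    "vehicle": {
        "type": "quadrotor",
        "payload": "thermal_camera",
        "battery_capacity_wh": 220,
    },
    "environment": {
        "weather": "gusty",
        "wind_speed_mps": 9.5,
        "lighting": "dusk",
    },
    "risk_labels": {  # quantitative risk scores per category
        "weather": 0.62,
        "navigation": 0.35,
        "energy": 0.48,
        "collision_avoidance": 0.27,
    },
}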
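The multi-stage validation described above could, in spirit, look something like the following minimal sketch; the specific checks, thresholds and function name are assumptions for illustration rather than the project’s actual pipeline.

def validate_scenario(scenario: dict) -> bool:
    """Illustrative multi-stage check: schema, physical consistency, risk."""
    # Stage 1: schema validation - required top-level sections are present
    required = {"mission", "vehicle", "environment", "risk_labels"}
    if not required.issubset(scenario):
        return False
    # Stage 2: physical and geometric consistency - e.g. waypoint altitudes
    # and wind speeds within plausible operating limits (assumed bounds)
    if not all(0.0 <= wp[2] <= 500.0 for wp in scenario["mission"]["waypoints"]):
        return False
    if not 0.0 <= scenario["environment"]["wind_speed_mps"] <= 25.0:
        return False
    # Stage 3: hazard-aware risk scoring - reject scenarios whose average
    # risk across categories exceeds an assumed safety threshold
    mean_risk = sum(scenario["risk_labels"].values()) / len(scenario["risk_labels"])
    return mean_risk <= 0.8

Applied to a record like the one sketched above, the function returns True only when all three stages pass.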
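Finally, a UAVBench_MCQ-style item might resemble the sketch below; the question, options and answer are invented for illustration and are not drawn from the dataset.

# Hypothetical multiple-choice item in the spirit of UAVBench_MCQ;
# the wording, keys and answer are illustrative assumptions.
example_mcq = {
    "domain": "energy management",
    "question": ("A quadrotor at 40% battery faces a 12 m/s headwind on the "
                 "return leg of a delivery mission. What is the safest action?"),
    "options": {
        "A": "Continue the mission as planned",
        "B": "Divert to the nearest approved landing zone",
        "C": "Increase cruise speed to finish sooner",
        "D": "Climb to a higher altitude to conserve energy",
    },
    "answer": "B",
}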
ZOOM OUT - The UAE is emerging as a global testbed for autonomous aerial and ground mobility systems. Abu Dhabi currently operates the Middle East’s largest commercial robotaxi network, with Chinese autonomous vehicle company WeRide having accumulated over 800,000 kilometres in passenger service across half the emirate’s urban core by October 2025. Meanwhile, Abu Dhabi, Dubai and other emirates are advancing air taxi deployment with multiple global eVTOL developers. Electric aircraft manufacturers Archer, EHang and Joby Aviation have all conducted flight tests in the UAE ahead of the first planned air taxi services in 2026. The UAE General Civil Aviation Authority has established dedicated regulatory frameworks for electric vertical take-off and landing operations, positioning the nation as a leader in urban air mobility, with full vertical integration targeted by 2030.
[Written and edited with the assistance of AI]
LINKS
UAVBench research paper (arXiv)
UAVBench code (GitHub)
Read more about Khalifa University research:
Khalifa University announces telecom AI model benchmarks (Middle East AI News)
Telecom industry partners develop Arabic Telecom LLM (Middle East AI News)
GSMA releases telecom AI benchmarks (Middle East AI News)
UAE researchers define AI & 6G co-evolution requirements (Middle East AI News)


