M42 releases MEDIC leaderboard to benchmark clinical LLMs
New leaderboard aims to empower safe, effective healthcare LLMs
#UAE #healthcare - Abu Dhabi-based global healthcare group M42, part of the G42 Group, has released the MEDIC Leaderboard to evaluate large language models (LLMs) for clinical applications. Designed to assess LLMs across five critical dimensions (medical reasoning, ethics, safety, data understanding, and in-context learning), the framework helps ensure models meet real-world healthcare demands and safety requirements. The leaderboard builds on M42’s work last year developing the MEDIC framework for evaluating clinical LLMs. The framework, published in September 2024, evaluates models on tasks such as summarisation, clinical note generation, and hallucination detection.
SO WHAT? - The MEDIC Leaderboard addresses the urgent need for rigorous evaluation of LLMs in healthcare, ensuring safety, ethical compliance, and real-world effectiveness. A significant gap remains between the enormous promise of GenAI and the reality, due to the healthcare sector’s zero tolerance for errors concerning data, medical information and advice. By quantifying model strengths and limitations, MEDIC aims to improve AI reliability in sensitive medical contexts, supporting advancements in clinical NLP. The new MEDIC Leaderboard allows researchers and developers to test and benchmark their models, informing further development of safe, compliant and effective clinical LLMs.
Here are some key facts about MEDIC:
M42, part of Abu Dhabi’s G42 Group, has released the MEDIC Leaderboard to benchmark large language models (LLMs) in clinical settings. The new leaderboard follows ongoing research by M42 to develop a framework for holistically evaluating LLMs for clinical use.
The MEDIC framework evaluates models across five critical dimensions; MEDIC is an acronym for Medical reasoning, Ethical and bias concerns, Data and language understanding, In-context learning, and Clinical safety and risk assessment.
The MEDIC Leaderboard introduces a novel cross-examination approach to evaluate performance without relying on reference outputs (see the illustrative sketch after this list). The assessment includes both open-ended and closed-ended medical questions, clinical summarisation, note generation, and hallucination detection.
Early findings show significant performance differences between general-purpose and medically fine-tuned LLMs, highlighting trade-offs in areas such as hallucination rates and cost-efficiency.
The tool aims to bridge the gap between theoretical AI capabilities and practical healthcare applications, driving safer and more effective clinical NLP solutions.
Researchers, developers and practitioners are invited to submit clinical models, compare results, and contribute feedback to improve the tool and open-source resources available to clinical LLM projects.
Note: M42 is careful to point out that models evaluated and benchmarked by MEDIC are not clinically validated and are for academic research only.
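For readers curious how a reference-free, cross-examination style check can work in practice, here is a minimal Python sketch. It is illustrative only: `ask_llm` stands in for any chat-completion call, and the prompts, function names, and the two scores shown (coverage and conformity) are assumptions made for the sake of the example, not M42’s actual MEDIC implementation.

```python
# Illustrative sketch of a cross-examination style evaluation for a
# generated clinical summary, scored without a human-written reference.
# All prompts and names here are hypothetical stand-ins.

from typing import Callable, List


def generate_questions(ask_llm: Callable[[str], str], document: str, n: int = 5) -> List[str]:
    """Ask an LLM to derive short factual questions from a document."""
    prompt = (
        f"Write {n} short factual questions that are answerable "
        f"from the following clinical text, one per line:\n\n{document}"
    )
    return [q.strip() for q in ask_llm(prompt).splitlines() if q.strip()]


def is_answerable(ask_llm: Callable[[str], str], question: str, context: str) -> bool:
    """Ask an LLM judge whether a question is answerable from the context alone."""
    prompt = (
        "Answer YES or NO: can the following question be answered "
        f"using only this text?\n\nText: {context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt).strip().upper().startswith("YES")


def cross_examine(ask_llm: Callable[[str], str], source: str, generated: str) -> dict:
    """Score a generated summary against its source, with no reference output.

    coverage:   fraction of questions drawn from the source that the summary
                can answer (low coverage suggests omissions).
    conformity: fraction of questions drawn from the summary that the source
                can answer (low conformity suggests hallucinated content).
    """
    source_questions = generate_questions(ask_llm, source)
    summary_questions = generate_questions(ask_llm, generated)
    coverage = sum(
        is_answerable(ask_llm, q, generated) for q in source_questions
    ) / max(len(source_questions), 1)
    conformity = sum(
        is_answerable(ask_llm, q, source) for q in summary_questions
    ) / max(len(summary_questions), 1)
    return {"coverage": coverage, "conformity": conformity}
```

In this sketch, low coverage signals that the generated note omits facts from the source, while low conformity flags content the source cannot support, i.e. likely hallucination; neither score requires a reference output, which is the point of a cross-examination setup.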
ZOOM OUT - M42 is a pioneer in applying Generative AI to clinical applications and first released its Med42 clinical large language model in 2023. Last year the G42 company announced version two of Med42, a significant upgrade to its first model, alongside MEDIC, its clinical LLM evaluation framework. M42 is developing clinical LLMs in collaboration with a variety of partners, including members of Abu Dhabi’s advanced digitally-enabled healthcare system, among them the Department of Health – Abu Dhabi and Cleveland Clinic Abu Dhabi.
LINKS
MEDIC Leaderboard (Hugging Face)
MEDIC framework page (Hugging Face)
MEDIC research paper (arXiv)
Find out more about clinical LLMs:
M42 delivers framework for evaluating clinical LLMs (Middle East AI News)
M42 releases new versions of clinical LLM (Middle East AI News)
🎧 Podcast: M42’s new clinical 70B LLM (Middle East AI News)