M42 delivers groundbreaking framework for evaluating clinical LLMs
New MEDIC framework assesses clinical safety, ethics, and hallucination risks
#UAE #LLMs - A team of AI researchers at M42, G42’s global healthcare group, has developed a framework designed to holistically evaluate large language models (LLMs) for clinical use. Called MEDIC, the evaluation framework assesses five key areas: medical reasoning, ethics and bias, language understanding, in-context learning, and clinical safety. By identifying differences in performance across model sizes and applications, MEDIC can offer critical insights into LLM selection for clinical use, helping to improve safety and efficacy in healthcare applications.
SO WHAT? - Generative AI is expected to have a transformational effect on healthcare globally, helping to reduce pressure on physicians, providing specialised medical knowledge to clinicians on demand and elevating levels of patient service, patient knowledge and preventative healthcare. That said, there remains a significant gap between the promise of GenAI and the reality, largely because the healthcare sector has zero tolerance for errors concerning data, medical information and advice. AI developers such as M42 are driving rapid advances in developing new large language models and making them more useful, more reliable and therefore safer for use by healthcare professionals and patients. However, evaluation of such models needs to go far beyond the performance evaluations used for models in many other sectors. It is hoped that the new MEDIC evaluation framework will give clinicians, AI developers and policymakers a basis on which to develop, agree and eventually standardise the evaluation of clinical large language models.
Here are some key points about the new evaluation framework:
Abu Dhabi-headquartered global healthcare group M42, part of the G42 group, has announced a comprehensive framework designed to provide a holistic view of LLM capabilities in clinical contexts.
Called MEDIC, the evaluation framework aims to address the gap between theoretical LLM capabilities and real-world implementation in clinical settings. The framework will allow developers, technicians, clinicians and policymakers to assess LLMs across multiple dimensions critical to healthcare.
MEDIC is an acronym that stands for Medical reasoning, Ethical and bias concerns, Data and language understanding, In-context learning, and Clinical safety and risk assessment.
MEDIC incorporates a novel cross-examination framework to evaluate performance on tasks like medical question-answering, summarisation, and clinical note generation.
The framework measures LLM performance without relying on reference outputs, quantifying areas like coverage and hallucination detection.
MEDIC can help identify performance disparities between baseline models and those fine-tuned for medical applications, offering insights into where specific models excel or underperform.
Developed by M42’s AI research team in Abu Dhabi, in collaboration with other AI and medical professionals, MEDIC is a first-of-its-kind framework that will help ensure LLMs used in healthcare are optimally evaluated for safety, ethical considerations, and practical use.
M42 is currently working on how to best enable other researchers to run their own evaluations. The code needed to run such evaluations is expected to be open-sourced by the research team in the coming weeks.
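To make the idea of reference-free scoring more concrete, here is a minimal Python sketch of how coverage and hallucination scores might be derived from a cross-examination step. It is an illustration only and not M42's implementation: the function and field names are hypothetical, and it assumes that questions have already been generated from both the source text and the model output, with each question judged as answerable or not from the other text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CrossExamScores:
    """Hypothetical container for two reference-free scores."""
    coverage: float            # share of source-derived questions the output can answer
    hallucination_rate: float  # share of output-derived questions the source cannot support


def _fraction_true(flags: Sequence[bool]) -> float:
    """Return the fraction of True values, or 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def cross_exam_scores(
    source_questions_answered_by_output: Sequence[bool],
    output_questions_supported_by_source: Sequence[bool],
) -> CrossExamScores:
    """Score a generated text against its source without a reference output.

    Assumption: each boolean is a judgement (for example from an LLM or a
    human reviewer) on one generated question. No gold-standard summary
    or clinical note is needed.
    """
    coverage = _fraction_true(source_questions_answered_by_output)
    if output_questions_supported_by_source:
        hallucination_rate = 1.0 - _fraction_true(output_questions_supported_by_source)
    else:
        # With no output-derived questions there is nothing to flag.
        hallucination_rate = 0.0
    return CrossExamScores(coverage, hallucination_rate)


# Example: 3 of 4 source questions are answerable from the summary,
# and 1 of 2 summary questions has no support in the source.
scores = cross_exam_scores([True, True, False, True], [True, False])
print(scores)
```

In this toy example, the summary would score 0.75 for coverage and 0.5 for hallucination rate. The metrics and judging procedure that MEDIC itself uses are defined in the research paper.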
ZOOM OUT - Last year M42 released Med42, the company’s first clinical large language model, for review and testing by academic and healthcare institutions around the world. In July of this year, M42 announced version two of Med42, a significant upgrade to its first model that takes into account feedback from the global healthcare community. Released in 8 billion parameter and 70 billion parameter versions, Med42 V2 was built on Meta's Llama 3. The MEDIC evaluation framework is another output from the Med42 development programme.
Read more about M42’s large language models:
M42 releases new versions of clinical LLM (Middle East AI News)
🎧 Podcast: M42’s new clinical 70B LLM (Middle East AI News)
LINKS
MEDIC page (Hugging Face)
MEDIC research paper (arXiv)