CRAFT-MD is a comprehensive evaluation framework designed to test the conversational reasoning abilities of clinical Large Language Models (LLMs) beyond the accuracy traditionally measured with multiple-choice questions (MCQs).
It simulates doctor-patient interactions in which the clinical-LLM's performance in gathering medical histories, synthesizing information, and forming accurate diagnoses is assessed by a multi-agent setup: a patient-AI, a grader-AI, and medical experts who validate the results.
CRAFT-MD is designed to be flexible and scalable, allowing for the integration of new datasets and the evaluation of emerging models.
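As a rough illustration of this multi-agent setup, the sketch below wires a doctor-AI (the clinical-LLM under test) to a patient-AI over an OpenAI-compatible chat API. The prompts, model name, and loop structure are illustrative assumptions, not CRAFT-MD's actual implementation.

```python
# Minimal sketch of a CRAFT-MD-style multi-turn evaluation loop.
# Prompts, model names, and helper structure are illustrative
# assumptions, not the framework's published implementation.
from openai import OpenAI

client = OpenAI()

def chat(system: str, messages: list[dict]) -> str:
    """Single chat completion against an OpenAI-compatible API."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stands in for the clinical-LLM or patient-AI
        messages=[{"role": "system", "content": system}, *messages],
    )
    return response.choices[0].message.content

def run_consultation(vignette: str, max_turns: int = 10) -> str:
    """Let a doctor-AI question a patient-AI until it commits to a diagnosis."""
    doctor_sys = ("You are a doctor. Ask one question at a time to reach a "
                  "diagnosis. When confident, reply 'DIAGNOSIS: <answer>'.")
    patient_sys = (f"You are a patient with this history:\n{vignette}\n"
                   "Answer only what is asked, in lay language.")
    doctor_view: list[dict] = [{"role": "user", "content": "Patient has arrived."}]
    for _ in range(max_turns):
        doctor_msg = chat(doctor_sys, doctor_view)
        if doctor_msg.startswith("DIAGNOSIS:"):
            return doctor_msg.removeprefix("DIAGNOSIS:").strip()
        # The patient-AI sees the same conversation with the roles swapped.
        patient_view = [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in doctor_view[1:]
        ] + [{"role": "user", "content": doctor_msg}]
        patient_msg = chat(patient_sys, patient_view)
        doctor_view += [{"role": "assistant", "content": doctor_msg},
                        {"role": "user", "content": patient_msg}]
    return "no diagnosis reached"
```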
⭐ Submit your models for evaluation with CRAFT-MD.
The CRAFT-MD leaderboard ranks models by their performance on Multi-turn Conversations in the free response question (FRQ) setting.
Rank | Model | Organization |
---|---|---|
1 | O1-preview-0912 | OpenAI |
2 | GPT-4o-0806 | OpenAI |
3 | O1-mini-0912 | OpenAI |
4 | GPT-4-1106 | OpenAI |
5 | LLaMA-3.1-8b | Meta AI |
6 | LLaMA-3-8b | Meta AI |
7 | GPT-3.5-1106 | OpenAI |
8 | Qwen-2.5-7b | Alibaba Cloud |
9 | Qwen-2-7b | Alibaba Cloud |
10 | Mistral-v0.1-7b | Mistral AI |
11 | LLaMA-2-7b | Meta AI |
12 | Mistral-v0.2-7b | Mistral AI |
The evaluation dataset consists of 2000 questions, each structured as a case vignette followed by four answer choices. Of these, 1800 were sourced from MedQA-USMLE, encompassing medical conditions commonly encountered in primary and specialist care settings. These questions span 12 medical specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. An additional 100 vignettes from an online question bank and 100 newly generated vignettes are also included.
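For illustration, a single evaluation record could be represented as below; the field names and schema are assumptions for exposition, not CRAFT-MD's published data format.

```python
# Illustrative sketch of one evaluation record; field names are
# assumptions, not CRAFT-MD's published schema.
from dataclasses import dataclass

@dataclass
class CaseVignette:
    case_id: str
    specialty: str      # one of the 12 specialties listed above
    vignette: str       # full case description, shown directly in the vignette setting
    choices: list[str]  # four answer options, used in the MCQ setting
    answer: str         # ground-truth diagnosis, graded directly in the FRQ setting
    source: str         # "medqa-usmle", "online-question-bank", or "new"

example = CaseVignette(
    case_id="derm-0001",
    specialty="Dermatology",
    vignette="A 23-year-old presents with an itchy, scaly rash ...",
    choices=["Psoriasis", "Atopic dermatitis", "Tinea corporis", "Lichen planus"],
    answer="Tinea corporis",
    source="medqa-usmle",
)
```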
Free response question (FRQ) accuracy:
Rank | Year | Model | Organization | Vignette ↑ | Multi-turn Conversation ↑ | Single-turn Conversation ↑ | Summarized Conversation ↑ |
---|---|---|---|---|---|---|---|
1 | 2024 | O1-preview-0912 | OpenAI | 0.51 | 0.365 | 0.212 | 0.321 |
2 | 2024 | GPT-4o-0806 | OpenAI | 0.562 | 0.335 | 0.191 | 0.35 |
3 | 2024 | O1-mini-0912 | OpenAI | 0.601 | 0.301 | 0.131 | 0.305 |
4 | 2023 | GPT-4-1106 | OpenAI | 0.486 | 0.264 | 0.133 | 0.272 |
5 | 2024 | LLaMA-3.1-8b | Meta AI | 0.346 | 0.187 | 0.116 | 0.163 |
6 | 2024 | LLaMA-3-8b | Meta AI | 0.291 | 0.174 | 0.117 | 0.16 |
7 | 2023 | GPT-3.5-1106 | OpenAI | 0.375 | 0.169 | 0.123 | 0.174 |
8 | 2024 | Qwen-2.5-7b | Alibaba Cloud | 0.242 | 0.117 | 0.099 | 0.098 |
9 | 2024 | Qwen-2-7b | Alibaba Cloud | 0.174 | 0.078 | 0.07 | 0.065 |
10 | 2023 | Mistral-v0.1-7b | Mistral AI | 0.174 | 0.078 | 0.062 | 0.083 |
11 | 2023 | LLaMA-2-7b | Meta AI | 0.169 | 0.066 | 0.065 | 0.081 |
12 | 2023 | Mistral-v0.2-7b | Mistral AI | 0.222 | 0.066 | 0.056 | 0.056 |
Multiple choice question (MCQ) accuracy:
Rank | Year | Model | Organization | Vignette ↑ | Multi-turn Conversation ↑ | Single-turn Conversation ↑ | Summarized Conversation ↑ |
---|---|---|---|---|---|---|---|
1 | 2024 | O1-preview-0912 | OpenAI | 0.931 | 0.745 | 0.624 | 0.768 |
2 | 2024 | GPT-4o-0806 | OpenAI | 0.879 | 0.68 | 0.57 | 0.729 |
3 | 2024 | O1-mini-0912 | OpenAI | 0.899 | 0.639 | 0.508 | 0.688 |
4 | 2023 | GPT-4-1106 | OpenAI | 0.821 | 0.627 | 0.52 | 0.671 |
5 | 2024 | LLaMA-3.1-8b | Meta AI | 0.723 | 0.51 | 0.421 | 0.56 |
6 | 2024 | LLaMA-3-8b | Meta AI | 0.688 | 0.497 | 0.417 | 0.54 |
7 | 2023 | GPT-3.5-1106 | OpenAI | 0.659 | 0.467 | 0.435 | 0.509 |
8 | 2024 | Qwen-2.5-7b | Alibaba Cloud | 0.66 | 0.437 | 0.411 | 0.465 |
9 | 2023 | Mistral-v0.2-7b | Mistral AI | 0.637 | 0.426 | 0.448 | 0.513 |
10 | 2024 | Qwen-2-7b | Alibaba Cloud | 0.591 | 0.363 | 0.355 | 0.431 |
11 | 2023 | Mistral-v0.1-7b | Mistral AI | 0.441 | 0.331 | 0.324 | 0.361 |
12 | 2023 | LLaMA-2-7b | Meta AI | 0.395 | 0.319 | 0.304 | 0.335 |