CRAFT-MD

What is CRAFT-MD?

CRAFT-MD is an evaluation framework that tests the conversational reasoning abilities of clinical large language models (LLMs), going beyond the traditional accuracy measured on multiple-choice questions (MCQs).

It simulates doctor-patient interactions in which the clinical LLM gathers a medical history, synthesizes the information, and forms a diagnosis. Performance is assessed by a multi-agent setup involving a patient-AI and a grader-AI, with medical experts validating the results.
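The interaction described above can be pictured as a simple loop. The sketch below is illustrative only and is not the CRAFT-MD implementation: the `Agent` type, `run_conversation`, `grade`, the "Final diagnosis:" prefix, and the scripted agents are all hypothetical names and conventions chosen for this example, and the exact-match grader merely stands in for the grader-AI.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: an agent maps the dialogue so far to its next utterance.
Turn = Tuple[str, str]  # (role, message)
Agent = Callable[[List[Turn]], str]

# Assumed convention: the doctor agent marks its final answer with this prefix.
FINAL_PREFIX = "Final diagnosis:"


@dataclass
class Transcript:
    turns: List[Turn] = field(default_factory=list)
    diagnosis: str = ""


def run_conversation(doctor: Agent, patient: Agent, max_turns: int = 10) -> Transcript:
    """Alternate doctor and patient messages until the doctor commits to a diagnosis."""
    transcript = Transcript()
    for _ in range(max_turns):
        doctor_msg = doctor(transcript.turns)
        transcript.turns.append(("doctor", doctor_msg))
        if doctor_msg.startswith(FINAL_PREFIX):
            transcript.diagnosis = doctor_msg[len(FINAL_PREFIX):].strip()
            break
        transcript.turns.append(("patient", patient(transcript.turns)))
    return transcript


def grade(diagnosis: str, reference: str) -> bool:
    """Stand-in for the grader-AI: case-insensitive exact match (an assumption)."""
    return diagnosis.strip().lower() == reference.strip().lower()


# Scripted stand-ins for the clinical LLM and the patient-AI (made-up dialogue).
def scripted_doctor(turns: List[Turn]) -> str:
    questions = ["How long have you had the rash?", "Is it itchy?"]
    asked = sum(1 for role, _ in turns if role == "doctor")
    if asked < len(questions):
        return questions[asked]
    return FINAL_PREFIX + " Contact dermatitis"


def scripted_patient(turns: List[Turn]) -> str:
    return "About three days, and yes, it itches."


transcript = run_conversation(scripted_doctor, scripted_patient)
is_correct = grade(transcript.diagnosis, "contact dermatitis")
```

In a real run the two scripted functions would be replaced by calls to the model under evaluation and to the patient-AI, and the grader's decisions would be the part that medical experts validate.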

CRAFT-MD is designed to be flexible and scalable, allowing for the integration of new datasets and the evaluation of emerging models.

⭐ Submit your models for evaluation with CRAFT-MD.

Leaderboard Overview

The CRAFT-MD leaderboard ranks models by their performance in multi-turn conversations in the free-response question (FRQ) setting.

Rank	Combined Dataset
1	O1-preview-0912 OpenAI
2	GPT-4o-0806 OpenAI
3	O1-mini-0912 OpenAI
4	GPT-4-1106 OpenAI
5	LLaMA-3.1-8b Meta AI
6	LLaMA-3-8b Meta AI
7	GPT-3.5-1106 OpenAI
8	Qwen-2.5-7b Alibaba Cloud
9	Qwen-2-7b Alibaba Cloud
10	Mistral-v0.1-7b Mistral AI
11	LLaMA-2-7b Meta AI
12	Mistral-v0.2-7b Mistral AI

CRAFT-MD Leaderboard

The evaluation dataset consists of 2000 questions, each structured as a case vignette followed by four answer choices. Of these, 1800 were sourced from MedQA-USMLE, encompassing medical conditions commonly encountered in primary and specialist care settings. These questions span 12 medical specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. An additional 100 vignettes from an online question bank and 100 newly generated vignettes are also included.
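As a rough illustration of the structure just described, one dataset item could be modeled as follows. This is a sketch under stated assumptions, not the project's actual schema: the `Vignette` class, its field names, and the `SOURCE_COUNTS` labels are hypothetical; only the counts and the four-choice format come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Vignette:
    """Hypothetical record for one question: a case vignette plus four answer choices."""
    case_text: str
    choices: Tuple[str, str, str, str]
    answer_index: int  # position of the correct choice, 0 to 3
    source: str

    def __post_init__(self) -> None:
        if len(self.choices) != 4:
            raise ValueError("each question has exactly four answer choices")
        if not 0 <= self.answer_index < 4:
            raise ValueError("answer_index must be between 0 and 3")


# Question counts per source, as stated in the dataset description.
SOURCE_COUNTS = {
    "MedQA-USMLE": 1800,
    "online question bank": 100,
    "newly generated": 100,
}

TOTAL_QUESTIONS = sum(SOURCE_COUNTS.values())
```

The three sources add up to the stated total of 2000 questions.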

Free-response questions (FRQ)

Rank	Year	Model	Vignette ↑	Multi-turn Conversation ↑	Single-turn Conversation ↑	Summarized Conversation ↑
1	2024	O1-preview-0912 OpenAI	0.51	0.365	0.212	0.321
2	2024	GPT-4o-0806 OpenAI	0.562	0.335	0.191	0.35
3	2024	O1-mini-0912 OpenAI	0.601	0.301	0.131	0.305
4	2023	GPT-4-1106 OpenAI	0.486	0.264	0.133	0.272
5	2024	LLaMA-3.1-8b Meta AI	0.346	0.187	0.116	0.163
6	2024	LLaMA-3-8b Meta AI	0.291	0.174	0.117	0.16
7	2023	GPT-3.5-1106 OpenAI	0.375	0.169	0.123	0.174
8	2024	Qwen-2.5-7b Alibaba Cloud	0.242	0.117	0.099	0.098
9	2024	Qwen-2-7b Alibaba Cloud	0.174	0.078	0.07	0.065
10	2023	Mistral-v0.1-7b Mistral AI	0.174	0.078	0.062	0.083
11	2023	LLaMA-2-7b Meta AI	0.169	0.066	0.065	0.081
12	2023	Mistral-v0.2-7b Mistral AI	0.222	0.066	0.056	0.056

Multiple-choice questions (MCQ)

Rank	Year	Model	Vignette ↑	Multi-turn Conversation ↑	Single-turn Conversation ↑	Summarized Conversation ↑
1	2024	O1-preview-0912 OpenAI	0.931	0.745	0.624	0.768
2	2024	GPT-4o-0806 OpenAI	0.879	0.68	0.57	0.729
3	2024	O1-mini-0912 OpenAI	0.899	0.639	0.508	0.688
4	2024	GPT-4-1106 OpenAI	0.821	0.627	0.52	0.671
5	2024	LLaMA-3.1-8b Meta AI	0.723	0.51	0.421	0.56
6	2024	LLaMA-3-8b Meta AI	0.688	0.497	0.417	0.54
7	2024	GPT-3.5-1106 OpenAI	0.659	0.467	0.435	0.509
8	2024	Qwen-2.5-7b Alibaba Cloud	0.66	0.437	0.411	0.465
9	2024	Mistral-v0.2-7b Mistral AI	0.637	0.426	0.448	0.513
10	2024	Qwen-2-7b Alibaba Cloud	0.591	0.363	0.355	0.431
11	2024	Mistral-v0.1-7b Mistral AI	0.441	0.331	0.324	0.361
12	2023	LLaMA-2-7b Meta AI	0.395	0.319	0.304	0.335
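
To make the ranking rule concrete, the sketch below sorts the first twelve rows of the table above by their Multi-turn Conversation score and computes how far each model falls from its Vignette score. The function names are hypothetical, and the tie-breaking behavior (a stable sort that keeps tied models in their listed order) is an assumption; the leaderboard does not state how ties are resolved.

```python
# (model, vignette accuracy, multi-turn accuracy), copied from the first twelve table rows.
ROWS = [
    ("O1-preview-0912", 0.51, 0.365),
    ("GPT-4o-0806", 0.562, 0.335),
    ("O1-mini-0912", 0.601, 0.301),
    ("GPT-4-1106", 0.486, 0.264),
    ("LLaMA-3.1-8b", 0.346, 0.187),
    ("LLaMA-3-8b", 0.291, 0.174),
    ("GPT-3.5-1106", 0.375, 0.169),
    ("Qwen-2.5-7b", 0.242, 0.117),
    ("Qwen-2-7b", 0.174, 0.078),
    ("Mistral-v0.1-7b", 0.174, 0.078),
    ("LLaMA-2-7b", 0.169, 0.066),
    ("Mistral-v0.2-7b", 0.222, 0.066),
]


def rank_by_multi_turn(rows):
    """Sort rows by multi-turn accuracy, highest first.

    sorted() is stable even with reverse=True, so tied models keep their
    input order (an assumed tie-break, not a documented leaderboard rule).
    """
    return sorted(rows, key=lambda row: row[2], reverse=True)


def accuracy_drop(vignette: float, multi_turn: float) -> float:
    """Absolute accuracy lost when moving from the vignette to a multi-turn conversation."""
    return vignette - multi_turn


ranked = rank_by_multi_turn(ROWS)
for position, (model, vignette, multi_turn) in enumerate(ranked, start=1):
    print(f"{position:>2}  {model:<16} drop={accuracy_drop(vignette, multi_turn):.3f}")
```

Every model in these rows scores lower in multi-turn conversation than on the plain vignette, and the model with the highest vignette score (O1-mini-0912) is not the one ranked first.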

CRAFT-MD

A Conversational Reasoning Assessment Framework for Testing in Medicine.
