CRAFT-MD

A Conversational Reasoning Assessment Framework for Testing in Medicine.

What is CRAFT-MD?

CRAFT-MD is a comprehensive evaluation framework designed to test the conversational reasoning abilities of clinical Large Language Models (LLMs), going beyond the accuracy traditionally measured with multiple-choice questions (MCQs).

It simulates doctor-patient interactions, where the clinical-LLM's performance in gathering medical histories, synthesizing information, and forming accurate diagnoses is assessed by a multi-agent setup involving a patient-AI, a grader-AI, and medical experts who validate the results.
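The interaction described above can be pictured as a simple turn-taking loop. The following is a minimal sketch under stated assumptions, not the CRAFT-MD implementation: the function names, the "Final diagnosis:" stop marker, the turn limit, and the exact-match grader are all hypothetical stand-ins, and the scripted agents replace what would be LLM calls in a real setup.

```python
from typing import Callable, List, Tuple

# A turn is (speaker, text). An agent maps the conversation so far to its next message.
# These signatures are hypothetical; real agents would wrap LLM API calls.
Turn = Tuple[str, str]
Agent = Callable[[List[Turn]], str]


def run_conversation(doctor: Agent, patient: Agent, max_turns: int = 10,
                     final_marker: str = "Final diagnosis:") -> Tuple[List[Turn], str]:
    """Alternate doctor and patient turns until the doctor commits to a diagnosis."""
    history: List[Turn] = []
    for _ in range(max_turns):
        doctor_msg = doctor(history)
        history.append(("doctor", doctor_msg))
        if final_marker in doctor_msg:
            # Everything after the marker is treated as the free-response answer.
            return history, doctor_msg.split(final_marker, 1)[1].strip()
        history.append(("patient", patient(history)))
    return history, ""  # no diagnosis reached within the turn limit


def grade(diagnosis: str, reference: str) -> bool:
    """Stand-in for the grader-AI: case-insensitive exact match (an assumption)."""
    return diagnosis.strip().lower() == reference.strip().lower()


# Scripted stand-ins for the clinical LLM and the patient-AI.
def scripted_doctor(history: List[Turn]) -> str:
    asked = sum(1 for speaker, _ in history if speaker == "doctor")
    questions = ["What brings you in today?", "How long has the rash been present?"]
    if asked < len(questions):
        return questions[asked]
    return "Final diagnosis: Psoriasis"


def scripted_patient(history: List[Turn]) -> str:
    return "I have had an itchy, scaly rash for three weeks."


history, diagnosis = run_conversation(scripted_doctor, scripted_patient)
is_correct = grade(diagnosis, "psoriasis")
```

In this toy run the doctor asks two questions and then commits to a diagnosis, which the stand-in grader compares against a reference answer. Validation by medical experts, as described above, would happen outside such a loop.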

CRAFT-MD is designed to be flexible and scalable, allowing for the integration of new datasets and the evaluation of emerging models.

Submit your models for evaluation with CRAFT-MD.

Leaderboard Overview

The CRAFT-MD leaderboard ranks models by their performance in multi-turn conversations in the free-response question (FRQ) setting.

Combined Dataset

Rank  Model            Organization
1     O1-preview-0912  OpenAI
2     GPT-4o-0806      OpenAI
3     O1-mini-0912     OpenAI
4     GPT-4-1106       OpenAI
5     LLaMA-3.1-8b     Meta AI
6     LLaMA-3-8b       Meta AI
7     GPT-3.5-1106     OpenAI
8     Qwen-2.5-7b      Alibaba Cloud
9     Qwen-2-7b        Alibaba Cloud
10    Mistral-v0.1-7b  Mistral AI
11    LLaMA-2-7b       Meta AI
12    Mistral-v0.2-7b  Mistral AI

CRAFT-MD Leaderboard

The evaluation dataset consists of 2000 questions, each structured as a case vignette followed by four answer choices. Of these, 1800 were sourced from MedQA-USMLE, encompassing medical conditions commonly encountered in primary and specialist care settings. These questions span 12 medical specialties: Dermatology, Hematology and Oncology, Neurology, Gastroenterology, Pediatrics and Neonatology, Cardiology, Infectious Disease, Obstetrics and Gynecology, Urology and Nephrology, Endocrinology, Rheumatology, and Others. An additional 100 vignettes from an online question bank and 100 newly generated vignettes are also included.

Rank  Year  Model            Organization   Vignette  Multi-turn Conversation  Single-turn Conversation  Summarized Conversation
1     2024  O1-preview-0912  OpenAI         0.51      0.365                    0.212                     0.321
2     2024  GPT-4o-0806      OpenAI         0.562     0.335                    0.191                     0.35
3     2024  O1-mini-0912     OpenAI         0.601     0.301                    0.131                     0.305
4     2023  GPT-4-1106       OpenAI         0.486     0.264                    0.133                     0.272
5     2024  LLaMA-3.1-8b     Meta AI        0.346     0.187                    0.116                     0.163
6     2024  LLaMA-3-8b       Meta AI        0.291     0.174                    0.117                     0.16
7     2023  GPT-3.5-1106     OpenAI         0.375     0.169                    0.123                     0.174
8     2024  Qwen-2.5-7b      Alibaba Cloud  0.242     0.117                    0.099                     0.098
9     2024  Qwen-2-7b        Alibaba Cloud  0.174     0.078                    0.07                      0.065
10    2023  Mistral-v0.1-7b  Mistral AI     0.174     0.078                    0.062                     0.083
11    2023  LLaMA-2-7b       Meta AI        0.169     0.066                    0.065                     0.081
12    2023  Mistral-v0.2-7b  Mistral AI     0.222     0.066                    0.056                     0.056
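
The scores above can be checked programmatically. The sketch below copies the leaderboard values into a list, confirms that sorting by the multi-turn conversation score reproduces the published ranking, and computes how far each model's score falls between the vignette and multi-turn settings. The variable names are illustrative only, and tied models are assumed to keep the order in which the leaderboard lists them.

```python
# (model, vignette, multi_turn, single_turn, summarized), copied from the leaderboard.
rows = [
    ("O1-preview-0912", 0.51, 0.365, 0.212, 0.321),
    ("GPT-4o-0806", 0.562, 0.335, 0.191, 0.35),
    ("O1-mini-0912", 0.601, 0.301, 0.131, 0.305),
    ("GPT-4-1106", 0.486, 0.264, 0.133, 0.272),
    ("LLaMA-3.1-8b", 0.346, 0.187, 0.116, 0.163),
    ("LLaMA-3-8b", 0.291, 0.174, 0.117, 0.16),
    ("GPT-3.5-1106", 0.375, 0.169, 0.123, 0.174),
    ("Qwen-2.5-7b", 0.242, 0.117, 0.099, 0.098),
    ("Qwen-2-7b", 0.174, 0.078, 0.07, 0.065),
    ("Mistral-v0.1-7b", 0.174, 0.078, 0.062, 0.083),
    ("LLaMA-2-7b", 0.169, 0.066, 0.065, 0.081),
    ("Mistral-v0.2-7b", 0.222, 0.066, 0.056, 0.056),
]

# Python's sort is stable, so models tied on the multi-turn score keep their listed order.
ranked = sorted(rows, key=lambda r: r[2], reverse=True)
ranking_matches = [r[0] for r in ranked] == [r[0] for r in rows]

# Drop from the vignette score to the multi-turn conversation score, per model.
drops = {model: round(vignette - multi_turn, 3)
         for model, vignette, multi_turn, _, _ in rows}
largest_drop_model = max(drops, key=drops.get)

for model, drop in drops.items():
    print(f"{model:<16} vignette minus multi-turn: {drop:.3f}")
```

Every model scores lower in multi-turn conversation than on the vignette, and the model with the highest vignette score (O1-mini-0912) is not the one ranked first, which is consistent with the leaderboard being ordered by multi-turn performance.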