A new study has found that ChatGPT, an artificial intelligence (AI) system, can score at or near the approximately 60% passing threshold for the US Medical Licensing Examination (USMLE), with responses that are coherent, internally consistent and frequently insightful. It still can’t replace doctors, though.
The study reporting the AI’s relatively high success rate on the exam, just published in the open-access journal PLOS Digital Health by Tiffany Kung, Victor Tseng and colleagues at AnsibleHealth, is entitled “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.”
ChatGPT (Chat Generative Pre-trained Transformer) is known as a large language model (LLM) and was designed to generate human-like writing by predicting upcoming word sequences. Unlike most chatbots, ChatGPT, which was launched as a prototype last November, can’t search the Internet; it generates text using word relationships learned by its internal model.
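To make “predicting upcoming word sequences” concrete, here is a toy sketch of the idea using a simple bigram model. This is only an illustration of next-word prediction in general, not ChatGPT’s actual architecture, and the tiny corpus is invented for the example.

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration only).
corpus = "the patient has a fever the patient has a cough".split()

# Count which word tends to follow which.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def generate(start, n_words):
    """Greedily extend `start` by picking the most frequent next word."""
    words = [start]
    for _ in range(n_words):
        counts = followers.get(words[-1])
        if not counts:
            break  # no known continuation
        words.append(counts.most_common(1)[0][0])
    return " ".join(words)

print(generate("the", 4))  # -> "the patient has a fever"
```

A real LLM replaces these raw co-occurrence counts with probabilities computed by a neural network over a very long context, but the generation loop — predict the next token, append it, repeat — is the same in spirit.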
“The ability to build highly accurate classification models rapidly and regardless of input data type (like images, text and audio) has enabled widespread adoption of applications such as automated tagging of objects and users in photographs, near-human-level text translation, automated scanning in bank ATMs and even the generation of image captions,” they wrote. “While these technologies have made significant impacts across many industries, applications in clinical care remain limited.”
A shortage of structured, machine-readable data
They explained that the proliferation of clinical free-text fields combined with a lack of general interoperability between health IT systems contributes to a shortage of structured, machine-readable data required for the development of deep learning algorithms.
“Even when algorithms applicable to clinical care are developed, their quality tends to be highly variable, with many failing to generalize across settings due to limited technical, statistical and conceptual reproducibility,” the study said.
As a result, the overwhelming majority of successful healthcare applications currently support back-office functions, ranging from payment operations and automated prior-authorization processing to management of supply chains and cybersecurity threats. “With rare exceptions – even in medical imaging – there are relatively few applications of AI directly used in widespread clinical care today.”
The great promise of AI systems
AI systems today hold great promise to improve medical care and health outcomes, they wrote. “As such, it is crucial to ensure that the development of clinical AI is guided by the principles of trust and explainability. Measuring AI medical knowledge in comparison to that of expert human clinicians is a critical first step in evaluating these qualities.”
Kung and colleagues tested ChatGPT’s performance on the USMLE, a highly standardized and regulated series of three exams (Steps 1, 2CK and 3) required for an American medical license. Taken by medical students and physicians-in-training, the USMLE assesses knowledge spanning most medical disciplines ranging from biochemistry to diagnostic reasoning to bioethics. After screening to remove image-based questions, the authors tested the software on 350 of the 376 public questions available from the June 2022 USMLE release.
After indeterminate responses were removed, ChatGPT scored between 52.4% and 75.0% across the three USMLE exams. The passing threshold each year is approximately 60%. ChatGPT also demonstrated 94.6% concordance across all its responses and produced at least one significant insight (something that was new, non-obvious and clinically valid) for 88.9% of its responses.
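The exclusion step described above — scoring only the determinate responses — can be sketched in a few lines. This is a hypothetical illustration of the calculation, not the authors’ actual code, and the per-question labels (`"correct"`, `"incorrect"`, `"indeterminate"`) are assumed for the example.

```python
def score(responses):
    """Accuracy over determinate responses only.

    responses: list of 'correct' / 'incorrect' / 'indeterminate' labels.
    Indeterminate answers are removed before computing accuracy, as in
    the study's reported methodology.
    """
    determinate = [r for r in responses if r != "indeterminate"]
    if not determinate:
        return None  # no scorable answers
    return determinate.count("correct") / len(determinate)

sample = ["correct", "indeterminate", "correct", "incorrect", "correct"]
print(score(sample))  # 0.75 — 3 correct of 4 determinate answers
```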
Notably, ChatGPT exceeded the performance of PubMedGPT, a counterpart model trained exclusively on biomedical domain literature, which scored 50.8% on an older dataset of USMLE-style questions.
While the relatively small input size restricted the depth and range of analyses, the authors note that their findings offer a glimpse of ChatGPT’s potential to enhance medical education and, eventually, clinical practice. For example, they said, clinicians at AnsibleHealth already use ChatGPT to rewrite jargon-heavy reports for easier patient comprehension.
“Reaching the passing score for this notoriously difficult expert exam, and doing so without any human reinforcement, marks a notable milestone in clinical AI maturation,” the authors noted.
Kung concluded that the chatbot’s role in this research went beyond being the study subject. “ChatGPT contributed substantially to the writing of our manuscript. We interacted with ChatGPT much like a colleague, asking it to synthesize, simplify and offer counterpoints to drafts in progress,” they wrote. “All of the co-authors valued ChatGPT’s input.”