The AI chatbot ChatGPT nearly passed a radiology board-style exam, according to a study published earlier this month.
Radiology is a "branch of medicine that uses imaging technology to diagnose and treat disease," as described by the MedlinePlus Medical Encyclopedia.
The study, published in the peer-reviewed journal Radiology, was conducted between late February and early March of this year. The chatbot was given 150 multiple-choice questions "designed to match the style, content, and difficulty of the Canadian Royal College and American Board of Radiology examinations."
The AI chatbot's performance was evaluated by topic and by question type.
The questions fell into multiple categories, including lower-order questions (testing recall and understanding) and higher-order questions (requiring application and analysis to answer). Notably, none of the questions included images.
Overall, the chatbot answered 104 of the 150 questions correctly (69%), one percentage point below the exam's passing threshold of 70%.
ChatGPT performed better on lower-order thinking questions, answering 84% of them correctly, but managed only 60% on higher-order thinking questions.
The chatbot proved weakest on questions involving calculation, classification, and conceptual application. The study also noted that it "used confident language consistently, even when incorrect."
“The use of large language models like ChatGPT is exploding and only going to increase,” said the study's lead author, Rajesh Bhayana, MD. “Our research provides insight into ChatGPT’s performance in a radiology context, highlighting the incredible potential of large language models, along with the current limitations that make it unreliable.”
"Our research provides insight into ChatGPT’s performance in a radiology context, highlighting the incredible potential of large language models, along with the current limitations that make it unreliable.”
Rajesh Bhayana
GPT-4's performance in Radiology
A separate study examined the performance of GPT-4, OpenAI’s latest large language model, on the same radiology board-style exam, finding that it improved substantially over GPT-3.5, the model on which ChatGPT is based.
GPT-4 was tested on the same 150 questions as its predecessor, and researchers compared the two models' performance. GPT-4 passed the exam, answering 121 questions correctly (81%), compared with ChatGPT's 104.
GPT-4 outperformed GPT-3.5 on higher-order thinking questions, including those on imaging findings and conceptual application, but showed no improvement on lower-order thinking questions. Moreover, despite its overall gains, it answered 12 questions incorrectly that GPT-3.5 had gotten right, nine of them lower-order questions.
The study concludes by noting the "impressive improvement in the performance of ChatGPT in radiology over a short time period," with the researchers stressing the "growing potential of LLMs."
However, the lack of improvement on lower-order questions raises doubts about the newer model's reliability for information gathering.