The strong performance of the GPT-4 large language model (LLM) suggests that such models could help fill gaps in physicians' knowledge and aid in diagnosing medical conditions, researchers concluded in a study published in the Journal of Medical Artificial Intelligence.
Physical examinations are important diagnostic tools that can disclose critical insights into a patient's health; however, complex conditions may be overlooked if a clinician lacks specialized training in that area.
Although previous research has investigated LLMs as diagnostic aids, their use in guiding physical exams remains unexplored.
To address this gap, researchers from Mass General Brigham prompted the GPT-4 LLM to recommend physical exam instructions based on patient symptoms. The study suggests the potential of using LLMs to aid clinicians during physical exams.
"Medical professionals early in their career may face challenges in performing the appropriate patient-tailored physical exam because of their limited experience or other context-dependent factors, such as lower resourced settings," senior author Dr. Marc D. Succi said in a statement.
Succi, an associate chair of innovation and commercialization for enterprise radiology and executive director of the Medically Engineered Solutions in Healthcare (MESH) Incubator at Mass General Brigham, added, "LLMs have the potential to serve as a bridge and parallel support physicians and other medical professionals with physical exam techniques and enhance their diagnostic abilities at the point of care."
Succi and colleagues "prompted GPT-4 to recommend physical exam instructions based on the patient's primary symptom, for example, a painful hip. GPT-4's responses were then evaluated by three attending physicians on a scale of 1 to 5 points based on accuracy, comprehensiveness, readability and overall quality."
The researchers discovered that GPT-4 performed well at providing instructions, scoring at least 80% of the possible points. The highest score was for "Leg Pain Upon Exertion" and the lowest was for "Lower Abdominal Pain."
Lead author Arya Rao, a student researcher in the MESH Incubator attending Harvard Medical School, said in a statement, "GPT-4 performed well in many respects, yet its occasional vagueness or omissions in critical areas, like diagnostic specificity, remind us of the necessity of physician judgment to ensure comprehensive patient care."
WHY IT MATTERS
The researchers noted limitations: although GPT-4 gave detailed responses, it sometimes left out key instructions or was overly vague, indicating the need for a human evaluator.
Study authors demonstrated the potential of off-the-shelf LLMs as adjunctive diagnostic tools for providing clinically relevant physical exam recommendations based on chief complaints.
Study authors concluded that, in the future, "real-world patient cases could be used to fine-tune LLMs for a large and diverse set of specific clinical scenarios that could help address the observed gaps in the diagnostic capacity of GPT-4."
The researchers also expect a greater role for LLMs in clinical decision support, helping to fill knowledge gaps and serving as an academic tool for emerging medical professionals, which could enhance physicians' diagnostic capacity.