Physicians Should Not Use ChatGPT for Clinical Recommendations, Study Indicates

Key Takeaways

  • GPT-4-turbo and GPT-3.5-turbo underperformed compared to resident physicians in emergency department tasks, except for antibiotic prescriptions.
  • AI models demonstrated high sensitivity but low specificity, often leading to overprescription and false positives.

A recent study demonstrated that resident physicians surpass both GPT-4-turbo and GPT-3.5-turbo at making clinical recommendations in the emergency department.


Christopher Y.K. Williams, MD (Credit: LinkedIn)

ChatGPT will not be helping physicians with clinical decision-making any time soon, a new study demonstrated.1

GPT-4-turbo performed tasks better than the earlier GPT-3.5-turbo, particularly in predicting the need for antibiotics for a patient in the emergency department, but overall the language model still did not perform better than a resident physician.

Artificial intelligence (AI) in healthcare has been studied across specialties, from psychiatry and dermatology to ophthalmology and now hospital medicine. Although AI can help physicians complete their tasks more quickly, such as speeding up the diagnostic process, it is not a replacement for a human, especially in the emergency department.

“This is a valuable message to clinicians not to blindly trust these models,” said lead investigator Christopher Y.K. Williams, MD, from Bakar Computational Health Sciences Institute, University of California, San Francisco.2 “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”

Investigators conducted a study to determine whether large language models, such as GPT-4, can provide clinical recommendations for admission status, radiological investigation request status, and antibiotic prescription status using clinical notes from the emergency department.1 The team randomly selected 10,000 emergency department visits (out of 351,401 visits) to assess the accuracy of zero-shot clinical recommendations generated by GPT-3.5-turbo and GPT-4-turbo across 4 different prompts (a brief illustrative sketch follows the list):

  • Prompt A: Asked, “Please return whether the patient should be admitted to hospital/requires radiological investigation/requires antibiotics.”
  • Prompt B: Added the instruction to “only suggest… if absolutely required.”
  • Prompt C: Removed restrictions on the verbosity of the GPT-3.5-turbo response.
  • Prompt D: Added “Let’s think step by step” chain-of-thought prompting.
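
The prompting itself was straightforward zero-shot querying of chat models with a clinical note. The snippet below is a minimal illustrative sketch, not the study’s actual pipeline; it assumes the OpenAI Python SDK, the prompt strings only paraphrase the article’s descriptions of Prompts A, B, and D, and parsing of the replies into yes/no recommendations is omitted.

```python
# Minimal sketch (not the study's pipeline): issue zero-shot prompt variants
# against a chat model for one de-identified clinical note.
# Assumes the OpenAI Python SDK; prompt wording paraphrases the article.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = {
    "A": "Please return whether the patient should be admitted to hospital.",
    "B": ("Please return whether the patient should be admitted to hospital. "
          "Only suggest admission if absolutely required."),
    "D": ("Please return whether the patient should be admitted to hospital. "
          "Let's think step by step."),  # chain-of-thought variant
}


def ask_model(note_text: str, prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one clinical note plus one prompt variant; return the raw reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep output as deterministic as possible for evaluation
        messages=[
            {"role": "user", "content": f"{prompt}\n\nClinical note:\n{note_text}"},
        ],
    )
    return response.choices[0].message.content


# Example: compare both models on the same note and prompt variant.
# note = "..."  # de-identified emergency department note text
# for model in ("gpt-3.5-turbo", "gpt-4-turbo"):
#     print(model, ask_model(note, PROMPTS["B"], model=model))
```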

Both ChatGPT models, GPT-4-turbo and GPT-3.5-turbo, performed poorly compared with resident physicians; per the accompanying release, GPT-4-turbo was 8% less accurate than the physicians and GPT-3.5-turbo was 24% less accurate.2 The language models tended to be overly cautious in their recommendations, showing high sensitivity.

Prompt A led to high sensitivity but low specificity. Prompt B marginally improved specificity. Prompts C and D generated the greatest specificity, with limited effect on sensitivity.

The team discovered physician sensitivity was below that of GPT-3.5-turbo responses, but physician specificity was significantly greater. They observed similar findings when comparing GPT-4-turbo with physicians, except for the antibiotic prescription task, where the language model’s specificity surpassed that of physicians despite worse sensitivity.
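
Sensitivity and specificity here carry their usual meanings: sensitivity is the fraction of truly indicated interventions the model recommends, and specificity is the fraction of unneeded interventions it correctly withholds. The toy sketch below, using made-up labels rather than study data, shows how an over-cautious model that recommends almost everything can score near-perfect sensitivity while specificity collapses.

```python
# Illustration only (toy labels, not study data): compute sensitivity and
# specificity from binary ground truth and binary recommendations.
def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: true need for antibiotics vs. an over-cautious model that
# recommends them for nearly every patient.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]
print(sensitivity_specificity(y_true, y_pred))  # (1.0, 0.2): sensitive but unspecific
```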

However, when the language models were evaluated in a more representative setting, an unbalanced sample of 1000 emergency department visits reflecting real-world prevalence, resident physician recommendations were more accurate than GPT-3.5-turbo recommendations for all prompts. GPT-4-turbo performed better than physicians on the antibiotic prescription task but worse on admission status and radiological investigation requests.

The study ultimately revealed the AI models tended to overprescribe, resulting in many false-positive suggestions. This can be harmful not only to the patient but also to the healthcare system itself by straining hospital resource availability and increasing costs.

Williams explained AI’s tendency to overprescribe could stem from the models being trained on the internet, where trustworthy medical advice sites are not designed to answer emergency medical questions but instead direct readers to a doctor who can address their concerns.

“These models are almost fine-tuned to say, ‘seek medical advice,’ which is quite right from a general public safety perspective,” Williams said.2 “But erring on the side of caution isn’t always appropriate in the ED setting, where unnecessary interventions could cause patients harm, strain resources, and lead to higher costs for patients.”

References

  1. Williams CYK, Miao BY, Kornblith AE, Butte AJ. Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun. 2024;15(1):8236. Published 2024 Oct 8. doi:10.1038/s41467-024-52415-1
  2. When it comes to emergency care, ChatGPT overprescribes. EurekAlert! October 8, 2024. https://www.eurekalert.org/news-releases/1060326. Accessed October 17, 2024.

