In a recent study posted to the medRxiv* preprint server, researchers evaluate the accuracy and reproducibility of responses from ChatGPT versions 3.5 and 4 in answering heart failure-related questions.
Study: Appropriateness of ChatGPT in answering heart failure related questions.
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
Background
By 2030, researchers estimate that healthcare costs associated with heart failure will reach around $70 billion annually in the United States. About 70% of these costs are due to hospitalizations, which constitute 1-2% of all hospital admissions in the United States. Studies have shown that patients who possess more knowledge about managing their heart condition tend to have fewer and shorter hospital stays.
With the increasing use of online resources for health information, nearly one billion healthcare-related questions are searched on Google every day. One notable artificial intelligence (AI) model known as Chat Generative Pre-Trained Transformer (ChatGPT) has recently gained popularity.
ChatGPT is a large language model (LLM) that has been trained on a diverse dataset, including medical topics, and can provide conversational responses to user queries. The medical community is actively investigating the utility of ChatGPT and similar models in the field of medicine by evaluating its knowledge and reasoning capabilities.
About the study
In the current study, researchers collected a list of 125 commonly asked questions about heart failure from reputable medical organizations and Facebook support groups. After careful evaluation, 18 questions were eliminated due to duplicate content, vague phrasing, or failure to address the patient’s perspective.
The remaining 107 questions were then entered twice into each version of ChatGPT using the “new chat” feature, generating two responses to every question from each model.
To assess the accuracy of the responses, two board-certified cardiologists independently graded them on a four-category scale: ‘comprehensive,’ ‘correct but inadequate,’ ‘some correct and some incorrect,’ and ‘completely incorrect.’ This evaluation process was performed for both ChatGPT-3.5 and ChatGPT-4 responses. The reproducibility of the responses was also evaluated by comparing the comprehensiveness and accuracy grades of the two responses to each question from each model.
Any discrepancies in grading between the reviewers were resolved by a third reviewer who is a board-certified specialist in advanced heart failure with over 20 years of clinical experience.
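The reproducibility check described above can be illustrated with a minimal sketch. This is not the authors’ code; the grade labels follow the four-point scale in the study, but the helper function and the example grade pairs are invented for illustration.

```python
# Hypothetical sketch of the reproducibility tally: each question yields
# two responses per model, each graded on the study's four-point scale;
# a question counts as reproducible when both grades agree.

GRADES = [
    "comprehensive",
    "correct but inadequate",
    "some correct and some incorrect",
    "completely incorrect",
]

def reproducibility(pairs):
    """Return the fraction of questions whose two graded responses agree."""
    agree = sum(1 for first, second in pairs if first == second)
    return agree / len(pairs)

# Illustrative (invented) grade pairs for four questions:
example = [
    ("comprehensive", "comprehensive"),
    ("correct but inadequate", "correct but inadequate"),
    ("comprehensive", "correct but inadequate"),  # grades disagree
    ("comprehensive", "comprehensive"),
]
print(reproducibility(example))  # 0.75
```

Under this scheme, a model that answers every question identically both times, as reported for ChatGPT-4, would score 1.0.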
Study results
The evaluation of responses from both ChatGPT models revealed that most responses were considered ‘comprehensive’ or ‘correct but inadequate.’ ChatGPT-4 exhibited a greater depth of comprehensive knowledge in the categories of ‘management’ and ‘basic knowledge’ as compared to ChatGPT-3.5.
The performance of ChatGPT-3.5 was better in the ‘other’ category, which encompassed topics like support, prognosis, and procedures. For example, ChatGPT-3.5 provided a general answer about the cardiac benefits of sodium-glucose cotransporter-2 (SGLT2) inhibitors, whereas ChatGPT-4 offered a more detailed yet concise response regarding the impact of these agents on diuresis and blood pressure.
About 2% of responses from ChatGPT-3.5 were graded as ‘some correct and some incorrect,’ while no responses from ChatGPT-4 fell into this category or the ‘completely incorrect’ category. When examining reproducibility, both models provided consistent responses for most questions, with ChatGPT-3.5 scoring more than 94% in all categories and ChatGPT-4 achieving 100% reproducibility for all answers.
Conclusions
The present study reported that ChatGPT-4 demonstrated superior performance as compared to ChatGPT-3.5 by providing more comprehensive responses to heart-failure-related questions without any incorrect answers. Both models exhibited high reproducibility for most questions. These findings highlight the impressive capabilities and rapid advancement of LLMs in providing reliable and comprehensive information to patients.
ChatGPT has the potential to serve as a valuable resource for people with heart conditions by empowering them with knowledge under the guidance of healthcare providers. The user-friendly interface and human-like conversational responses make ChatGPT an appealing tool for patients seeking health-related information. The superior performance of ChatGPT-4 can be attributed to enhanced training, which focuses on better understanding user intent and handling complex scenarios.
While ChatGPT performed well in this study, there are important limitations to consider. Occasionally, the model may provide inaccurate but believable responses and, at times, nonsensical answers.
The accuracy of the model relies on its training dataset, which has not been disclosed, and recommendations may vary across regions. Additional limitations include the inability to blind the reviewers to the versions of ChatGPT and the potential for bias introduced through subjective review, despite the use of a panel of multiple reviewers.
Further research and exploration of ChatGPT’s capabilities and limitations are recommended to maximize its potential impact on improving patient outcomes.