In a recent study published in Prostate Cancer and Prostatic Diseases, researchers evaluated the accuracy and quality of Chat Generative Pre-trained Transformer (ChatGPT) responses to questions about male lower urinary tract symptoms (LUTS) suggestive of benign prostate enlargement (BPE), comparing them with established urological references.
Study: Can ChatGPT provide high-quality patient information on male lower urinary tract symptoms suggestive of benign prostate enlargement?
Background
As patients increasingly seek medical guidance online, major urological associations such as the European Association of Urology (EAU) and the American Urological Association (AUA) provide high-quality resources. However, modern technologies such as artificial intelligence (AI) are gaining popularity due to their efficiency.
ChatGPT, with over 1.5 million monthly visits, offers a user-friendly, conversational interface. A recent survey showed that 20% of urologists used ChatGPT clinically, with 56% recognizing its potential in decision-making.
Studies on ChatGPT's urological accuracy show mixed results. Further research is needed to comprehensively evaluate the effectiveness and reliability of AI tools like ChatGPT in delivering accurate and high-quality medical information.
About the study
The researchers reviewed the EAU and AUA patient information websites to identify key topics on BPE and formulated 88 related questions.
These questions covered definitions, symptoms, diagnostics, risks, management, and treatment options. Each question was independently submitted to ChatGPT, and the responses were recorded for comparison with the reference materials.
Two examiners classified ChatGPT's responses as true negative (TN), false negative (FN), true positive (TP), or false positive (FP). Discrepancies were resolved by consensus or consultation with a senior specialist.
Precision, recall, and the F1 score were calculated from these classifications to assess accuracy. The F1 score, the harmonic mean of precision and recall, was used as a reliable single measure of model accuracy.
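As a rough illustration of how these three metrics follow from the examiners' classifications, here is a minimal Python sketch using the standard definitions. The function name `classification_metrics` and the counts are hypothetical and are not taken from the study. True negatives do not enter any of the three formulas.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, recall, and F1 from confusion-matrix counts."""
    # Precision: share of positive classifications that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were captured.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one topic category (not from the study).
metrics = classification_metrics(tp=8, fp=2, fn=0)
print(metrics)
```

Because the harmonic mean is pulled toward the lower of its two inputs, a high recall cannot fully compensate for a low precision.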
General quality scores (GQS) were assigned using a 5-point Likert scale, assessing the truthfulness, relevancy, structure, and language of ChatGPT's responses. Scores ranged from 1 (false or misleading) to 5 (extremely accurate and relevant). The mean GQS from the two examiners was used as the final score for each question.
Examiner agreement on the GQS ratings was measured using the intraclass correlation coefficient (ICC), and differences were assessed with the Wilcoxon signed-rank test, with a p-value of less than 0.05 considered significant. Analyses were conducted using SAS version 9.4.
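The article does not say which form of the ICC was used, and the study's analyses were run in SAS. Purely as an illustration, the sketch below computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), in plain Python. The function name `icc_two_way_single` and the ratings are hypothetical.

```python
def icc_two_way_single(ratings):
    """ICC(2,1) for a table with one row per item and one column per rater."""
    n = len(ratings)       # number of rated items (questions)
    k = len(ratings[0])    # number of raters (examiners)
    grand = sum(x for row in ratings for x in row) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical 5-point ratings from two examiners for six questions.
example_ratings = [(4, 4), (5, 4), (3, 3), (4, 5), (2, 2), (5, 5)]
print(icc_two_way_single(example_ratings))
```

Values close to 1 indicate that the two examiners rated the same responses almost identically. The sketch assumes the items do not all share the same mean rating, since the ratio would otherwise be undefined.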
Study results
ChatGPT addressed 88 questions across eight categories related to BPE. Notably, 71.6% of the questions (63 out of 88) focused on BPE management, including conventional surgical interventions (27 questions), minimally invasive surgical therapies (MIST, 21 questions), and pharmacotherapy (15 questions).
ChatGPT generated responses to all 88 questions, totaling 22,946 words and 1,430 sentences. In contrast, the EAU website contained 4,914 words and 200 sentences, while the AUA patient guide had 3,472 words and 238 sentences. The AI-generated responses were almost three times as long as the two reference sources combined.
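The length comparison can be checked from the reported counts. This short Python sketch assumes that "source materials" refers to the EAU and AUA texts combined. The variable names are illustrative.

```python
# Word and sentence counts as reported in the article.
chatgpt_words, chatgpt_sentences = 22_946, 1_430
eau_words, eau_sentences = 4_914, 200
aua_words, aua_sentences = 3_472, 238

# Assumption: the comparison is against both reference sources combined.
reference_words = eau_words + aua_words
reference_sentences = eau_sentences + aua_sentences

word_ratio = chatgpt_words / reference_words
sentence_ratio = chatgpt_sentences / reference_sentences

print(f"Words: {word_ratio:.2f}x, sentences: {sentence_ratio:.2f}x")
```

Under that assumption, the ChatGPT output is roughly 2.7 times as long by word count and about 3.3 times as long by sentence count.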
The performance metrics of ChatGPT's responses varied by topic, with F1 scores ranging from 0.67 to 1.0, precision from 0.5 to 1.0, recall from 0.9 to 1.0, and GQS from 3.5 to 5.
Overall, ChatGPT achieved an F1 score of 0.79, a precision of 0.66, and a recall of 0.97. The GQS ratings from both examiners had a median of 4, with individual ratings ranging from 1 to 5.
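The overall figures are internally consistent: taking the harmonic mean of the reported precision and recall reproduces the reported F1 score. A quick Python check, where the helper name `f1_from` is illustrative:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the article.
overall_f1 = f1_from(0.66, 0.97)
print(round(overall_f1, 2))  # 0.79
```

The near-perfect recall alongside a lower precision means that the overall F1 score is limited mainly by precision.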
There was no statistically significant difference between the overall quality scores assigned by the two examiners (p = 0.72), and agreement between them was good, with an ICC of 0.86.
Conclusions
To summarize, ChatGPT addressed all 88 queries, with performance metrics of 0.5 or higher in every category and a median GQS of 4, indicating high-quality responses. However, ChatGPT's responses were often excessively lengthy.
Accuracy varied by topic: ChatGPT performed best on general BPE concepts and less well on minimally invasive surgical therapies. The high level of agreement between examiners on the quality of the responses supports the reliability of the evaluation process.
As AI continues to evolve, it holds promise for enhancing patient education and support, but ongoing assessment and improvement are essential to maximize its utility in clinical settings.