Both care providers and patients use the internet to obtain quick healthcare information. Therefore, it is not surprising that fertility-oriented content has been explored extensively over the years. Unfortunately, although millions of results show up in a single Google search for the word “infertility,” the medical accuracy of this content is not verified.
Advancements in Natural Language Processing (NLP), a branch of Artificial Intelligence (AI), have enabled computers to learn and use human language to communicate. Recently, OpenAI has developed an AI chatbot called ChatGPT, which enables human users to have conversations with a computer interface.
Study: The promise and peril of using a large language model to obtain clinical information: ChatGPT performs strongly as a fertility counseling tool with limitations
A recent Fertility and Sterility study used fertility as a domain to test ChatGPT’s performance and assess its usage as a clinical tool.
The recent evolution of ChatGPT
The uniqueness of ChatGPT can be attributed to its capacity to perform language tasks, such as writing articles, answering questions, or even telling jokes. These features were developed following recent advancements in deep learning (DL) algorithms.
For example, Generative Pretrained Transformer 3 (GPT-3) is a DL algorithm that is notable for its 175 billion parameters and its vast training data set of 57 billion words drawn from varied sources.
In November 2022, ChatGPT was initially released as an updated version of the GPT-3.5 model. Thereafter, it became the fastest-growing app of all time, acquiring over 100 million users within two months of its release.
Although ChatGPT could serve as a clinical tool for patients to access medical information, the model has some limitations in this role.
As of February 2023, ChatGPT was trained with data until 2021; therefore, it is not equipped with the latest data. In addition, one of the critical concerns regarding its use is the production of plagiarized and inaccurate information.
Due to the ease of use and human-like language, patients are enticed to use this application to ask questions regarding their health and receive answers. Therefore, it is imperative to characterize this model’s performance as a clinical tool and elucidate whether it provides misleading answers.
About the study
The current study tested the “Feb 13” version of ChatGPT to evaluate its consistency in answering fertility-related clinical questions that a patient might ask the chatbot. The performance of ChatGPT was assessed across three domains.
The first domain was associated with frequently asked questions about infertility on the United States Centers for Disease Control and Prevention (CDC) website. A total of 17 frequently asked questions, such as “what is infertility?” or “how do doctors treat infertility?” were considered.
These questions were entered in ChatGPT during a single session. Answers produced by ChatGPT were compared with the answers provided by CDC.
The second domain utilized important surveys related to fertility. The Cardiff Fertility Knowledge Scale (CFKS) questionnaire, which includes questions about fertility, misconceptions, and risk factors for impaired fertility, was used for this domain. The Fertility and Infertility Treatment Knowledge Score (FIT-KS) survey questionnaire was also used to assess ChatGPT’s performance.
The third domain focused on assessing the chatbot’s ability to reproduce the clinical standard in providing medical advice. This domain was structured based on the American Society for Reproductive Medicine (ASRM) Committee Opinion “Optimizing Natural Fertility.”
Study findings
ChatGPT provided answers to first domain questions that resembled the responses provided by CDC about infertility. The mean length of responses provided by the CDC and ChatGPT was the same.
When the reliability of the content provided by ChatGPT was analyzed, no significant differences in factual content were found between the CDC’s answers and those produced by ChatGPT. Sentiment polarity and subjectivity also did not differ. Notably, only 6.12% of ChatGPT’s factual statements were identified as incorrect, and only one statement cited a reference.
In the second domain, ChatGPT achieved high scores corresponding to the 87th percentile of Bunting’s 2013 international cohort for the CFKS and the 95th percentile based on Kudesia’s 2017 cohort for the FIT-KS. For all questions, ChatGPT provided a context and justification for its answer choices. Additionally, ChatGPT produced an inconclusive answer only once, and the answer was considered to be neither correct nor incorrect.
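To illustrate what a percentile score against a reference cohort means, the following is a minimal sketch of one common way to compute a percentile rank. The function name, the tie-handling convention, and the cohort scores are all hypothetical; they are not taken from the study or from Bunting’s or Kudesia’s data, and the study may have derived its percentiles differently.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(cohort_scores, score):
    """Percent of cohort scores below `score`, counting ties as half.

    This mid-rank convention is an assumption made for illustration;
    it is not necessarily the method used in the study.
    """
    if not cohort_scores:
        raise ValueError("cohort_scores must not be empty")
    ordered = sorted(cohort_scores)
    below = bisect_left(ordered, score)          # scores strictly lower
    ties = bisect_right(ordered, score) - below  # scores exactly equal
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical questionnaire scores for a ten-person reference cohort.
hypothetical_cohort = [4, 5, 6, 6, 7, 8, 8, 9, 10, 11]

# A respondent scoring 10 sits above most of this made-up cohort.
print(percentile_rank(hypothetical_cohort, 10))
```

With real cohort data, the same calculation would place a single respondent’s questionnaire score relative to every participant in the published cohort.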
In the third domain, ChatGPT reproduced missing facts for all seven summary statements from “Optimizing Natural Fertility.” For each response, ChatGPT underscored the fact removed from the statement and did not provide disagreeing facts. In this domain, consistent results were obtained across all repeat administrations.
Limitations
The current study has several limitations, including the evaluation of only one version of ChatGPT. The recent launch of similar models, such as AI-powered Microsoft Bing and Google Bard, will allow patients to access alternative chatbots. Therefore, the nature and availability of these models are subject to rapid changes.
Although ChatGPT provides prompt responses, it may draw on data from unreliable references. In addition, the consistency of the model may change in its next iteration. Therefore, it is also important to characterize the volatility of the model’s responses as its data are updated.