In a recent study published in npj Digital Medicine, researchers compared abstracts from studies published in high-impact medical journals with abstracts generated by the artificial intelligence (AI) large language model ChatGPT to evaluate the accuracy and reliability of using such models for scientific writing.
Background
The recent release of ChatGPT by OpenAI has received substantial attention due to its utility, as well as controversies surrounding its use in academia. Although many users have had positive experiences while using ChatGPT, others have expressed reservations about its increased use and the decline of traditional writing methods.
Large language models, of which ChatGPT is one of the largest, are neural network-based models trained on large-scale data to produce natural-sounding text. Generative Pretrained Transformer-3 (GPT-3), on which ChatGPT is built, has 175 billion parameters and can produce coherent, fluent content that is often difficult to distinguish from text written by humans.
ChatGPT is freely available to the public and is already being widely used to create academic content across all fields of science, including biomedical research. However, because biomedical research has long-standing implications for many aspects of human health and medicine, it is essential to determine the accuracy and reliability of content created with ChatGPT.
About the study
In the present study, researchers obtained 50 abstracts from five medical journals with high impact factors and used these as the control group. The five journals from which abstracts and titles were drawn were Nature Medicine, The Lancet, BMJ, JAMA, and NEJM.
For the test group, ChatGPT was used to generate 50 abstracts based on selected journals and titles from the list. To this end, the researchers asked ChatGPT to create an abstract for a study with a given title in the style of a given journal.
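For illustration, a prompt of the shape described above could also be issued programmatically. The sketch below is not the study's actual workflow, and the model name, prompt wording, and title are assumptions; it simply shows how such a request could be sent through the OpenAI Python client.

```python
# Illustrative sketch only: the study used the ChatGPT interface directly, not this code.
# The model name and exact prompt wording below are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

title = "Example article title taken from one of the sampled journals"  # placeholder
journal = "Nature Medicine"  # one of the five source journals

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed stand-in for the ChatGPT release used in the study
    messages=[
        {
            "role": "user",
            "content": (
                f"Please write a scientific abstract for the article '{title}' "
                f"in the style of {journal}."
            ),
        }
    ],
)

print(response.choices[0].message.content)  # the generated abstract
```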
The two sets of abstracts were then compared using the GPT-2 Output Detector, an AI output detector that assigns higher scores to text it judges more likely to have been generated by an AI language model. Free and paid plagiarism-checking tools were also used to measure the percentage of matching text in the ChatGPT-generated and original abstracts.
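As a rough sketch of how such a detector is applied, the snippet below scores a single abstract with the openly released RoBERTa-based GPT-2 output detector available on the Hugging Face Hub; the model identifier and label handling are assumptions rather than details taken from the study.

```python
# Minimal sketch, assuming the publicly released RoBERTa-based GPT-2 output
# detector on the Hugging Face Hub behaves like the detector used in the study.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

abstract = "Background: ... Methods: ... Results: ... Conclusions: ..."  # candidate abstract text

# The classifier returns a label and a confidence score; consult
# detector.model.config.id2label to confirm which label denotes AI-generated text.
result = detector(abstract, truncation=True)[0]
print(f"{result['label']}: {result['score']:.4f}")
```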
Blinded human reviewers were also asked to differentiate between the actual and ChatGPT-created abstracts. Each reviewer was assigned 25 abstracts, a mix of original and ChatGPT-generated, and asked to score each one as either original or AI-generated. The ability of the ChatGPT-produced abstracts to adhere to journal guidelines was also assessed.
ChatGPT abstracts lack authenticity
The abstracts generated through ChatGPT showed a high probability of being AI-generated, with a median score of 99.89%. Comparatively, original abstracts had a median score of 0.02%, thus indicating that these abstracts had a low probability of being generated using an AI language tool.
Plagiarism tools, by contrast, reported far higher match scores for the original abstracts: the AI-generated abstracts had a median similarity score of 27, compared with 100 for the original abstracts.
The blinded human reviewers correctly identified about 64% of the generated abstracts as having been produced by ChatGPT and 86% of the original abstracts as genuine.
Conversely, the reviewers misidentified about 32% of the AI-generated abstracts as original. The GPT-2 Output Detector, however, reported similarly high scores for all of the ChatGPT-generated abstracts.
About 14% of the original abstracts were misidentified as ChatGPT-generated, indicating that human reviewers had difficulty distinguishing original scientific writing from AI-generated text. The reviewers also commented that, for the abstracts they correctly identified as ChatGPT-generated, the writing was vague and superficial, with an undue focus on details such as alternate spellings of some words or clinical trial registration numbers.
Conclusions
While the AI output detection tools successfully identified ChatGPT-generated abstracts and distinguished them from original abstracts, human reviewers often had difficulty telling the two apart. These findings highlight the utility of AI output detectors in helping journals and scientific publishers maintain scientific standards for publications.
Journal reference:
- Gao, C. A., Howard, F. M., Markov, N. S., et al. (2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. npj Digital Medicine 6, 75. doi:10.1038/s41746-023-00819-6.