AI vs. human: Study shows ChatGPT falls short in accuracy and authenticity of scientific abstracts

In a recent study published in npj Digital Medicine, researchers compared abstracts from studies published in high-impact medical journals with abstracts generated by the artificial intelligence (AI) large language model ChatGPT to evaluate the accuracy and reliability of such models for scientific writing.

Study: Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. Image Credit: NicoElNino / Shutterstock.com

Background

The recent release of ChatGPT by OpenAI has received substantial attention, both for its utility and for the controversies surrounding its use in academia. Although many users have had positive experiences with ChatGPT, others have expressed reservations about its growing use and the decline of traditional writing methods.

Large language models, of which ChatGPT is among the largest, are neural network-based models trained on large-scale data to produce natural-sounding text. Generative Pretrained Transformer-3 (GPT-3), on which ChatGPT is built, has 175 billion parameters and can produce coherent, fluent content that is often difficult to distinguish from human-written text.

ChatGPT is freely available to the public and is being widely used to create academic content across all fields of science, including biomedical research. However, because biomedical research has long-standing implications for many aspects of human health and medicine, it is essential to determine the accuracy and reliability of content created using ChatGPT.

About the study

In the present study, researchers obtained 50 abstracts and titles from five high-impact medical journals: Nature Medicine, The Lancet, BMJ, JAMA, and NEJM. These original abstracts served as the control group.

For the test group, ChatGPT was used to generate 50 abstracts based on the same journals and titles. Specifically, the researchers asked ChatGPT to create an abstract for a study with a given title in the style of a given journal, as sketched below.
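
The authors prompted ChatGPT itself, but a request of this kind can also be issued programmatically. Below is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording, and helper function are illustrative assumptions rather than the study's exact procedure.

```python
# Hypothetical sketch: generating a journal-style abstract via the OpenAI
# Python SDK (v1.x). Model choice and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_abstract(title: str, journal: str) -> str:
    """Request an abstract for the given title in the given journal's style."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for the ChatGPT web interface
        messages=[{
            "role": "user",
            "content": (
                f"Please write a scientific abstract for the article '{title}' "
                f"in the style of {journal}."
            ),
        }],
    )
    return response.choices[0].message.content


print(generate_abstract("A hypothetical trial of drug X", "The Lancet"))
```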

The two sets of abstracts were then compared using the GPT-2 Output Detector, which assigns higher scores to text it judges more likely to have been generated by an AI language model. Free and paid plagiarism-checking tools were also used to measure the percentage of matching text in the ChatGPT-generated and original abstracts.
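
As a concrete illustration, the following is a minimal sketch of scoring a passage with the GPT-2 Output Detector through the Hugging Face transformers library, assuming the publicly hosted openai-community/roberta-base-openai-detector checkpoint corresponds to the detector the authors used; the label names and sample text are assumptions for demonstration.

```python
# Minimal sketch: scoring text with the GPT-2 Output Detector, assuming the
# public "openai-community/roberta-base-openai-detector" checkpoint matches
# the detector used in the study.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

# Placeholder abstract text; in the study, each real and generated abstract
# would be scored this way.
abstract = "Background: We conducted a randomized controlled trial of ..."

result = detector(abstract, truncation=True)[0]

# The checkpoint labels text as "Real" (human-written) or "Fake"
# (model-generated), with a probability for the predicted label.
print(f"{result['label']}: {result['score']:.4f}")
```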

Blinded human reviewers were also asked to differentiate between real and ChatGPT-generated abstracts. Each reviewer received 25 abstracts, a mix of original and generated ones, and scored each abstract as either generated or original. Finally, the researchers assessed whether the ChatGPT-generated abstracts adhered to journal formatting guidelines.
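
To make the reported identification rates concrete, here is a toy tally of how blinded reviewer calls translate into per-class accuracy. The review records below are invented for illustration and are not the study's data.

```python
# Toy sketch: computing per-class identification rates from blinded reviews,
# recorded as (true_label, reviewer_call) pairs. All data here is invented.
reviews = [
    ("generated", "generated"),
    ("generated", "original"),   # miss: AI text judged to be human-written
    ("original", "original"),
    ("original", "generated"),   # false alarm: human text judged to be AI
]


def identification_rate(true_label: str) -> float:
    """Fraction of abstracts with this true label that were called correctly."""
    relevant = [(t, call) for t, call in reviews if t == true_label]
    return sum(t == call for t, call in relevant) / len(relevant)


print(f"Generated abstracts flagged: {identification_rate('generated'):.0%}")
print(f"Original abstracts identified: {identification_rate('original'):.0%}")
```

In the study, these rates were about 64% for generated abstracts and 86% for originals, as described below.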

ChatGPT abstracts lack authenticity

The abstracts generated by ChatGPT showed a high probability of being AI-generated, with a median detector score of 99.89%. In contrast, original abstracts had a median score of 0.02%, indicating a very low probability of having been generated by an AI language tool.

Plagiarism-checking tools, however, reported much higher percentage-match scores for the original abstracts: a median similarity score of 100, compared with 27 for the AI-generated abstracts, indicating that the ChatGPT-generated text was largely original.

Blinded human reviewers correctly identified about 64% of the generated abstracts as ChatGPT output. Among the original abstracts, 86% were correctly identified as genuine.

The human reviewers misidentified about 32% of the AI-generated abstracts as original. By contrast, the GPT-2 Output Detector assigned consistently high AI-probability scores across all of the ChatGPT-generated abstracts.

About 14% of the original abstracts were misidentified as ChatGPT-generated, indicating that human reviewers had difficulty reliably distinguishing original scientific writing from AI-generated text. For the abstracts they correctly flagged as ChatGPT-generated, reviewers commented that the writing was vague and superficial, citing giveaways such as alternate spellings of some words and clinical trial registration numbers.

Conclusions

While the AI output detection tools successfully identified ChatGPT-generated abstracts and distinguished them from original abstracts, human reviewers often had difficulty telling the two apart. These findings highlight the utility of AI output detectors for journals and scientific publishers seeking to maintain scientific standards for publications.

Journal reference:
  • Gao, C. A., Howard, F. M., Markov, N. S., et al. (2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. npj Digital Medicine, 6, 75. doi:10.1038/s41746-023-00819-6.
Written by

Dr. Chinta Sidharthan

Chinta Sidharthan is a writer based in Bangalore, India. Her academic background is in evolutionary biology and genetics, and she has extensive experience in scientific research, teaching, science writing, and herpetology. Chinta holds a Ph.D. in evolutionary biology from the Indian Institute of Science and is passionate about science education, writing, animals, wildlife, and conservation. For her doctoral research, she explored the origins and diversification of blindsnakes in India, as a part of which she did extensive fieldwork in the jungles of southern India. She has received the Canadian Governor General’s bronze medal and Bangalore University gold medal for academic excellence and published her research in high-impact journals.

