In a recent study posted to the arXiv preprint* server, researchers developed and validated a large language model (LLM) aimed at generating helpful feedback on scientific papers. Based on the Generative Pre-trained Transformer 4 (GPT-4) framework, the model was designed to accept raw scientific manuscripts in PDF format as input and process them in a way that mirrors the review structure of interdisciplinary scientific journals. The model focuses on four key aspects of the publication review process: 1. novelty and significance, 2. reasons for acceptance, 3. reasons for rejection, and 4. suggestions for improvement.
Study: Can large language models provide useful feedback on research papers? A large-scale empirical analysis. Image Credit: metamorworks / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.
The results of their large-scale systematic analysis show that the model's feedback was comparable to that of human reviewers. A follow-up prospective user study among the scientific community found that more than 50% of the researchers surveyed were happy with the feedback provided, and an extraordinary 82.4% found the GPT-4 feedback more useful than feedback received from at least some human reviewers. Taken together, this work shows that LLMs can complement human feedback during the scientific review process, with LLM feedback proving especially useful at the earlier stages of manuscript preparation.
Background
Historically, peer scientists have contributed substantially to scientific progress by checking research manuscripts for validity, accuracy of interpretation, and clarity of communication; they have also proven essential to the emergence of novel interdisciplinary scientific paradigms through the sharing of ideas and constructive debate. Unfortunately, given the increasingly rapid pace of modern research, the peer review process has become ever more laborious, complex, and resource-intensive.
The past few decades have exacerbated this burden, largely due to the exponential increase in publications and the growing specialization of scientific research fields. The scale of the problem is reflected in estimates that peer review consumes over 100 million researcher hours and over US$2.5 billion annually.
“While a shortage of high-quality feedback presents a fundamental constraint on the sustainable growth of science overall, it also becomes a source of deepening scientific inequalities. Marginalized researchers, especially those from non-elite institutions or resource-limited regions, often face disproportionate challenges in accessing valuable feedback, perpetuating a cycle of systemic scientific inequality.”
These challenges create a pressing need for efficient and scalable mechanisms that can partially ease the pressure faced by researchers, both those publishing and those reviewing. Such mechanisms would help reduce scientists' workload, allowing them to devote their resources to additional projects rather than the publication process itself, or to leisure. Notably, these tools could also help democratize access to high-quality feedback across the research community.
Large language models (LLMs) are deep learning-based machine learning (ML) algorithms that can perform a variety of natural language processing (NLP) tasks. A subset of these use Transformer-based architectures characterized by self-attention, a mechanism that differentially weights the significance of each part of the input data (including the model's own previously generated output). These models are trained on extensive raw data and are used primarily in NLP and computer vision (CV). In recent years, LLMs have increasingly been explored as tools for paper screening, checklist verification, and error identification. However, their merits and demerits, as well as the risks associated with their autonomous use in scientific publishing, remain largely untested.
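As a toy illustration of the self-attention mechanism mentioned above (not the architecture used in the study), the NumPy sketch below shows how scaled dot-product attention lets each token of an input weight every other token; the dimensions and random matrices are arbitrary placeholders.

```python
import numpy as np

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Toy single-head self-attention; x has shape (tokens, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project input into queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ v                               # weighted mix of value vectors

rng = np.random.default_rng(0)
tokens, d = 4, 8                                     # 4 toy "tokens", 8-dimensional embeddings
x = rng.normal(size=(tokens, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(scaled_dot_product_self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```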
About the study
In the present study, researchers aimed to develop and test an LLM based on the Generative Pre-trained Transformer 4 (GPT-4) framework as a means of automating the scientific review process. Their model focuses on key aspects, including the significance and novelty of the research under review, potential reasons for acceptance or rejection of a manuscript for publication, and suggestions for research/manuscript improvement. They combined a retrospective analysis with a prospective user study to train and subsequently validate their model, the latter of which gathered feedback from scientists across various fields of research.
Data for the retrospective study was collected from 15 journals under the Nature group umbrella. Papers were sourced between January 1, 2022, and June 17, 2023, and included 3,096 manuscripts comprising 8,745 individual reviews. Data was additionally collected from the International Conference on Learning Representations (ICLR), a machine-learning conference that employs an open review policy allowing researchers to access both accepted and, notably, rejected manuscripts. For this work, the ICLR dataset comprised 1,709 manuscripts and 6,506 reviews, retrieved and compiled using the OpenReview API.
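As an illustration only, the snippet below sketches how ICLR submissions and their attached reviews might be pulled with the openreview-py client. The invitation strings, year, and filtering logic are assumptions for this example and are not taken from the study; the exact queries vary by conference year and API version.

```python
# A minimal sketch, assuming the public openreview-py (API v1) client; invitation names
# are illustrative and are not the authors' exact queries.
import openreview

client = openreview.Client(baseurl="https://api.openreview.net")

# Page through submissions for a hypothetical venue/invitation string.
submissions, offset = [], 0
while True:
    batch = client.get_notes(
        invitation="ICLR.cc/2023/Conference/-/Blind_Submission",
        offset=offset,
        limit=1000,
    )
    if not batch:
        break
    submissions.extend(batch)
    offset += len(batch)

# Official reviews live in each submission's discussion forum.
reviews = {
    note.id: [
        reply for reply in client.get_notes(forum=note.forum)
        if reply.invitation.endswith("Official_Review")
    ]
    for note in submissions[:5]  # small slice for illustration
}

print(f"{len(submissions)} manuscripts retrieved")
```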
Model development built upon OpenAI's GPT-4 framework: manuscripts were supplied in PDF format and parsed using the ML-based ScienceBeam PDF parser. Since GPT-4 constrains input to a maximum of 8,192 tokens, only the first roughly 6,500 tokens of each parsed manuscript (title, abstract, and the initial sections of the paper) were used for downstream analyses. This budget exceeds the average length of an ICLR paper (5,841.46 tokens) and covers approximately half of an average Nature-family paper (12,444.06 tokens). GPT-4 was prompted to provide feedback on each analyzed paper in a single pass.
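The sketch below illustrates the general idea of a single-pass review prompt under that token budget. It assumes the manuscript text has already been extracted (for example, by a ScienceBeam-style PDF parser) and uses the OpenAI Python SDK and tiktoken; the prompt wording is simplified and is not the authors' exact prompt.

```python
# A minimal sketch, not the authors' exact pipeline: assumes the manuscript text is
# already extracted from the PDF and that the OpenAI Python SDK and tiktoken are installed.
import tiktoken
from openai import OpenAI

MAX_PAPER_TOKENS = 6500  # manuscript token budget reported in the study

def truncate_to_budget(text: str, budget: int = MAX_PAPER_TOKENS) -> str:
    enc = tiktoken.encoding_for_model("gpt-4")
    return enc.decode(enc.encode(text)[:budget])

def review_paper(paper_text: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "You are reviewing a scientific manuscript. Provide structured feedback under "
        "four headings: 1) significance and novelty, 2) potential reasons for acceptance, "
        "3) potential reasons for rejection, 4) suggestions for improvement.\n\n"
        f"Manuscript:\n{truncate_to_budget(paper_text)}"
    )
    # Single pass: the entire review is generated from one request.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```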
Researchers developed a two-stage comment-matching pipeline to investigate the overlap between feedback from the model and from human reviewers. Stage one used an extractive text summarization approach, in which a JavaScript Object Notation (JSON) output was generated listing the specific key points of criticism raised in each review. Stage two employed semantic text matching, in which the JSON outputs from the model and from human reviewers were compared to identify shared points.
“Given that our preliminary experiments showed GPT-4’s matching to be lenient, we introduced a similarity rating mechanism. In addition to identifying corresponding pairs of matched comments, GPT-4 was also tasked with self-assessing match similarities on a scale from 5 to 10. We observed that matches graded as “5. Somewhat Related” or “6. Moderately Related” introduced variability that did not always align with human evaluations. Therefore, we only retained matches ranked “7. Strongly Related” or above for subsequent analyses.”
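As a rough illustration of that retention rule (not the authors' code), the sketch below shows how extracted comment lists from the LLM and a human reviewer might be matched and then filtered at the "Strongly Related" threshold; the prompt wording and JSON schema here are assumptions.

```python
# A minimal sketch of the semantic matching stage, with an assumed JSON structure;
# the prompt text and schema are illustrative, not the study's exact implementation.
import json
from openai import OpenAI

SIMILARITY_THRESHOLD = 7  # keep only "Strongly Related" (7) or higher, as in the study

def match_comments(llm_comments: list[str], human_comments: list[str]) -> list[dict]:
    client = OpenAI()
    prompt = (
        "Match semantically corresponding comments between the two lists below. "
        "Return a JSON array of objects with keys 'llm_comment', 'human_comment', "
        "and 'similarity' (an integer from 5 to 10).\n\n"
        f"LLM comments: {json.dumps(llm_comments)}\n"
        f"Human comments: {json.dumps(human_comments)}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    matches = json.loads(response.choices[0].message.content)
    # Discard lenient matches: only ratings of 7 ("Strongly Related") or above survive.
    return [m for m in matches if m["similarity"] >= SIMILARITY_THRESHOLD]
```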
Results were validated manually: 639 randomly selected reviews (150 LLM-generated and 489 human-written) were annotated for true positives (accurately identified key points), false negatives (missed key comments), and false positives (split or incorrectly extracted comments) in GPT-4's matching pipeline. Review shuffling, in which LLM feedback was first shuffled across papers and then compared for overlap with human-authored feedback, was subsequently employed for specificity analyses.
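The shuffling control can be pictured with the short sketch below; it assumes a hypothetical `hit_rate` overlap function and is not the authors' implementation.

```python
# A minimal sketch of the shuffling control; `hit_rate` is a hypothetical function
# that scores the overlap between one LLM review and one human review of the same paper.
import random

def shuffled_overlap(llm_reviews: list, human_reviews: list, hit_rate, seed: int = 0) -> float:
    """Pair each human review with an LLM review drawn at random across papers."""
    shuffled = llm_reviews[:]
    random.Random(seed).shuffle(shuffled)
    rates = [hit_rate(llm, human) for llm, human in zip(shuffled, human_reviews)]
    return sum(rates) / len(rates)

# If the paper-matched hit rate clearly exceeds this shuffled baseline, the LLM feedback
# is paper-specific rather than generic boilerplate.
```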
For the retrospective analyses, pairwise overlap metrics comparing GPT-4 vs. human and human vs. human feedback were generated, with hit rates controlled for the number of comments per paper to reduce bias. Finally, a prospective user study was conducted to confirm the validation results from the model training and analyses described above. A Gradio demo of the GPT-4 model was launched online, and scientists were encouraged to upload drafts of their ongoing manuscripts, after which an LLM-generated review was delivered to the uploader's email.
Users were then asked to provide feedback via a six-page survey, which collected data on the author's background, their previous experience of the review process, general impressions of the LLM review, a detailed evaluation of the LLM's performance, and comparisons with any human reviewers who had also assessed the draft.
Study findings
The retrospective evaluation yielded an F1 accuracy score of 96.8% for comment extraction, indicating that the GPT-4 pipeline identified and extracted almost all relevant critiques raised by reviewers in the datasets used in this project. Matching between GPT-4-generated and human suggestions was similarly accurate, at 82.4%. Analysis of the LLM feedback revealed that 57.55% of the comments raised by the GPT-4 model were also raised by at least one human reviewer, indicating considerable overlap between the model and human reviewers and highlighting the usefulness of the approach even at this early stage of development.
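For context on the F1 figure above, the short worked example below shows how precision, recall, and F1 follow from true-positive, false-positive, and false-negative counts like those annotated during manual validation; the counts in the example are made up.

```python
# Worked example with made-up counts -- only the formulas mirror the evaluation described above.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 95 correctly extracted key points, 2 spurious extractions, 3 missed comments
print(round(f1_score(tp=95, fp=2, fn=3), 3))  # ~0.974
```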
Pairwise overlap analyses showed that the model slightly outperformed humans in terms of independent reviewers identifying identical points of concern or improvement in a manuscript (LLM vs. human: 30.85%; human vs. human: 28.58%), further supporting the model's accuracy and reliability. The shuffling experiments showed that the LLM did not generate generic feedback: its comments were paper-specific and tailored to each manuscript, underscoring its ability to deliver individualized feedback and save the user time.
The prospective user study and its accompanying survey showed that more than 70% of researchers found at least a partial overlap between the LLM feedback and what they would expect from human reviewers, with 35% of these reporting substantial alignment. Overall, the model's performance was rated favorably: 32.9% of survey respondents found its feedback non-generic, and 14% found its suggestions more relevant than they would expect from human reviewers.
More than half (50.3%) of respondents considered the LLM feedback useful, with many remarking that the GPT-4 model provided novel yet relevant feedback that human reviews had missed. Only 17.5% of researchers considered the model inferior to human feedback. Most notably, 50.5% of respondents said they would reuse the GPT-4 model in the future, prior to submitting a manuscript to a journal, underscoring the model's value and the case for further developing similar automation tools to improve researchers' quality of life.
Conclusion
In the present work, researchers developed and trained an ML model based on the GPT-4 transformer architecture to partially automate the scientific review process and complement the existing manual publication pipeline. The model was able to match, and in some respects exceed, scientific experts in providing relevant, non-generic feedback to prospective authors. This and similar automation tools may, in the future, significantly reduce the workload and pressure facing researchers, who are expected not only to conduct their own scientific projects but also to peer review others' work and respond to reviewers' comments on their own. While not intended to replace human input outright, such models could complement existing systems within the scientific process, improving the efficiency of publication and narrowing the gap between marginalized and 'elite' scientists, thereby helping to democratize science in the years to come.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.
Journal reference:
- Preliminary scientific report.
Liang, W., Zhang, Y., Cao, H., Wang, B., Ding, D., Yang, X., Vodrahalli, K., He, S., Smith, D., Yin, Y., McFarland, D., & Zou, J. (2023). Can large language models provide useful feedback on research papers? A large-scale empirical analysis. arXiv:2310.01783. DOI: https://doi.org/10.48550/arXiv.2310.01783, https://arxiv.org/abs/2310.01783